#StackBounty: #estimation #inference #fisher-information #efficiency Deriving C-R inequality from H-C-R bound

Bounty: 50

As mentioned in the title, I want to derive the Cramér-Rao lower bound from the Hammersley-Chapman-Robbins lower bound for the variance of a statistic $T$.
The statement of the H-C-R lower bound is the following,

Let $\mathbf{X} \sim f_{\theta}(\cdot)$ where $\theta \in \Theta \subseteq \mathbb{R}^k$. Suppose $T(\mathbf{X})$ is an unbiased estimator of $\tau(\theta)$, where $\tau \colon \Theta \to \mathbb{R}$. Then we have,
\begin{equation}
\text{Var}_{\theta}(T) \ge \sup_{\Delta \in \mathcal{H}_{\theta}} \frac{[\tau(\theta + \Delta) - \tau(\theta)]^2}{\mathbb{E}_{\theta}\left(\frac{f_{\theta + \Delta}}{f_{\theta}} - 1\right)^2}
\end{equation}

where $\mathcal{H}_{\theta} = \{\alpha \in \Theta \colon \text{support of } f \text{ at } \theta + \alpha \subseteq \text{support of } f \text{ at } \theta\}$.

Now when $k = 1$ and the regularity conditions hold, taking $\Delta \to 0$ gives the following inequality,
\begin{equation}
\text{Var}_{\theta}(T) \ge \frac{[\tau'(\theta)]^2}{\mathbb{E}_{\theta} \left( \frac{\partial}{\partial \theta} \log f_{\theta}(\mathbf{X}) \right)^2}
\end{equation}

which is exactly the C-R inequality in the univariate case.

However, I want to derive the general form of the C-R inequality from the H-C-R bound, i.e. when $k > 1$, but I have not been able to do it. I did figure out that we would have to use $\mathbf{0} \in \mathbb{R}^k$ instead of $0$ and work with $\|\Delta\|$ to obtain the derivatives, which was obvious anyway, but I couldn't get to any expression remotely similar to the C-R inequality. One difficulty arises when dealing with the squares: in the univariate case we can take the limit inside and as a result obtain the square of the derivative, whereas in the multivariate case we cannot, because the derivative is then a vector and the expression would contain the square of a vector, which makes no sense.

How does one derive the C-R inequality in the multivariate case?
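For what it's worth, here is a sketch of the standard directional-limit argument, assuming the usual regularity conditions (differentiability under the integral sign and a nonsingular Fisher information matrix $I(\theta)$); I am not claiming this is the only route. Take $\Delta = h\,a$ for a fixed direction $a \in \mathbb{R}^k$ and let $h \to 0$:
\begin{equation}
\text{Var}_{\theta}(T) \ge \lim_{h \to 0} \frac{[\tau(\theta + h a) - \tau(\theta)]^2}{\mathbb{E}_{\theta}\left(\frac{f_{\theta + h a}}{f_{\theta}} - 1\right)^2} = \frac{[a^{\top} \nabla\tau(\theta)]^2}{a^{\top} I(\theta)\, a},
\end{equation}
since the numerator is $h^2 [a^{\top} \nabla\tau(\theta)]^2 + o(h^2)$ and the denominator is $h^2\, a^{\top} I(\theta)\, a + o(h^2)$. This holds for every direction $a$, and maximizing over $a$ (Cauchy-Schwarz in the inner product induced by $I(\theta)$) gives
\begin{equation}
\text{Var}_{\theta}(T) \ge \sup_{a \neq 0} \frac{[a^{\top} \nabla\tau(\theta)]^2}{a^{\top} I(\theta)\, a} = \nabla\tau(\theta)^{\top} I(\theta)^{-1} \nabla\tau(\theta),
\end{equation}
which is the multivariate C-R inequality.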


Get this bounty!!!

#StackBounty: #bayesian #estimation #inference #prevalence Optimization of pool size and number of tests for prevalence estimation via …

Bounty: 100

I’m trying to devise a protocol for pooling lab tests from a cohort in order to get prevalence estimates using as few reagents as possible.

Assuming perfect sensitivity and specificity (including them in the answer would be a plus), if I group testing material into pools of size $s$, then given an underlying (I don't like the term "real") mean disease probability $p$, the probability of a pool being positive is:

$$p_w = 1 - (1 - p)^s$$

if I run $w$ such pools, the probability of having $k$ positive wells given a certain prevalence is:

$$p(k \mid w, p) = \binom{w}{k} \left(1 - (1 - p)^s\right)^k (1 - p)^{s(w-k)}$$

that is, $k \sim \mathrm{Binom}(w, 1 - (1 - p)^s)$.

To get $p$ I just need to maximize the likelihood $p(k \mid w, p)$ or use the formula $\hat p = 1 - \sqrt[s]{1 - k/w}$ (not really sure about this second one…).

My question is: how do I optimize $s$ (maximize) and $w$ (minimize) according to a prior guess of $p$ in order to get the most precise estimate, i.e. keep the error below a certain level?
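For what it's worth, the closed form above is indeed the MLE: maximizing the binomial likelihood gives $\hat q = k/w$ and, by invariance, $\hat p = 1 - (1 - k/w)^{1/s}$ (for $0 < k < w$). Below is a minimal sketch of one way to pick $s$ and $w$, assuming a single fixed planning value of $p$ and using the delta-method (Fisher-information) variance of $\hat p$; the grid limits and the target standard error are illustrative assumptions, not part of the question.

```python
import numpy as np

def se_phat(p, s, w):
    """Delta-method standard error of p_hat = 1 - (1 - k/w)**(1/s),
    where k ~ Binom(w, q) and q = 1 - (1 - p)**s."""
    q = 1 - (1 - p) ** s
    return np.sqrt(q * (1 - q) / (w * s**2 * (1 - p) ** (2 * (s - 1))))

def min_pools(p_planning, target_se, s_max=50, w_max=10_000):
    """For each pool size s <= s_max, find the smallest number of pools w whose
    (approximate) SE meets the target; return the overall best (s, w).
    s_max, w_max and the SE target are illustrative assumptions."""
    best = None
    for s in range(1, s_max + 1):
        for w in range(1, w_max + 1):
            if se_phat(p_planning, s, w) <= target_se:
                if best is None or w < best[1]:
                    best = (s, w)
                break  # SE decreases in w, so the first feasible w is minimal
    return best

# Example: planning prevalence 2%, target SE of 0.5 percentage points.
print(min_pools(p_planning=0.02, target_se=0.005))
```

In a Bayesian version one would average this criterion over the prior for $p$ rather than plugging in a single planning value.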


Get this bounty!!!

#StackBounty: #regression #hypothesis-testing #multiple-regression #estimation #linear-model Multiple Linear Regression Coefficient Est…

Bounty: 100

A multiple linear regression model is considered. It is assumed that $$Y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \epsilon_i$$ where the $\epsilon$'s are independent and have the same normal distribution with zero expectation and unknown variance $\sigma^2$. 100 measurements are made, i.e. $i = 1, 2, \ldots, 100$. The explanatory variables take the following values: $x_{i1} = 2$ for $1 \leq i \leq 25$ and $0$ otherwise, $x_{i2} = \sqrt{2}$ for $26 \leq i \leq 75$ and $0$ otherwise, $x_{i3} = 2$ for $76 \leq i \leq 100$ and $0$ otherwise.

a) Let $\hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3$ be the least squares estimators of $\beta_1, \beta_2, \beta_3$. Prove that in the considered case $\hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3$ are independent and $$\mathrm{Var}(\hat{\beta}_1) = \mathrm{Var}(\hat{\beta}_2) = \mathrm{Var}(\hat{\beta}_3).$$ Do these properties hold in the general case? If not, give counterexamples.

b) Perform a test of $$H_0\colon \beta_1 + \beta_3 = 2\beta_2 \quad \text{vs.} \quad H_1\colon \beta_1 + \beta_3 \neq 2\beta_2$$ at significance level 0.05. The least squares estimates of $\beta_1, \beta_2$ and $\beta_3$ are $0.9812$, $1.8851$ and $3.4406$, respectively. The unbiased estimate of the variance $\sigma^2$ is $3.27$.

For a) I know the OLS estimator is $\hat{\beta} = (X^TX)^{-1}X^Ty$ and $\mathrm{Var}(\hat{\beta}) = \sigma^2 (X^TX)^{-1}$, but I don't know how to obtain explicit expressions for the individual coefficients from this. It seems intuitively clear that the estimators are independent, for instance $P(\hat{\beta}_3 = \beta_3, \hat{\beta}_1 = 0, \hat{\beta}_2 = 0) = P(\hat{\beta}_3 = \beta_3)$, but I don't know how to write a proper proof. I believe that in general the estimators are dependent and have unequal variances, but I can't come up with particular counterexamples.

For b) I'm not sure which test statistic to use (t or F) or how to set it up. I also don't know the standard errors of the coefficients.
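A minimal numerical sketch (not a proof) touching both parts, based only on the design described in the problem: it builds $X$, checks that $X^TX$ is diagonal (which is what makes the estimators uncorrelated, hence independent under normality, with equal variances), and computes a t statistic for the contrast $\beta_1 + \beta_3 - 2\beta_2 = 0$ with the stated estimates. Using a t test with $100 - 3 = 97$ degrees of freedom is my reading of part b), not something stated in it.

```python
import numpy as np
from scipy import stats

# Design matrix from the problem: 100 rows, 3 columns.
X = np.zeros((100, 3))
X[:25, 0] = 2.0             # x_{i1} = 2 for i = 1..25
X[25:75, 1] = np.sqrt(2.0)  # x_{i2} = sqrt(2) for i = 26..75
X[75:, 2] = 2.0             # x_{i3} = 2 for i = 76..100

# Part a): the columns have disjoint support, so X^T X is diagonal, in fact 100 * I.
# Hence Var(beta_hat) = sigma^2 (X^T X)^{-1} = (sigma^2 / 100) * I:
# uncorrelated (thus independent under normality) with equal variances.
print(X.T @ X)   # expect diag(100, 100, 100)

# Part b): test H0: beta1 + beta3 - 2*beta2 = 0 via the contrast c^T beta_hat.
beta_hat = np.array([0.9812, 1.8851, 3.4406])
sigma2_hat = 3.27
c = np.array([1.0, -2.0, 1.0])
XtX_inv = np.linalg.inv(X.T @ X)

estimate = c @ beta_hat                      # 0.9812 - 2*1.8851 + 3.4406
se = np.sqrt(sigma2_hat * (c @ XtX_inv @ c)) # sqrt(3.27 * 6 / 100)
t_stat = estimate / se
p_value = 2 * stats.t.sf(abs(t_stat), df=100 - 3)
print(t_stat, p_value)   # roughly t = 1.47, p = 0.14 -> fail to reject at 0.05
```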


Get this bounty!!!

#StackBounty: #time-series #estimation #stationarity #garch #estimators Moving estimators for nonstationary time series, like loglikeli…

Bounty: 50

While in standard ("static") estimation, e.g. ML estimation, we assume that all values come from a distribution with the same parameters, in practice we often have nonstationary time series, in which these parameters can evolve in time.

This is usually handled with sophisticated models such as GARCH, which conditions $\sigma$ on recent errors and sigmas, or Kalman filters; both assume some arbitrary hidden mechanism.

I have recently worked on a simpler and more agnostic way: use a moving estimator, e.g. a log-likelihood with exponentially decaying weights on past values:
$$\theta_T = \operatorname{argmax}_\theta l_T \qquad \textrm{for} \qquad l_T = \sum_{t<T} \eta^{T-t} \ln(\rho_\theta(x_t))$$
intended to estimate local parameters separately at each position. We don't assume any hidden mechanism, we only shift the estimator.

For example, it turns out that for the EPD (exponential power distribution) family $\rho(x) \propto \exp(-|x|^\kappa)$, which covers the Gaussian ($\kappa=2$) and Laplace ($\kappa=1$) distributions, such a moving estimator can be computed cheaply (plots below), giving much better log-likelihood for daily log-returns of Dow Jones companies (100 years of DJIA, 10 years of individual stocks), even exceeding GARCH: https://arxiv.org/pdf/2003.02149 . It just uses the update $(\sigma_{T+1})^\kappa = \eta (\sigma_{T})^\kappa + (1-\eta)|x-\mu|^\kappa$, replacing the estimator as an average with a moving estimator as an exponential moving average:

[Plots of the moving EPD estimator on DJIA log-returns omitted; see the linked arXiv paper.]
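A minimal sketch of this update, assuming $\mu$ is known and fixed (set to $0$ below) and using illustrative values of $\eta$ and $\kappa$; it simply applies the exponential moving average of $|x - \mu|^\kappa$ described above.

```python
import numpy as np

def moving_sigma(x, kappa=2.0, eta=0.95, mu=0.0):
    """Moving estimator of the EPD scale: exponential moving average of
    |x_t - mu|^kappa, i.e. (sigma_{T+1})^kappa = eta * (sigma_T)^kappa
    + (1 - eta) * |x_T - mu|^kappa. Returns sigma_T for each position T."""
    x = np.asarray(x, dtype=float)
    # Initialising from the full-sample average is just a convenience for the sketch.
    s_kappa = np.mean(np.abs(x - mu) ** kappa)
    sigma = np.empty_like(x)
    for t, xt in enumerate(x):
        sigma[t] = s_kappa ** (1.0 / kappa)    # estimate available at time t
        s_kappa = eta * s_kappa + (1 - eta) * abs(xt - mu) ** kappa
    return sigma

# Toy usage: a variance regime change mid-series shows up in the moving estimate.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 500), rng.normal(0, 3, 500)])
sigma = moving_sigma(x, kappa=2.0, eta=0.95)
print(sigma[:3].round(3), sigma[-3:].round(3))
```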

I also have an MSE moving estimator for adaptive least-squares linear regression (page 4 of https://arxiv.org/pdf/1906.03238 ), which can be used to get adaptive AR models without a Kalman filter, and an analogous approach for adaptive estimation of joint distributions with polynomials: https://arxiv.org/pdf/1807.04119

Are such moving estimators considered in the literature?

What applications might they be useful for?


Get this bounty!!!

#StackBounty: #estimation #markov-process #transition-matrix Estimating model for transition probabilities of a Markov Chain

Bounty: 100

Suppose that I have a Markov chain with $S$ states evolving over time. I have $S^2 \times T$ values of the transition matrix, where $T$ is the number of time periods. I also have $K$ matrices $X$ of $T \times S$ values of (independent) variables, where $K$ is the number of variables that I can use to explain the transition probabilities (the $p_{ij}$ are my dependent variables and the matrices $X_k$ are the independent variables).

Remember that $\sum_j p_{ij} = 1$ for each $t$.

In the end, I am looking for panel models to explain the transition probabilities, where the parameters are constant over time and (with the possible exception of the constant) are also the same across the different transition probabilities.

Just to be clear, consider the following example. Imagine that an animal prefers to stay in places where there are food and water. Let the $T \times S$ matrix $X_F$ be the matrix giving the amount of food in each place $s \in S$ at each time $t \in T$, and let $X_W$ be the matrix giving the amount of water in each place $s \in S$ at each time $t \in T$.

I want to use $X_F$ and $X_W$ to explain the transition probabilities. I have the values of the transition probabilities over time and I want to use these matrices to explain them.

I think I can design a kind of fixed-effects logit model for each state in $S$; however, I would then have to estimate $S$ separate logit models. I believe that the probabilities $p_{ij}$ and $p_{ji}$ should not be estimated in different models, since they seem to be related.
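As an illustration only (the parameterization and names here are my assumptions, not anything from the question): one could model each row of the transition matrix with a multinomial logit that shares the covariate coefficients across all transitions, $p_{ij,t} \propto \exp(\alpha_{ij} + \beta_F X_F[t, j] + \beta_W X_W[t, j])$, and fit it to the observed probabilities, e.g. by least squares. A rough sketch:

```python
import numpy as np
from scipy.optimize import minimize

def predicted_P(params, X_F, X_W, S):
    """Row-wise multinomial logit: p_{ij,t} proportional to
    exp(alpha_{ij} + beta_F * X_F[t, j] + beta_W * X_W[t, j])."""
    alpha = params[:S * S].reshape(S, S)
    beta_F, beta_W = params[S * S], params[S * S + 1]
    T = X_F.shape[0]
    P = np.empty((T, S, S))
    for t in range(T):
        logits = alpha + beta_F * X_F[t] + beta_W * X_W[t]   # broadcast over rows i
        expl = np.exp(logits - logits.max(axis=1, keepdims=True))
        P[t] = expl / expl.sum(axis=1, keepdims=True)        # rows sum to 1
    return P

def fit(P_obs, X_F, X_W):
    """Least-squares fit of (alpha, beta_F, beta_W) to observed T x S x S probabilities."""
    S = P_obs.shape[1]
    def loss(params):
        return np.sum((predicted_P(params, X_F, X_W, S) - P_obs) ** 2)
    return minimize(loss, np.zeros(S * S + 2), method="L-BFGS-B")

# Toy usage with random data, just to show the shapes involved.
T, S = 20, 3
rng = np.random.default_rng(1)
X_F, X_W = rng.random((T, S)), rng.random((T, S))
P_obs = np.full((T, S, S), 1.0 / S)
print(fit(P_obs, X_F, X_W).x[-2:])   # fitted beta_F, beta_W
```

This keeps $\beta_F$ and $\beta_W$ constant over time and across transitions, with only the intercepts $\alpha_{ij}$ transition-specific, which matches the panel structure described above.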

Any hints? Are there solutions in the literature for this kind of problem?


Get this bounty!!!

#StackBounty: #estimation #bias #central-limit-theorem Estimation and hypothesis testing for the difference in squared bias for two ran…

Bounty: 50

My Question:

Let $X_t$ and $Y_t$ denote two time-series random variables, both of which are estimates of the random variable $\theta_t$. Let $U_t = X_t - \theta_t$ and $V_t = Y_t - \theta_t$. The bias of each variable is thus $\mathbb{E} U_t$ and $\mathbb{E} V_t$. I'm interested in estimation and hypothesis testing on the difference in squared bias, that is:
\begin{equation}
\gamma = (\mathbb{E} U_t)^2 - (\mathbb{E} V_t)^2
\end{equation}

I thought this would be a fairly common problem, but some googling has not revealed much at all. I'm interested in any suggestions users have for estimating $\gamma$, and conducting hypothesis tests.

My Current Approach:

Using the difference of squares factorization, we get:
\begin{equation}
\gamma = (\mathbb{E} U_t - \mathbb{E} V_t)(\mathbb{E} U_t + \mathbb{E} V_t)
\end{equation}

Let $\bar{U} = \frac{1}{T} \sum_{t=1}^T U_t$, and define $\bar{V}$ analogously; then we have the estimator:
\begin{equation}
\hat{\gamma} = (\bar{U} - \bar{V})(\bar{U} + \bar{V})
\end{equation}

Assuming that $U_t$ and $V_t$ obey the minimal conditions for a central limit theorem (CLT), e.g. bounded moments and weak dependence, the sample means converge in probability to the corresponding moments, and so using Slutsky's theorem $\hat{\gamma} \rightarrow \gamma$ (in probability).

I think a CLT also applies. Multiplying by $\sqrt{T}$ gives $\sqrt{T} (\bar{U} - \bar{V})(\bar{U} + \bar{V})$. The first factor $\sqrt{T} (\bar{U} - \bar{V})$ will obey a CLT (when centred and scaled by an appropriate standard-deviation estimator), and if the second factor $\bar{U} + \bar{V}$ converges in probability to a non-zero value, then by an application of Cramér's theorem $\hat{\gamma}$ is asymptotically normal.
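A minimal sketch of a delta-method version of this idea (my construction, not something from the question), assuming for simplicity that the pairs $(U_t, V_t)$ are observable and i.i.d.; with weak dependence the covariance matrix below would need a HAC or bootstrap replacement. It uses $\gamma = g(\mu_U, \mu_V) = \mu_U^2 - \mu_V^2$ with gradient $(2\mu_U, -2\mu_V)$.

```python
import numpy as np
from scipy import stats

def gamma_test(U, V):
    """Delta-method estimate and test of gamma = (E U)^2 - (E V)^2,
    assuming (U_t, V_t) are i.i.d. observations (an assumption of this sketch)."""
    U, V = np.asarray(U, float), np.asarray(V, float)
    T = len(U)
    mU, mV = U.mean(), V.mean()
    gamma_hat = mU**2 - mV**2
    grad = np.array([2 * mU, -2 * mV])        # gradient of g at the sample means
    Sigma = np.cov(np.vstack([U, V]))         # 2x2 sample covariance of (U, V)
    se = np.sqrt(grad @ Sigma @ grad / T)     # delta-method standard error
    z = gamma_hat / se
    p = 2 * stats.norm.sf(abs(z))             # test of H0: gamma = 0
    return gamma_hat, se, p

# Toy usage: X_t slightly more biased than Y_t.
rng = np.random.default_rng(0)
theta = rng.normal(size=2000)
U = (theta + 0.3 + rng.normal(0, 1, 2000)) - theta   # bias 0.3
V = (theta + 0.1 + rng.normal(0, 1, 2000)) - theta   # bias 0.1
print(gamma_test(U, V))
```

As the next paragraph notes, this normal approximation degrades when both biases are near zero, because the gradient then vanishes and $\hat{\gamma}$ behaves more like a quadratic form than a normal variable.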

This solution is not totally satisfactory. In practice, the requirement that $\bar{U} + \bar{V}$ converge to a non-zero value can be an issue. In simulations, if this expression converges to something close to zero, the simulated distribution looks distinctly too peaked to be normal. And this is a situation that could easily arise if $X_t$ and $Y_t$ are both close to unbiased.

So, any ideas on how I can improve this estimator?


Get this bounty!!!


#StackBounty: #mathematical-statistics #estimation The difference of normal means is also minimax?

Bounty: 50

Let $X_i \sim N(\xi, \sigma^2)$ and $Y_i \sim N(\eta, \tau^2)$, $i = 1, \ldots, n$, for known $\sigma^2$ and $\tau^2$.

I know that $\bar{X}$ and $\bar{Y}$ are minimax under squared error loss, since their variance is constant and a sequence of Bayes estimators can be constructed whose Bayes risks converge to the maximum risk of $\bar{X}$ and $\bar{Y}$.

I am wondering how I can show $\delta(X,Y) = \bar{Y} - \bar{X}$ is also minimax for $\eta - \xi$?

Essentially I need to show that for any $T(X,Y)$, we have

$$\sup_{\xi,\eta} E\big(T(X,Y)-(\eta-\xi)\big)^2 \geq \sup_{\xi,\eta} E\big((\bar{Y}-\bar{X})-(\eta-\xi)\big)^2 = \frac{\sigma^2+\tau^2}{n}$$

The only thing I can think of is constructing another sequence of priors whose Bayes risks converge to the RHS. However, this seems kind of tedious now that we're in two dimensions. I feel like there's a "trick" here that I should be using?
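In case it helps, here is a sketch of how the one-dimensional prior-sequence argument might extend, assuming independent $N(0, b^2)$ priors on $\xi$ and $\eta$ (this is my reading of the standard approach, not a verified proof). The posterior means are
$$\hat{\xi}_b = \frac{n b^2}{n b^2 + \sigma^2}\,\bar{X}, \qquad \hat{\eta}_b = \frac{n b^2}{n b^2 + \tau^2}\,\bar{Y},$$
so the Bayes estimator of $\eta - \xi$ under squared error loss is $\hat{\eta}_b - \hat{\xi}_b$, with Bayes risk equal to the (constant) posterior variance
$$r_b = \frac{b^2 \sigma^2}{n b^2 + \sigma^2} + \frac{b^2 \tau^2}{n b^2 + \tau^2} \longrightarrow \frac{\sigma^2 + \tau^2}{n} \quad \text{as } b \to \infty.$$
Since every estimator's maximum risk is at least the Bayes risk $r_b$ for every $b$, letting $b \to \infty$ gives the required lower bound, matching the maximum risk of $\bar{Y} - \bar{X}$.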


Get this bounty!!!

#StackBounty: #estimation #causality #identifiability What does it mean to "non-parametrically" identify a causal effect with…

Bounty: 100

I am wondering, within the context of causal inference, what it means to “non-parametrically” identify a causal effect within the super-population perspective. For example, in Hernan/Robins Causal Inference Book Draft:

https://cdn1.sph.harvard.edu/wp-content/uploads/sites/1268/2019/02/hernanrobins_v1.10.38.pdf

It defines non-parametric identification on pg. 43 and 123 as:

…identification that does not require any modeling assumptions when
the size of the study population is quasi-infinite. By acting as if we
could obtain an unlimited number of individuals for our studies, we
could ignore random fluctuations and could focus our attention on
systematic biases due to confounding, selection, and measurement.
Statisticians have a name for problems in which we can assume the size
of the study population is effectively infinite: identification
problems.

I understand the identification part to mean that under the strong ignorability assumption, there is only ONE way for the observed data to correspond to a causal effect estimand. What confuses me is why we need to assume the size of the study population is quasi-infinite.

For example, the book gives an example of a 20-person study in which each subject is representative of 1 billion identical subjects, so that the hypothetical super-population is viewed as one of 20 billion people. Specifically, on pg. 13 it states that:

… we will assume that counterfactual outcomes are
deterministic and that we have recorded data on every subject in a
very large (perhaps hypothetical) super-population. This is equivalent
to viewing our population of 20 subjects as a population of 20 billion
subjects in which 1 billion subjects are identical to the 1st subject, 1 billion
subjects are identical to the 2nd subject, and so on.

My confusion here is about what it means to assume a single person is representative of 1 billion identical individuals. Is the assumption that each of the 1 billion is identical with respect to outcomes and treatment only, but differs with respect to the covariates? Or is the individual a summary measure of the 1 billion? My instinct is that the notion of the 1 billion captures the fact that we may draw many times without ever running short of samples; i.e., small sample sizes result in more unstable estimates.

Essentially, what is so crucial about assuming there are many identical individuals in the “background”, if they are just going to be the same as a patient you observe? What happens or breaks down if instead of the 1 billion, we only had 2 identical individuals?

Thank you for any insight.


Get this bounty!!!