#StackBounty: #distributions #variance #bootstrap #resampling #transform Variance after resampling uniformly from a sample from a normal distribution

Bounty: 50

I have recently been looking into the bootstrap, and I was wondering: suppose I have a sample $X=\{x_1,x_2,\dots,x_N\}$ of $N$ observations, all i.i.d. from a normal distribution $\mathcal{N}(\mu,\sigma)$.

In the bootstrap, $B$ resamples, $X_b=\{x_{b1},x_{b2},\dots,x_{bN}\}$, are obtained by resampling uniformly with replacement from $X$ (I know that there are variations with bias corrections and accelerations, but let's keep those aside for now).

So, my question is: how can I relate the variance of $x_1$ to that of $x_{b1}$? I have a feeling that they are two different random variables now, given the uniform resampling with $\tfrac{1}{N}$ selection probability.

Here is how I think of it. $x_i$ comes from a continuous distribution, namely a normal one. $x_{bi}$, however, comes from a discrete distribution, namely a uniform selection from $x_1, x_2, \dots, x_N$. I think that the probability space of $x_i$ maps somehow into that of $x_{bi}$, but I do not know what sort of relationship holds between them. I am pretty sure it is some form of bijective mapping $x_i \to x_{bi}$, where $x_{bi}=f(x_i)$, but I cannot think of a way to map "normally distributed numbers to make them uniformly distributed".

My purpose is simply to understand how the resampling process works, how it affects the variance, and whether the variance of $x_{bi}$ can be related to that of $x_i$. I am just interested in how it works.
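
A quick simulation may make the question concrete (a sketch with assumed parameters $\mu=5$, $\sigma=2$; nothing here is specific to those values). It contrasts the two views of $x_{b1}$: conditionally on the observed sample, $x_{b1}$ is uniform on $\{x_1,\dots,x_N\}$, while marginally (averaging over fresh samples) it appears to follow the same distribution as $x_1$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N, B = 5.0, 2.0, 50, 200_000   # assumed toy parameters

# Marginal view: redraw the original sample each time, then take one
# uniform resample from it; the draws follow the marginal law of x_{b1}.
marginal = np.empty(B)
for b in range(B):
    x = rng.normal(mu, sigma, N)
    marginal[b] = rng.choice(x)
print(marginal.var())                # ~ sigma**2 = 4, same as Var(x_1)

# Conditional view: fix one sample; x_{b1} is uniform on its N values.
x = rng.normal(mu, sigma, N)
print(np.mean((x - x.mean()) ** 2))  # Var(x_{b1} | X): the plug-in variance
```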



#StackBounty: #anova #variance #panel-data Disentangling and estimating between-day and between-time variability in a longitudinal study

Bounty: 50

In a longitudinal study, the hand grip strength of subjects is measured. Each subject is measured at 7 time points ($T_1$–$T_7$):

  • $T_1$–$T_6$ are measurements taken on different days and at different but fixed hours (7, 10, 13, 16, 20, and 21 o'clock).
  • The starting time was randomized. For example, one subject’s $T_1$ could be 13 o’clock whereas another subject would start at 16 o’clock. All participants continued from their starting time according to the sequence. Example: A subject who started at 10 would continue with measurements at 13, 16, 20, 21 and 7 hours.
  • $T_7$ is measured at a different day but at the same time as $T_6$. Which time is measured twice is randomized (e.g. for one subject $T_6=T_7=10$ hours, for another $T_6=T_7=21$ hours etc.). This is a direct consequence of randomizing the starting time described above.

At each time, one measurement per subject is taken so that there are 7 data points for each participant.

As I understand it, the measurements at $T_6$ and $T_7$ capture the between-day variability, whereas the measurements at $T_1$–$T_6$ capture the between-day and the between-time variability.

Question: Is it possible to disentangle the between-day and the between-time variability with these data and if so, how?

I thought of using a mixed model or a repeated measures ANOVA but I’m unsure how to set up the model to estimate the two sources of variability separately (if it is possible at all). Thanks for your time.
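
Not an answer, but here is a minimal sketch of one possible mixed-model setup in Python's statsmodels (the data-generating process, column names, and effect sizes below are all invented for illustration): a random subject intercept plus a variance component for hour-of-day, with between-day variability left in the residual, where it is confounded with any measurement error unless the $T_6$/$T_7$ replicate is modeled explicitly.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
hours = [7, 10, 13, 16, 20, 21]
rows = []
for subj in range(60):
    start = rng.integers(6)                          # randomized starting time
    u_subj = rng.normal(0, 2.0)                      # between-subject effect
    u_time = {h: rng.normal(0, 1.0) for h in hours}  # between-time effects
    for occasion in range(7):                        # T1..T7, each on its own day
        h = hours[(start + min(occasion, 5)) % 6]    # T7 repeats T6's hour
        u_day = rng.normal(0, 0.8)                   # between-day effect
        rows.append(dict(subject=subj, hour=h,
                         grip=30 + u_subj + u_time[h] + u_day))
df = pd.DataFrame(rows)

# Random subject intercept + variance component for hour-of-day; the residual
# then absorbs between-day variability (plus any measurement error).
md = smf.mixedlm("grip ~ 1", df, groups="subject",
                 vc_formula={"hour": "0 + C(hour)"})
fit = md.fit()
print(fit.summary())
```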



#StackBounty: #variance #experiment-design #estimators #paired-data A "contradictory" derivation of the variance of two estimators

Bounty: 50

Consider a paired-experiment setting. I have a variable $Z\in\{1,2,\dots,N\}$, based on which I pair the subjects. For each value of $Z$, I have 2 subjects, and I randomly assign one of them to the Treatment and the other to the Control. So the data look like this:

Z  T  C 
1  .  .
2  .  .
...
N  .  .

Let $Y^T$ be the random variable denoting the metric of the response variable when Treatment is given, and likewise for $Y^C$.

Now we could pretend that we didn't conduct the paired experiment and pool the data together (assuming we just collected $2N$ people, randomly assigned half of them to T and the rest to C, and it happens that the variable $Z$ is balanced). Then we could use the estimator
$$\mathcal{E}_1=\frac{1}{|T|}\sum_{i\in T}Y^T_i-\frac{1}{|C|}\sum_{i\in C}Y^C_i=\frac{1}{N}\left(\sum_{i=1}^N Y^T_i-\sum_{i=1}^N Y^C_i\right)$$
to estimate the population treatment effect. Assuming $Y^T$ and $Y^C$ are uncorrelated, we have
$$\mathrm{Var}[\mathcal{E}_1]=\frac{\sigma^2_T}{N}+\frac{\sigma^2_C}{N},$$
where $\mathrm{Var}[Y^T]=\sigma^2_T$ and $\mathrm{Var}[Y^C]=\sigma^2_C$.

Now if we utilize the paired-experiment design, we could use another estimator
$$\mathcal{E}_2=\sum_{k}\frac{|T_k|+|C_k|}{|T|+|C|}\bigg(\frac{1}{|T_k|}\sum_{i\in T_k}Y^T_i-\frac{1}{|C_k|}\sum_{j\in C_k}Y^C_j\bigg)=\frac{1}{N}\sum_{i=1}^N(Y^T_i-Y^C_i),$$
where $k$ ranges over the values that $Z$ takes, and in this example $|T_k|=|C_k|=1~\forall~k$.

Because we assume $Y^T$ and $Y^C$ are uncorrelated, we have
$$\mathrm{Var}[\mathcal{E}_2]=\frac{\sigma^2_T}{N}+\frac{\sigma^2_C}{N}=\mathrm{Var}[\mathcal{E}_1].$$
So in this example, the paired experiment does not help us get an estimator with lower variance. Both $\mathcal{E}_1$ and $\mathcal{E}_2$ are unbiased estimators of the population treatment effect, and in this case they are identical.

However, the following derivation shows that $\mathrm{Var}[\mathcal{E}_2]<\mathrm{Var}[\mathcal{E}_1]$, and I don't know where it goes wrong:
\begin{align*}
&\mathrm{Var}(Y^T)=\sigma^2_T=\mathbb{E}[\mathrm{Var}(Y^T|Z)]+\mathrm{Var}(\mathbb{E}[Y^T|Z])\\
&\mathrm{Var}(Y^C)=\sigma^2_C=\mathbb{E}[\mathrm{Var}(Y^C|Z)]+\mathrm{Var}(\mathbb{E}[Y^C|Z])
\end{align*}
Then, letting $w_k=\frac{|T_k|}{|T|}=\frac{|C_k|}{|C|}=1/N$,
\begin{align*}
\mathrm{Var}(\mathcal{E}_2)&=\sum_{k}\bigg(\frac{|T_k|+|C_k|}{|T|+|C|}\bigg)^2\mathrm{Var}\bigg(\frac{1}{|T_k|}\sum_{i\in T_k}Y^T_i-\frac{1}{|C_k|}\sum_{j\in C_k}Y^C_j\bigg)\\
&=\sum_{k}\bigg(\frac{|T_k|+|C_k|}{|T|+|C|}\bigg)^2\bigg(\frac{1}{|T_k|}\mathrm{Var}(Y^T|Z=k)+\frac{1}{|C_k|}\mathrm{Var}(Y^C|Z=k)\bigg)\\
&=\sum_{k}w_k^2\bigg(\frac{1}{w_k|T|}\mathrm{Var}(Y^T|Z=k)+\frac{1}{w_k|C|}\mathrm{Var}(Y^C|Z=k)\bigg)\\
&=\frac{1}{|T|}\sum_{k}w_k\mathrm{Var}(Y^T|Z=k)+\frac{1}{|C|}\sum_{k}w_k\mathrm{Var}(Y^C|Z=k)\\
&=\frac{1}{|T|}\mathbb{E}[\mathrm{Var}(Y^T|Z)]+\frac{1}{|C|}\mathbb{E}[\mathrm{Var}(Y^C|Z)]<\frac{\sigma^2_T}{|T|}+\frac{\sigma^2_C}{|C|}=\mathrm{Var}(\mathcal{E}_1)
\end{align*}
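
One way to probe the contradiction numerically is a simulation sketch (with an invented data-generating process in which $\mathbb{E}[Y\mid Z]$ genuinely depends on $Z$): simulate the paired design many times and compare the empirical variance of $\frac{1}{N}\sum_i(Y^T_i-Y^C_i)$ with the two expressions above.

```python
import numpy as np

rng = np.random.default_rng(2)
N, reps, tau = 50, 20_000, 1.0           # invented pair count and effect size

est = np.empty(reps)
for r in range(reps):
    z = rng.normal(0, 1, N)              # pair-level covariate
    yT = 2 * z + tau + rng.normal(0, 1, N)   # treated member of each pair
    yC = 2 * z + rng.normal(0, 1, N)         # control member of the same pair
    est[r] = np.mean(yT - yC)            # E_1 and E_2 coincide here

# Marginal variances: Var(Y^T) = Var(Y^C) = 2**2 * 1 + 1 = 5
print(est.var())       # ~ 2/N = 0.04: only the E[Var(.|Z)] terms remain
print((5 + 5) / N)     # ~ 0.2: the (sigma_T^2 + sigma_C^2)/N formula
```

If this is right, what breaks in the paired design is the first formula's implicit independence assumption: $Y^T_i$ and $Y^C_i$ share the same $Z_i$, so $\mathrm{Cov}(Y^T_i, Y^C_i)\ne 0$ and the $\mathrm{Var}(\mathbb{E}[Y\mid Z])$ part cancels in the difference.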



#StackBounty: #correlation #variance #standard-deviation #kurtosis Correlation coefficient with standard deviation

Bounty: 50

I quite often find myself testing hypotheses in which the standard deviation of one (normally distributed) variable is linked to (the mean of) another variable. I would like to be able to express the strength of this association by means of an index in $[-1, 1]$, similar in spirit to a correlation coefficient. I feel like I can't be the first one with this problem, so my first question is: does something like this exist? My second question is whether something I've come up with myself seems reasonable.

To express the problem more precisely, let $Z$ be a normally distributed variable:
$$
Z \sim \mathcal{N}\left(0,\sigma^2\right)
$$
where the standard deviation $\sigma$ is a linear function of some other variables:
$$
\sigma=X\beta+\varepsilon
$$
where $X=\{x_1, x_2, \dots, x_p\}$ is a set of predictor variables, and $\beta$ is a vector of linear coefficients on these predictors. So compared to the familiar linear model, the difference is that we now have a linear prediction for the second, rather than the first, moment of the distribution of $Z$.

Given some observations of $Z$ and $X$, we can find the maximum likelihood estimate of $\beta$, which we'll denote $\hat{\beta}$. Now the question is, how much of the 'variance in variance' of $Z$ is explained by this linear model? This leads me to the idea of using kurtosis. That is, because $Z$ is distributed as a mixture of normals with different SDs and a common mean, it will be leptokurtic and thus have excess kurtosis w.r.t. a normal distribution with constant variance. However, if we divided each observation of $Z$ by its SD (i.e. $\dot{Z}_i=\frac{Z_i}{\sigma_i}$, where $\sigma_i=X_i\beta$), we should be able to reduce its kurtosis (to the point where, if the changes in variance of $Z$ are perfectly predicted by our fitted model, we should be able to get rid of 100% of the excess kurtosis).

So the index I'm proposing (analogous to $R^2$) is:
$$
\xi^2 = 1 - \frac{\left|\mathrm{Kurt}[Z/\hat{\sigma}]-3\right|}{\mathrm{Kurt}[Z]-3}
$$
where $\hat{\sigma}=X\hat{\beta}$. If our model explains no "variance in variance" at all, then the kurtosis should be just as high after we transform $Z$ as before, in which case $\xi^2=0$. If we managed to explain away all the changes in variance, then $Z/\hat{\sigma}$ should be perfectly normally distributed (with kurtosis of 3), and thus $\xi^2=1$.

Does that seem reasonable? Did I just re-invent the wheel (or a dumb version of a wheel)?
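
In case it helps to experiment with the idea, here is a minimal sketch of the proposed index in a single-predictor toy setup (the predictor range and coefficient are invented; `scipy.stats.kurtosis` with `fisher=True` returns excess kurtosis, i.e. $\mathrm{Kurt}-3$):

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(3)
n = 200_000
x = rng.uniform(0.5, 2.0, n)          # single predictor (toy assumption)
sigma = 1.0 * x                       # true SD is exactly linear in x
z = rng.normal(0.0, sigma)            # Z | x ~ N(0, sigma(x)^2)

def xi2(z, sigma_hat):
    """Proposed index: share of excess kurtosis removed by rescaling."""
    k_raw = kurtosis(z, fisher=True)              # Kurt[Z] - 3
    k_std = kurtosis(z / sigma_hat, fisher=True)  # Kurt[Z / sigma_hat] - 3
    return 1 - abs(k_std) / k_raw

print(xi2(z, sigma))          # ~1: the perfect model removes all excess kurtosis
print(xi2(z, np.ones(n)))     # ~0: a constant-SD model explains nothing
```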



#StackBounty: #self-study #variance #bias Find the MSE of a true response and its predicted value using OLS estimation

Bounty: 100

From Theodoridis’ Machine Learning, exercise 3.26.

Consider, once more, the same regression as that of Problem 3.8, but with $\boldsymbol\Sigma_{\boldsymbol\eta} = \mathbf{I}_N$.

For context, this is the regression model
$$y_n = \boldsymbol\theta^{T}\mathbf{x}_n + \eta_n, \qquad n = 1, 2, \dots, N$$
where $\boldsymbol\eta \sim \mathcal{N}(\mathbf{0}, \boldsymbol\Sigma_{\boldsymbol\eta})$ and the $\mathbf{x}_n$ are considered fixed.

Compute the MSE of the predictions $\mathbb{E}[(y-\hat{y})^2]$, where $y$ is the true response and $\hat{y}$ is the predicted value, given a test point $\mathbf{x}$ and using the LS estimator
$$\hat{\boldsymbol\theta}=(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}\text{.}$$
The LS estimator has been obtained via a set of $N$ measurements, collected in the (fixed) input matrix $\mathbf{X}$ and $\mathbf{y}$ [….] The expectation $\mathbb{E}[\cdot]$ is taken with respect to $y$, the training data $\mathcal{D}$, and the test points $\mathbf{x}$. Observe the dependence of the MSE on the dimensionality of the space.

Hint: Consider, first, the MSE given the value of a test point $\mathbf{x}$, and then take the average over all the test points.

My attempt:

Theodoridis shows on p. 81 that the generalization error $\mathbb{E}_{y\mid\mathbf{x}}\mathbb{E}_{\mathcal{D}}\left[\left(y-f(\mathbf{x};\mathcal{D})\right)^2\right]$ at $\mathbf{x}$ is
$$\mathrm{MSE}(\mathbf{x}) = \sigma^2_{\eta}+\mathbb{E}_{\mathcal{D}}\left[\left(f(\mathbf{x};\mathcal{D})-\mathbb{E}_{\mathcal{D}}f(\mathbf{x};\mathcal{D})\right)^2\right]+\left(\mathbb{E}_{\mathcal{D}}f(\mathbf{x};\mathcal{D})-\mathbb{E}[y\mid\mathbf{x}]\right)^2$$
Setting $$\hat{y}=f(\mathbf{x};\mathcal{D})=\mathbf{x}^{T}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}$$
we obtain
$$\mathbb{E}_{\mathcal{D}}\hat{y}=\mathbf{x}^{T}\mathbb{E}_{\mathcal{D}}\left[(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}\right]=\mathbf{x}^{T}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{X}\boldsymbol\theta=\mathbf{x}^{T}\boldsymbol\theta$$
so $$\left(f(\mathbf{x};\mathcal{D})-\mathbb{E}_{\mathcal{D}}f(\mathbf{x};\mathcal{D})\right)^2 = \left\{\mathbf{x}^{T}\left[(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}-\boldsymbol\theta\right]\right\}^2$$
This looks disgusting (and not easily simplified), so I'm guessing I'm doing something wrong. We also have $\sigma^2_\eta = 1$ by assumption.

Edit: This appears to be solved in the 12th printing (Jan 2017) of Elements of Statistical Learning by Hastie et al. (see here), equation (2.47) on p. 37, but they seem to have skipped showing the details (not to mention, I find the notation confusing).
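
A simulation can at least check a candidate closed form (a sketch assuming, as in ESL's in-sample argument, that the test point is drawn uniformly from the training inputs; averaging $\mathbf{x}^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}$ over the rows of $\mathbf{X}$ gives $p/N$, suggesting $\mathrm{MSE}\approx\sigma^2_\eta(1+p/N)$, which exposes the dependence on the dimensionality):

```python
import numpy as np

rng = np.random.default_rng(4)
N, p, reps = 200, 10, 20_000
sigma2 = 1.0                                  # Sigma_eta = I_N: unit noise
theta = rng.normal(0, 1, p)
X = rng.normal(0, 1, (N, p))                  # fixed input matrix

sq_err = np.empty(reps)
for r in range(reps):
    y = X @ theta + rng.normal(0, 1, N)       # fresh training responses
    theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # LS estimate
    x = X[rng.integers(N)]                    # test point from the inputs
    y_new = x @ theta + rng.normal(0, 1)      # independent response at x
    sq_err[r] = (y_new - x @ theta_hat) ** 2

print(sq_err.mean())                  # empirical MSE
print(sigma2 * (1 + p / N))           # 1.05: sigma^2 * (1 + p/N)
```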



#StackBounty: #probability #variance #random-matrix Variance of Random Matrix

Bounty: 50

Let's consider independent random vectors $\hat{\boldsymbol\theta}_i$, $i = 1, \dots, m$, which are all unbiased for $\boldsymbol\theta$ and such that
$$\mathbb{E}\left[\left(\hat{\boldsymbol\theta}_i - \boldsymbol\theta\right)^{T}\left(\hat{\boldsymbol\theta}_i - \boldsymbol\theta\right)\right] = \sigma^2\text{.}$$
Let $\mathbf{1}_{n \times p}$ be the $n \times p$ matrix of all ones.

Consider the problem of finding
$$\mathbb{E}\left[\left(\hat{\boldsymbol\theta} - \boldsymbol\theta\right)^{T}\left(\hat{\boldsymbol\theta} - \boldsymbol\theta\right)\right]$$
where
$$\hat{\boldsymbol\theta} = \dfrac{1}{m}\sum_{i=1}^{m}\hat{\boldsymbol\theta}_i\text{.}$$

My attempt is to notice the fact that
$$\hat{\boldsymbol\theta} = \dfrac{1}{m}\underbrace{\begin{bmatrix}
\hat{\boldsymbol\theta}_1 & \hat{\boldsymbol\theta}_2 & \cdots & \hat{\boldsymbol\theta}_m
\end{bmatrix}}_{\mathbf{S}}\mathbf{1}_{m \times 1}$$
and thus
$$\text{Var}(\hat{\boldsymbol\theta}) = \dfrac{1}{m^2}\text{Var}(\mathbf{S}\mathbf{1}_{m \times 1})\text{.}$$
How does one find the variance of a random matrix times a constant vector? You may assume that I am familiar with finding variances of linear transformations of a random vector: i.e., if $\mathbf{x}$ is a random vector, $\mathbf{b}$ a vector of constants, and $\mathbf{A}$ a matrix of constants, assuming all are conformable,
$$\mathbb{E}[\mathbf{A}\mathbf{x}+\mathbf{b}] = \mathbf{A}\mathbb{E}[\mathbf{x}]+\mathbf{b}$$
$$\mathrm{Var}\left(\mathbf{A}\mathbf{x}+\mathbf{b}\right)=\mathbf{A}\mathrm{Var}(\mathbf{x})\mathbf{A}^{\prime}$$
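
For reference, a quick numerical check (dimensions, $\boldsymbol\theta$, and the per-estimator covariance are invented toy values) of the natural conjecture $\mathbb{E}[(\hat{\boldsymbol\theta}-\boldsymbol\theta)^T(\hat{\boldsymbol\theta}-\boldsymbol\theta)]=\sigma^2/m$, which would follow from independence because the cross terms have zero expectation:

```python
import numpy as np

rng = np.random.default_rng(5)
p, m, reps = 3, 10, 200_000
theta = np.array([1.0, -2.0, 0.5])              # invented true parameter
Sigma = np.diag([0.5, 1.0, 1.5])                # per-estimator covariance
sigma2 = np.trace(Sigma)                        # E||theta_hat_i - theta||^2

# reps independent replications of m unbiased estimators, averaged over i.
noise = rng.multivariate_normal(np.zeros(p), Sigma, size=(reps, m))
theta_bar = theta + noise.mean(axis=1)

mse = np.mean(np.sum((theta_bar - theta) ** 2, axis=1))
print(mse, sigma2 / m)                          # both ~ 0.3
```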



#StackBounty: #variance #pooling #dependent Pooled variance of correlated samples

Bounty: 100

I have two samples originating from the following multivariate normal distribution:

$$
(X_1, X_2) \sim \mathcal{N}\left(0,
\left[ \begin{matrix}
\sigma^2 & \rho\sigma^2 \\
\rho\sigma^2 & \sigma^2
\end{matrix} \right]
\right)
$$

(I am using this multivariate normal to simulate an autoregressive process)

What I am trying to check is what happens to the total variance of the pooled sample $X$ when considering $X_1$ and $X_2$ independent instead of correlated by $\rho$.

I can compute the pooled variance of two independent samples pretty easily using the weighted mean of the variances of the two samples:
$$
\sigma_X = \frac{n\sigma^2 + n\sigma^2}{2n} = \sigma^2
$$

But I can't find any lead on how to compute the pooled variance when the samples are correlated. I tried finding a solution using the general expression for the variance of a sample, but I just end up with the weighted mean of variances. I am missing the moment where the independence of the samples is assumed; could someone help me with this?

Here are my computations.

Let $(X_1, X_2)$ be two samples from a multivariate distribution with means $(0,0)$, variances $(\sigma_1^2, \sigma_2^2)$, covariance $\sigma_{1,2}$, and sample sizes $(n, m)$.
Now let $X$ be the pooled sample of size $p = n + m$, ordered so that elements $1:n$ are elements of $X_1$ and elements $n+1:n+m$ are elements of $X_2$. I am trying to estimate $\sigma_X$, the variance of $X$:

\begin{align}
\sigma_X &= \frac{1}{p}\sum_{i = 1}^{p} (x_i - \mu)^2 \\
&= \frac{1}{p}\sum_{i = 1}^{p} x_i^2 \\
&= \frac{1}{p}\left( \sum_{i = 1}^{n} x_i^2 + \sum_{i = n+1}^{p=n+m} x_i^2 \right) \\
&= \frac{1}{p}\left( n\sigma_1^2 + m\sigma_2^2 \right)\\
&= \frac{n\sigma_1^2 + m\sigma_2^2}{p}
\end{align}

Here I am using the assumption that the mean of $X$ is zero, because

\begin{align}
\mu &= \frac{1}{p}\sum_{i=1}^{p}x_i \\
&= \frac{1}{p} \left( \sum_{i=1}^{n}x_i + \sum_{i=n+1}^{p=n+m}x_i\right) \\
&= \frac{1}{p} \left( n \cdot 0 + m \cdot 0\right) \\
&= 0
\end{align}
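
A simulation sketch (with invented $\sigma^2$, $\rho$, and $n$) for sanity-checking whatever formula an answer proposes; it suggests the expected pooled variance stays close to $\sigma^2$ even under correlation, with $\rho$ showing up mainly in the spread of the estimate rather than its expectation:

```python
import numpy as np

rng = np.random.default_rng(6)
sigma2, rho, n, reps = 4.0, 0.7, 500, 5_000   # invented toy values
cov = np.array([[sigma2, rho * sigma2],
                [rho * sigma2, sigma2]])

pooled_var = np.empty(reps)
for r in range(reps):
    pairs = rng.multivariate_normal([0.0, 0.0], cov, n)  # n correlated pairs
    pooled = pairs.ravel()                               # pool X1 and X2
    pooled_var[r] = pooled.var()                         # (1/p) * sum of squares

print(pooled_var.mean(), sigma2)   # expectation stays ~ sigma^2 despite rho
print(pooled_var.std())            # positive rho widens the estimate's spread
```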



#StackBounty: #regression #multiple-regression #variance #residuals #bias Is it possible to decompose fitted residuals into bias and variance?

Bounty: 50

I’d like to classify data points as either needing a more complex model, or not needing a more complex model.

My current thinking is to fit all the data to a simple linear model, and observe the size of the residuals to make this classification.

I then did some reading about the bias and variance contributions to error, and realized that if I could calculate bias directly, it might be a better measure than working with the total error (residual or standardized residual).

Is it possible to estimate bias directly with a linear model? With or without test data? Would cross validation help here?

If not, can one use an averaged bootstrapping ensemble of linear models (I think it’s called bagging) to approximate bias?
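
To make the last idea concrete, here is a minimal bagging sketch on invented toy data (quadratic truth, so the linear model is genuinely biased). The spread of the bootstrap predictions estimates the variance contribution; the gap between the averaged prediction and the observation is only a bias *proxy*, since it still contains the irreducible noise, which is where held-out or replicated data would help.

```python
import numpy as np

rng = np.random.default_rng(8)
n, B = 200, 500
x = rng.uniform(-2, 2, n)
y = x**2 + rng.normal(0, 0.3, n)      # quadratic truth: the linear fit is biased

preds = np.empty((B, n))
for b in range(B):
    idx = rng.integers(n, size=n)               # bootstrap resample
    coef = np.polyfit(x[idx], y[idx], deg=1)    # simple linear model
    preds[b] = np.polyval(coef, x)              # predict at the original points

mean_pred = preds.mean(axis=0)    # estimates E_D[f_hat(x_i)]
var_hat = preds.var(axis=0)       # per-point variance estimate
bias_proxy = mean_pred - y        # bias + noise: a proxy, not the bias itself

# Points where |bias_proxy| dominates the variance and noise scales would be
# the candidates for "needs a more complex model".
print(np.abs(bias_proxy).mean(), np.sqrt(var_hat).mean())
```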



#StackBounty: #self-study #variance #sampling #mean Relationship between variance of mean and mean variance

Bounty: 50

In ranked set sampling, we select $n$ random sets, each of size $n$. Then we choose the largest unit from the 1st set, the 2nd largest from the 2nd set, and so on, taking the $n$th largest from the $n$th set. This sampling procedure was first introduced by McIntyre (1952). The reference is A method for unbiased selective sampling, using ranked sets. Australian Journal of Agricultural Research, 3(4), 385-390. In the Method section (page 2) of this paper, it is written that

The variance of the mean of five quadrats one from each subdistribution is one-fifth of the mean variance of these distributions. This may be contrasted with the variance of the mean of five random samples, that is, one-fifth of the variance of the parent population.

Can anyone please illustrate how the variance of the mean of five quadrats, one from each subdistribution, equals one-fifth of the mean variance of these distributions?

And also, what does this sentence "This may be contrasted with the variance of the mean of five random samples, that is, one-fifth of the variance of the parent population." mean?
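
For what it's worth, here is the one-line calculation I suspect McIntyre has in mind, assuming the five draws $Y_{(1)},\dots,Y_{(5)}$ (one from each rank's subdistribution) are independent:

$$\mathrm{Var}\left(\frac{1}{5}\sum_{i=1}^{5}Y_{(i)}\right)=\frac{1}{25}\sum_{i=1}^{5}\mathrm{Var}\left(Y_{(i)}\right)=\frac{1}{5}\cdot\underbrace{\frac{1}{5}\sum_{i=1}^{5}\mathrm{Var}\left(Y_{(i)}\right)}_{\text{mean variance of the subdistributions}}$$

By contrast, for five i.i.d. draws from the parent population, $\mathrm{Var}(\bar{Y})=\sigma^2/5$, one-fifth of the parent variance, which appears to be the comparison in the second quoted sentence.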



#StackBounty: #bayesian #variance #random-forest Using MCMC to generate a synthetic training set

Bounty: 50

I have a specific question about an important point made in [http://arxiv.org/pdf/1507.06173.pdf, Sec. 4]. To summarize, let’s consider a signal model

$\overrightarrow{R} \sim p(\overrightarrow{R} \vert t)$,

where $\overrightarrow{R} \in \mathbb{R}^n$ collects some noisy sensor measurements and is modeled as a multivariate Gaussian random variable with mean

$\mathbb{E}[\overrightarrow{R} \vert t] = \overrightarrow{\mu}(t)$

and variance

$\mathbb{V}[\overrightarrow{R} \vert t] = \Sigma(\overrightarrow{\mu}(t))$.

The elements of $\overrightarrow{\mu}(t)$ and $\Sigma(\overrightarrow{\mu})$ are (known) nonlinear functions of $t$. The goal is to find an estimator for the unknown $t$. The authors propose to use the MCMC algorithm to estimate the posterior $p(t \vert \overrightarrow{R})$ given some prior $p(t)$ and then compute the Bayesian mean

$\displaystyle\hat{t}=\int t\, p(t \vert \overrightarrow{R})\, dt$

Since running MCMC would be too slow in a real-time image processing application, they build a training set and then use a random forest regressor to make predictions as follows.
First, a value $t_i$ is sampled from the prior; then $\overrightarrow{R}_i$ is sampled from the conditional distribution (likelihood):

$t_i \sim p(t)$,

$\overrightarrow{R}_i \sim p(\overrightarrow{R} \vert t_i)$.

Now the Bayesian mean is computed from the posterior:

$\displaystyle\hat{t}_i=\int t\, p(t \vert \overrightarrow{R}_i)\, dt$

This process is repeated to build the training set

$(\overrightarrow{R}_i, \hat{t}_i), \qquad i=0,\ldots,N$.

Now to my question: one could just as well use the training set

$(\overrightarrow{R}_i, t_i)$

where $t_i$ is the value sampled from the prior, thereby avoiding running MCMC. According to the authors, this would increase the variance of the output of the random forest regression algorithm. Is there a formal way to prove that? In other words, how can I estimate the variance of the output produced by the regression algorithm obtained using the two different training sets?
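
Not a proof, but a toy experiment can make the claim plausible (a sketch with a 1-D conjugate-normal stand-in for the paper's model, so the posterior mean $\hat{t}_i$ is available in closed form instead of via MCMC; the per-tree spread of the forest serves as a crude proxy for the output variance):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
N = 5_000
tau2, s2 = 1.0, 0.5              # prior variance of t, noise variance of R | t

t = rng.normal(0.0, np.sqrt(tau2), N)        # t_i ~ p(t)
R = t + rng.normal(0.0, np.sqrt(s2), N)      # R_i ~ p(R | t_i), 1-D toy model

# Conjugate-normal posterior mean plays the role of the MCMC estimate t_hat_i.
t_hat = (tau2 / (tau2 + s2)) * R

rf_raw = RandomForestRegressor(random_state=0).fit(R[:, None], t)
rf_post = RandomForestRegressor(random_state=0).fit(R[:, None], t_hat)

grid = np.linspace(-3, 3, 200)[:, None]

def tree_spread(rf, X):
    """Average per-point variance across the forest's trees."""
    preds = np.stack([tree.predict(X) for tree in rf.estimators_])
    return preds.var(axis=0).mean()

print(tree_spread(rf_raw, grid))    # larger: raw prior draws as targets
print(tree_spread(rf_post, grid))   # smaller: posterior means as targets
```

The intuition is that $t_i = \hat{t}_i + (\text{posterior noise})$, so the raw targets are strictly noisier than the posterior means, and trees fit to noisier targets disagree more.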

