#StackBounty: #correlation #variance #standard-deviation #kurtosis Correlation coefficient with standard deviation

Bounty: 50

I quite often find myself testing hypotheses in which the standard deviation of one (Normally distributed) variable is linked to (the mean of) another variable. I would like to be able to express the strength of this association by means of an index between [-1, 1], similar in spirit to a correlation coefficient. I feel like I can’t be the first one with this problem, so my first question is: does something like this exist? My second question is whether something I’ve come up with myself seems reasonable.

To express the problem more precisely, let $Z$ be a normally distributed variable:
$$
Z \sim N\left(0,\sigma^2\right)
$$
where the standard deviation $\sigma$ is a linear function of some other variables:
$$
\sigma = X\beta + \varepsilon
$$
where $X=\{x_1, x_2, \dots, x_p\}$ is a set of predictor variables, and $\beta$ is a vector of linear coefficients on these predictors. So compared to the familiar linear model, the difference is that we now have a linear prediction for the second, rather than the first, moment of the distribution of $Z$.

Given some observations of $Z$ and $X$, we can find the maximum likelihood estimate of $\beta$, which we'll denote $\hat{\beta}$. Now the question is: how much of the 'variance in variance' of $Z$ is explained by this linear model? This leads me to the idea of using kurtosis. That is, because $Z$ is distributed as a mixture of Normals with different SDs and a common mean, it will be leptokurtic and thus have excess kurtosis w.r.t. a Normal distribution with constant variance. However, if we divide each observation of $Z$ by its SD (i.e. $\dot{Z}_i=\frac{Z_i}{\sigma_i}$, where $\sigma_i=X_i\beta$), we should be able to reduce its kurtosis (to the point where, if the changes in variance of $Z$ are perfectly predicted by our fitted model, we should be able to get rid of 100% of the excess kurtosis).

So the index I'm proposing (analogous to $R^2$) is:
$$
\xi^2 = 1 - \frac{\left|\mathrm{Kurt}[Z/\hat{\sigma}]-3\right|}{\mathrm{Kurt}[Z]-3}
$$
where $\hat{\sigma}=X\hat{\beta}$. If our model explains no "variance in variance" at all, then the kurtosis should be just as high after we transform $Z$ as before, in which case $\xi^2=0$. If we managed to explain away all the changes in variance, then $Z/\hat{\sigma}$ should be perfectly Normally distributed (with kurtosis of 3), and thus $\xi^2=1$.
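
As a sanity check, here is a minimal simulation sketch of the proposal. The single-predictor setup and all names are illustrative assumptions, not part of the question: it fits $\beta$ by maximum likelihood with scipy, then computes $\xi^2$ from the sample excess kurtosis.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import kurtosis

rng = np.random.default_rng(0)

# Hypothetical setup: sigma_i is a linear function of one predictor
n = 5000
X = np.column_stack([np.ones(n), rng.uniform(0.0, 1.0, n)])
beta_true = np.array([0.5, 2.0])
Z = rng.normal(0.0, X @ beta_true)

def nll(beta):
    # negative log-likelihood of Z_i ~ N(0, (X_i beta)^2)
    sigma = X @ beta
    if np.any(sigma <= 0):
        return np.inf
    return np.sum(np.log(sigma) + 0.5 * (Z / sigma) ** 2)

beta_hat = minimize(nll, x0=np.array([1.0, 1.0]), method="Nelder-Mead").x
sigma_hat = X @ beta_hat

# fisher=True returns excess kurtosis, i.e. Kurt - 3
xi2 = 1 - abs(kurtosis(Z / sigma_hat, fisher=True)) / kurtosis(Z, fisher=True)
print(xi2)  # close to 1 when the fitted model captures the changing SD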

Does that seem reasonable? Did I just re-invent the wheel (or a dumb version of a wheel)?



#StackBounty: #self-study #variance #bias Find the MSE of a true response and its predicted value using OLS estimation

Bounty: 100

From Theodoridis’ Machine Learning, exercise 3.26.

Consider, once more, the same regression as that of Problem 3.8, but with $\boldsymbol\Sigma_{\boldsymbol\eta} = \mathbf{I}_N$.

For context, this is the regression model
$$y_n = \boldsymbol\theta^{T}\mathbf{x}_n + \eta_n\text{,} \qquad n = 1, 2, \dots, N$$
where $\boldsymbol\eta \sim \mathcal{N}(\mathbf{0}, \boldsymbol\Sigma_{\boldsymbol\eta})$ and the $\mathbf{x}_n$ are considered fixed.

Compute the MSE of the predictions $\mathbb{E}[(y-\hat{y})^2]$, where $y$ is the true response and $\hat{y}$ is the predicted value, given a test point $\mathbf{x}$ and using the LS estimator
$$\hat{\boldsymbol\theta}=(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}\text{.}$$
The LS estimator has been obtained via a set of $N$ measurements, collected in the (fixed) input matrix $\mathbf{X}$ and $\mathbf{y}$ [...]. The expectation $\mathbb{E}[\cdot]$ is taken with respect to $y$, the training data $\mathcal{D}$, and the test points $\mathbf{x}$. Observe the dependence of the MSE on the dimensionality of the space.

Hint: Consider, first, the MSE given the value of a test point $\mathbf{x}$, and then take the average over all the test points.

My attempt:

Theodoridis shows on p. 81 that the generalization error $\mathbb{E}_{y\mid\mathbf{x}}\mathbb{E}_{\mathcal{D}}\left[\left(y-f(\mathbf{x};\mathcal{D})\right)^2\right]$ at $\mathbf{x}$ is
$$\mathrm{MSE}(\mathbf{x}) = \sigma^2_{\eta}+\mathbb{E}_{\mathcal{D}}\left[\left(f(\mathbf{x};\mathcal{D})-\mathbb{E}_{\mathcal{D}}f(\mathbf{x};\mathcal{D})\right)^2\right]+\left(\mathbb{E}_{\mathcal{D}}f(\mathbf{x};\mathcal{D})-\mathbb{E}[y\mid\mathbf{x}]\right)^2$$
Setting
$$\hat{y}=f(\mathbf{x};\mathcal{D})=\mathbf{x}^{T}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}$$
we obtain
$$\mathbb{E}_{\mathcal{D}}\hat{y}=\mathbf{x}^{T}\mathbb{E}_{\mathcal{D}}\left[(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}\right]=\mathbf{x}^{T}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{X}\boldsymbol\theta=\mathbf{x}^{T}\boldsymbol\theta$$
so
$$\left(f(\mathbf{x};\mathcal{D})-\mathbb{E}_{\mathcal{D}}f(\mathbf{x};\mathcal{D})\right)^2 = \left\{\mathbf{x}^{T}\left[(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}-\boldsymbol\theta\right]\right\}^2$$
This looks disgusting (and not easily simplified), so I'm guessing I'm doing something wrong. We also have $\sigma^2_\eta = 1$ by assumption.

Edit: This appears to be solved in the 12th printing (Jan 2017) of The Elements of Statistical Learning by Hastie et al. (see here), equation (2.47) on p. 37, but they seem to have skipped showing the details (not to mention, I find the notation confusing).
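
As a rough numerical cross-check, here is a Monte Carlo sketch. It assumes Gaussian test points drawn like the training rows; the comparison value $\sigma^2(1 + p/N)$ is only the large-$N$ approximation quoted around ESL eq. (2.47), not an exact answer.

import numpy as np

rng = np.random.default_rng(1)
N, p, sigma2, trials = 50, 3, 1.0, 10_000   # sigma^2_eta = 1, as in the exercise

sq_err = 0.0
theta = np.ones(p)                          # some fixed true parameter
for _ in range(trials):
    X = rng.normal(size=(N, p))             # one draw of the training inputs
    y = X @ theta + rng.normal(0.0, np.sqrt(sigma2), N)
    theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    x = rng.normal(size=p)                  # test point
    y_new = x @ theta + rng.normal(0.0, np.sqrt(sigma2))
    sq_err += (y_new - x @ theta_hat) ** 2

print(sq_err / trials)        # empirical MSE
print(sigma2 * (1 + p / N))   # approximate theoretical value, 1.06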



#StackBounty: #probability #variance #random-matrix Variance of Random Matrix

Bounty: 50

Let's consider independent random vectors $\hat{\boldsymbol\theta}_i$, $i = 1, \dots, m$, which are all unbiased for $\boldsymbol\theta$ and such that
$$\mathbb{E}\left[\left(\hat{\boldsymbol\theta}_i - \boldsymbol\theta\right)^{T}\left(\hat{\boldsymbol\theta}_i - \boldsymbol\theta\right)\right] = \sigma^2\text{.}$$
Let $\mathbf{1}_{n \times p}$ be the $n \times p$ matrix of all ones.

Consider the problem of finding
$$\mathbb{E}\left[\left(\hat{\boldsymbol\theta} - \boldsymbol\theta\right)^{T}\left(\hat{\boldsymbol\theta} - \boldsymbol\theta\right)\right]$$
where
$$\hat{\boldsymbol\theta} = \dfrac{1}{m}\sum_{i=1}^{m}\hat{\boldsymbol\theta}_i\text{.}$$

My attempt is to notice the fact that
$$\hat{\boldsymbol\theta} = \dfrac{1}{m}\underbrace{\begin{bmatrix} \hat{\boldsymbol\theta}_1 & \hat{\boldsymbol\theta}_2 & \cdots & \hat{\boldsymbol\theta}_m \end{bmatrix}}_{\mathbf{S}}\mathbf{1}_{m \times 1}$$
and thus
$$\mathrm{Var}(\hat{\boldsymbol\theta}) = \dfrac{1}{m^2}\mathrm{Var}(\mathbf{S}\mathbf{1}_{m \times 1})\text{.}$$
How does one find the variance of a random matrix times a constant vector? You may assume that I am familiar with finding variances of linear transformations of a random vector: i.e., if $\mathbf{x}$ is a random vector, $\mathbf{b}$ a vector of constants, and $\mathbf{A}$ a matrix of constants, assuming all are conformable,
$$\mathbb{E}[\mathbf{A}\mathbf{x}+\mathbf{b}] = \mathbf{A}\mathbb{E}[\mathbf{x}]+\mathbf{b}$$
$$\mathrm{Var}\left(\mathbf{A}\mathbf{x}+\mathbf{b}\right)=\mathbf{A}\mathrm{Var}(\mathbf{x})\mathbf{A}^{\prime}$$
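
For intuition, a small Monte Carlo sketch (the dimension, noise level, and estimator construction are illustrative assumptions): averaging $m$ independent unbiased estimators with $\mathbb{E}\left[(\hat{\boldsymbol\theta}_i - \boldsymbol\theta)^{T}(\hat{\boldsymbol\theta}_i - \boldsymbol\theta)\right] = \sigma^2$ should give an expected squared error of $\sigma^2/m$.

import numpy as np

rng = np.random.default_rng(2)
p, m, trials = 4, 5, 200_000
theta = np.arange(1.0, p + 1)     # arbitrary true parameter

# theta_hat_i = theta + noise, so E||theta_hat_i - theta||^2 = p * tau^2 = sigma^2
tau = 0.7
sigma2 = p * tau ** 2

est = theta + tau * rng.normal(size=(trials, m, p))  # m estimators per trial
theta_bar = est.mean(axis=1)                         # their average
mse = np.sum((theta_bar - theta) ** 2, axis=1).mean()

print(mse)          # empirical
print(sigma2 / m)   # sigma^2 / m = 0.392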



#StackBounty: #variance #pooling #dependent Pooled variance of correlated samples

Bounty: 100

I got two samples originating from the following multivariate normal distribution:

$$
(X_1, X_2) \sim \mathcal{N}\left(0,
\left[ \begin{matrix}
\sigma^2 & \rho\sigma^2 \\
\rho\sigma^2 & \sigma^2
\end{matrix} \right]
\right)
$$

(I am using this multivariate normal to simulate an autoregressive process)

What I am trying to check is what happens to the total variance of the pooled sample $X$ when considering $X_1$ and $X_2$ independent instead of correlated by $\rho$.

I can compute the pooled variance of two independent samples pretty easily using the weighted mean of the variances of each sample:

$$
\sigma_X = \frac{n\sigma + n\sigma}{2n} = \sigma
$$

But I can't find any lead on how to compute the pooled variance when the samples are correlated. I tried finding a solution using the general expression of the variance of a sample, but I just end up with the weighted mean of variances. I am missing the moment where the independence of the samples is assumed; could someone help me with this?

Here are my computations.

Let's have $(X_1, X_2)$ two samples from a multivariate normal with means $(0,0)$, variances $(\sigma_1^2, \sigma_2^2)$, covariance $\sigma_{1,2}$, and sample sizes $(n, m)$.
Let's now have $X$ the pooled sample of size $p = n + m$, ordered so that elements $1:n$ are elements of $X_1$ and elements $n+1:n+m$ are elements of $X_2$. I am trying to estimate $\sigma_X$, the variance of $X$.

\begin{align}
\sigma_X &= \frac{1}{p}\sum_{i = 1}^{p} (x_i - \mu)^2 \\
&= \frac{1}{p}\sum_{i = 1}^{p} x_i^2 \\
&= \frac{1}{p}\left( \sum_{i = 1}^{n} x_i^2 + \sum_{i = n+1}^{p=n+m} x_i^2 \right) \\
&= \frac{1}{p}\left( n\sigma_1 + m\sigma_2 \right) \\
&= \frac{n\sigma_1 + m\sigma_2}{p}
\end{align}

Using the assumption that the mean of $X$ is zero because

\begin{align}
\mu &= \frac{1}{p}\sum_{i=1}^{p}x_i \\
&= \frac{1}{p} \left( \sum_{i=1}^{n}x_i + \sum_{i=n+1}^{p=n+m}x_i \right) \\
&= \frac{1}{p} \left( n \cdot 0 + m \cdot 0 \right) \\
&= 0
\end{align}
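
A quick simulation sketch of the setup (the parameter values are hypothetical): with the mean known to be zero, as in the derivation above, the pooled second moment estimates $\sigma^2$ whether or not the two samples are correlated; $\rho$ only affects the variability of that estimate across repetitions.

import numpy as np

rng = np.random.default_rng(3)
n, sigma2, rho = 100_000, 4.0, 0.8
cov = sigma2 * np.array([[1.0, rho], [rho, 1.0]])

# n pairs (X1, X2) from the bivariate normal above
pairs = rng.multivariate_normal([0.0, 0.0], cov, size=n)
pooled = pairs.ravel()

# with a known zero mean, the pooled second moment estimates sigma^2
print(np.mean(pooled ** 2))   # ~4.0, regardless of rho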



#StackBounty: #regression #multiple-regression #variance #residuals #bias Is it possible to decompose fitted residuals into bias and variance?

Bounty: 50

I’d like to classify data points as either needing a more complex model, or not needing a more complex model.

My current thinking is to fit all the data to a simple linear model, and observe the size of the residuals to make this classification.

I then did some reading about the bias and variance contributions to error, and realized that if I could calculate bias directly, it might be a better measure than working with the total error (residual or standardized residual).

Is it possible to estimate bias directly with a linear model? With or without test data? Would cross-validation help here?

If not, can one use an averaged bootstrapping ensemble of linear models (I think it’s called bagging) to approximate bias?
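
Here is a sketch of that bagging idea (the data, model, and variable names are hypothetical, and sklearn is assumed). One caveat: comparing the averaged fit to the observed $y$ mixes irreducible noise into the "bias", so this only approximates bias when the noise level is small or separately estimated.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.utils import resample

rng = np.random.default_rng(4)

# Toy data: the true relationship is quadratic, so a linear fit is biased
n = 200
x = np.sort(rng.uniform(-2, 2, n))
y = x ** 2 + rng.normal(0, 0.3, n)
X = x.reshape(-1, 1)

# Bagging: refit the linear model on many bootstrap resamples
B = 500
preds = np.empty((B, n))
for b in range(B):
    Xb, yb = resample(X, y, random_state=b)
    preds[b] = LinearRegression().fit(Xb, yb).predict(X)

point_variance = preds.var(axis=0)    # spread of the fits at each point
point_bias = preds.mean(axis=0) - y   # average fit vs. observation
# points where |point_bias| is large relative to the noise level are
# candidates for "needs a more complex model"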



#StackBounty: #self-study #variance #sampling #mean Relationship between variance of mean and mean variance

Bounty: 50

In ranked set sampling, we select $n$ random sets, each of size $n$. Then we choose the largest unit from the 1st set, the 2nd largest from the 2nd set, and so on, down to the $n$th largest from the $n$th set. This sampling procedure was first introduced by McIntyre (1952). The reference is A method for unbiased selective sampling, using ranked sets. Australian Journal of Agricultural Research, 3(4), 385-390. In the Method section (page 2) of this paper, it is written that

The variance of the mean of five quadrats, one from each subdistribution, is one-fifth of the mean variance of these distributions. This may be contrasted with the variance of the mean of five random samples, that is, one-fifth of the variance of the parent population.

Can anyone please illustrate how the variance of the mean of five quadrats, one from each subdistribution, equals one-fifth of the mean variance of these distributions?

Also, what does the sentence "This may be contrasted with the variance of the mean of five random samples, that is, one-fifth of the variance of the parent population" mean?
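
To illustrate the first claim with a small simulation (the five subdistributions below are hypothetical normals): the mean of one draw from each of five independent subdistributions has variance $\frac{1}{25}\sum_i \sigma_i^2$, which is one-fifth of the average of the five variances.

import numpy as np

rng = np.random.default_rng(5)

# five hypothetical subdistributions with different means and variances
means = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sds = np.array([0.5, 1.0, 1.5, 2.0, 2.5])

draws = rng.normal(means, sds, size=(1_000_000, 5))  # one unit from each
print(draws.mean(axis=1).var())   # ~0.55
print(np.mean(sds ** 2) / 5)      # (1/5) * mean variance = 0.55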



#StackBounty: #bayesian #variance #random-forest Using MCMC to generate a synthetic training set

Bounty: 50

I have a specific question about an important point made in [http://arxiv.org/pdf/1507.06173.pdf, Sec. 4]. To summarize, let’s consider a signal model

$\overrightarrow{R} \sim p(\overrightarrow{R} \vert t)$,

where $\overrightarrow{R} \in \mathbb{R}^n$ collects some noisy sensor measurements and is modeled as a multivariate Gaussian random variable with mean

$\mathbb{E}[\overrightarrow{R} \vert t] = \overrightarrow{\mu}(t)$

and variance

$\mathbb{V}[\overrightarrow{R} \vert t] = \Sigma(\overrightarrow{\mu}(t))$.

The elements of $\overrightarrow{\mu}(t)$ and $\Sigma(\overrightarrow{\mu})$ are (known) nonlinear functions of $t$. The goal is to find an estimator for the unknown $t$. The authors propose to use the MCMC algorithm to estimate the posterior $p(t \vert \overrightarrow{R})$ given some prior $p(t)$ and then compute the Bayesian mean

$\displaystyle\hat{t}=\int t\, p(t \vert \overrightarrow{R})\, dt$

Since running MCMC would be too slow in a real-time image processing application, they build a training set and then use a random forest regressor to make predictions as follows.
First, a value $t_i$ is sampled from the prior, then $\overrightarrow{R}_i$ is sampled from the conditional distribution (likelihood):

$t_i \sim p(t)$,

$\overrightarrow{R}_i \sim p(\overrightarrow{R} \vert t_i)$.

Now the Bayesian mean is computed from the posterior:

$\displaystyle\hat{t}_i=\int t\, p(t \vert \overrightarrow{R}_i)\, dt$

This process is repeated to build the training set

$(\overrightarrow{R}_i, \hat{t}_i)\qquad i=0,\ldots,N$.

Now to my question: one could as well use the training set

$(\overrightarrow{R}_i, t_i)$

where $t_i$ is the value sampled from the prior, thereby avoiding the MCMC run altogether. According to the authors this would increase the variance of the output of the random forest regression algorithm. Is there a formal way to prove that? In other words, how can I estimate the variance of the output produced by the regression algorithms obtained using the two different training sets?
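
One empirical (not formal) way to probe this is a sketch with a hypothetical one-dimensional toy model standing in for the paper's sensor model: retrain both variants on many fresh training sets and compare the variance of their predictions at fixed test inputs.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(6)
s, N, reps = 0.3, 2000, 30
grid = np.linspace(0, 1, 401)               # integration grid for E[t | R]
R_test = np.linspace(0.1, 0.9, 9)[:, None]  # fixed test points

preds_prior, preds_post = [], []
for _ in range(reps):
    t = rng.uniform(0, 1, N)     # t_i ~ p(t), a flat prior
    R = rng.normal(t, s)         # toy likelihood: R | t ~ N(t, s^2)
    # posterior mean by numerical integration (stands in for MCMC)
    like = np.exp(-0.5 * ((R[:, None] - grid) / s) ** 2)
    t_hat = (like * grid).sum(axis=1) / like.sum(axis=1)
    preds_prior.append(RandomForestRegressor(n_estimators=50)
                       .fit(R[:, None], t).predict(R_test))
    preds_post.append(RandomForestRegressor(n_estimators=50)
                      .fit(R[:, None], t_hat).predict(R_test))

# variance of the learned predictor across retrainings, per test point
print(np.var(preds_prior, axis=0).mean())  # noisy targets t_i
print(np.var(preds_post, axis=0).mean())   # smoothed targets t_hat_i (smaller)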



#StackBounty: #regression #machine-learning #variance #cross-validation #predictive-models Does $K$-fold CV with $K=N$ (LOO) provide th…

Bounty: 50

TL;DR: It appears that, contrary to oft-repeated advice, leave-one-out cross-validation (LOO-CV) — that is, $K$-fold CV with $K$ (the number of folds) equal to $N$ (the number of training observations) — yields estimates of the generalization error that are the least variable of any choice of $K$, not the most variable, assuming a certain stability condition on either the model/algorithm, the dataset, or both (I'm not sure which is correct as I don't really understand this stability condition).

  • Can someone clearly explain what exactly this stability condition is?
  • Is it true that linear regression is one such “stable” algorithm, implying that in that context, LOO-CV is strictly the best choice of CV as far as bias and variance of the estimates of generalization error are concerned?

The conventional wisdom is that the choice of $K$ in $K$-fold CV follows a bias-variance tradeoff, such that lower values of $K$ (approaching 2) lead to estimates of the generalization error that have more pessimistic bias, but lower variance, while higher values of $K$ (approaching $N$) lead to estimates that are less biased, but with greater variance. The conventional explanation for this phenomenon of variance increasing with $K$ is given perhaps most prominently in The Elements of Statistical Learning (Section 7.10.1):

With K=N, the cross-validation estimator is approximately unbiased for the true (expected) prediction error, but can have high variance because the N “training sets” are so similar to one another.

The implication being that the $N$ validation errors are more highly correlated, so that their sum is more variable. This line of reasoning has been repeated in many answers on this site (e.g., here, here, here, here, here, here, and here) as well as on various blogs, etc. But a detailed analysis is virtually never given; instead we get only an intuition or brief sketch of what an analysis might look like.

One can however find contradictory statements, usually citing a certain "stability" condition that I don't really understand. For example, this contradictory answer quotes a couple of paragraphs from a 2015 paper which says, among other things, "For models/modeling procedures with low instability, LOO often has the smallest variability" (emphasis added). This paper (section 5.2) seems to agree that LOO represents the least variable choice of $K$ as long as the model/algorithm is "stable." Taking yet another stance on the issue, there is also this paper (Corollary 2), which says "The variance of $k$-fold cross validation [...] does not depend on $k$," again citing a certain "stability" condition.

The explanation about why LOO might be the most variable $K$-fold CV is intuitive enough, but there is a counter-intuition. The final CV estimate of the mean squared error (MSE) is the mean of the MSE estimates in each fold. So as $K$ increases up to $N$, the CV estimate is the mean of an increasing number of random variables. And we know that the variance of a mean decreases with the number of variables being averaged over. So in order for LOO to be the most variable $K$-fold CV, it would have to be true that the increase in variance due to the increased correlation among the MSE estimates outweighs the decrease in variance due to the greater number of folds being averaged over. And it is not at all obvious that this is true.

Having become thoroughly confused thinking about all this, I decided to run a little simulation for the linear regression case. I simulated 10,000 datasets with $N$=50 and 3 uncorrelated predictors, each time estimating the generalization error using $K$-fold CV with $K$=2, 5, 10, or 50=$N$. The R code is here. Here are the resulting means and variances of the CV estimates across all 10,000 datasets (in MSE units):

         k = 2 k = 5 k = 10 k = n = 50
mean     1.187 1.108  1.094      1.087
variance 0.094 0.058  0.053      0.051

These results show the expected pattern that higher values of $K$ lead to a less pessimistic bias, but also appear to confirm that the variance of the CV estimates is lowest, not highest, in the LOO case.
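
For anyone who prefers Python, here is a minimal re-sketch of that simulation (the post's own R code is linked above; the fold construction and the number of datasets here are simplifications, so the numbers will differ slightly):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n, p, n_datasets = 50, 3, 2000

results = {k: [] for k in (2, 5, 10, 50)}   # k = 50 = n is LOO
for _ in range(n_datasets):
    X = rng.normal(size=(n, p))
    y = X @ np.ones(p) + rng.normal(size=n)
    for k in results:
        scores = cross_val_score(LinearRegression(), X, y, cv=k,
                                 scoring="neg_mean_squared_error")
        results[k].append(-scores.mean())   # CV estimate of the MSE

for k, v in results.items():
    print(k, round(np.mean(v), 3), round(np.var(v), 3))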

So it appears that linear regression is one of the “stable” cases mentioned in the papers above, where increasing $K$ is associated with decreasing rather than increasing variance in the CV estimates. But what I still don’t understand is:

  • What precisely is this “stability” condition? Does it apply to models/algorithms, datasets, or both to some extent?
  • Is there an intuitive way to think about this stability?
  • What are other examples of stable and unstable models/algorithms or datasets?
  • Is it relatively safe to assume that most models/algorithms or datasets are “stable” and therefore that $K$ should generally be chosen as high as is computationally feasible?



#StackBounty: #variance #average "Averaging" variances

Bounty: 50

I need to obtain some sort of “average” among a list of variances, but have trouble coming up with a reasonable solution. There is an interesting discussion about the differences among the three Pythagorean means (arithmetic, geometric, and harmonic) in this thread; however, I still don’t feel any of them would be a good candidate. Any suggestions?

P.S. Some context – These variances are sample variances from $n$ subjects, each of whom went through the same experimental design with roughly the same sample size $k$. In other words, there are $n$ sampling variances $\sigma_1^2$, $\sigma_2^2$, ..., $\sigma_n^2$, corresponding to those $n$ subjects. A meta-analysis has already been performed at the population level. The reason I need to obtain some kind of "average" or "summarized" sample variance is that I want to use it to calculate an index such as ICC after the meta-analysis.
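
One fact worth noting: when every subject contributes roughly the same sample size $k$, the usual degrees-of-freedom-weighted pooled variance reduces to the plain arithmetic mean of the $n$ variances. A quick sketch comparing it with the other Pythagorean means (the numbers are hypothetical):

import numpy as np
from scipy.stats import gmean, hmean

variances = np.array([1.2, 0.8, 2.5, 1.1, 0.9])  # hypothetical sample variances

print(np.mean(variances))   # arithmetic mean = pooled variance for equal k
print(gmean(variances))     # geometric mean
print(hmean(variances))     # harmonic mean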



#HackerRank: Correlation and Regression Lines solutions

import numpy as np
import scipy as sp
from scipy.stats import norm

Correlation and Regression Lines – A Quick Recap #1

Here are the test scores of 10 students in physics and history:

Physics Scores 15 12 8 8 7 7 7 6 5 3

History Scores 10 25 17 11 13 17 20 13 9 15

Compute Karl Pearson’s coefficient of correlation between these scores. Compute the answer correct to three decimal places.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 0.255

This is NOT the actual answer – just the format in which you should provide your answer.

physicsScores=[15, 12,  8,  8,  7,  7,  7,  6, 5,  3]
historyScores=[10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
print(np.corrcoef(historyScores,physicsScores)[0][1])
0.144998154581

Correlation and Regression Lines – A Quick Recap #2

Here are the test scores of 10 students in physics and history:

Physics Scores 15 12 8 8 7 7 7 6 5 3

History Scores 10 25 17 11 13 17 20 13 9 15

Compute the slope of the line of regression obtained while treating Physics as the independent variable. Compute the answer correct to three decimal places.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 0.255

This is NOT the actual answer – just the format in which you should provide your answer.

sp.stats.linregress(physicsScores,historyScores).slope
0.20833333333333331

Correlation and Regression Lines – A quick recap #3

Here are the test scores of 10 students in physics and history:

Physics Scores 15 12 8 8 7 7 7 6 5 3

History Scores 10 25 17 11 13 17 20 13 9 15

When a student scores 10 in Physics, what is his probable score in History? Compute the answer correct to one decimal place.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 0.255

This is NOT the actual answer – just the format in which you should provide your answer.

def predict(pi, x, y):
    slope, intercept, rvalue, pvalue, stderr = sp.stats.linregress(x, y)
    return slope * pi + intercept

predict(10, physicsScores, historyScores)
15.458333333333332

Correlation and Regression Lines – A Quick Recap #4

The two regression lines of a bivariate distribution are:

4x – 5y + 33 = 0 (line of y on x)

20x – 9y – 107 = 0 (line of x on y).

Estimate the value of x when y = 7. Compute the correct answer to one decimal place.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 7.2

This is NOT the actual answer – just the format in which you should provide your answer.

'''
    4x - 5y + 33 = 0  (line of y on x)
    x = ( 5y - 33 ) / 4
    y = ( 4x + 33 ) / 5

    20x - 9y - 107 = 0  (line of x on y)
    x = ( 9y + 107 ) / 20
    y = ( 20x - 107 ) / 9
'''
t = 7  # given y = 7, use the line of x on y
print((9 * t + 107) / 20)
8.5

Correlation and Regression Lines – A Quick Recap #5

The two regression lines of a bivariate distribution are:

4x – 5y + 33 = 0 (line of y on x)

20x – 9y – 107 = 0 (line of x on y).

Find the variance of y when σx = 3.

Compute the correct answer to one decimal place.

Output Format

In the text box, enter the floating point/decimal value required. Do not leave any leading or trailing spaces. Your answer may look like: 7.2

This is NOT the actual answer – just the format in which you should provide your answer.

http://www.mpkeshari.com/2011/01/19/lines-of-regression/

Q.3. If the two regression lines of a bivariate distribution are 4x – 5y + 33 = 0 and 20x – 9y – 107 = 0,

  • calculate the arithmetic means of x and y respectively;
  • estimate the value of x when y = 7;
  • find the variance of y when σx = 3.
Solution : –

We have,

4x – 5y + 33 = 0 => y = 4x/5 + 33/5 ... (i)

And

20x – 9y – 107 = 0 => x = 9y/20 + 107/20 ... (ii)

(i) Solving (i) and (ii), we get mean of x = 13 and mean of y = 17. [Ans.]

(ii) The second line is the line of x on y, so

x = (9/20) × 7 + (107/20) = 170/20 = 8.5 [Ans.]

(iii) byx = r(σy/σx), where r = √(byx · bxy) = √[(4/5)(9/20)] = √0.36 = 0.6. So 4/5 = 0.6 × σy/3, which gives σy = (4/5)(3/0.6) = 4. [Ans.]

Variance = σy² = 16.
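
A quick numeric check of that last calculation, in the same style as the earlier snippets:

import math

byx, bxy = 4 / 5, 9 / 20      # slopes of the y-on-x and x-on-y lines
r = math.sqrt(byx * bxy)      # correlation coefficient, 0.6
sigma_x = 3
sigma_y = byx * sigma_x / r   # from byx = r * (sigma_y / sigma_x)
print(sigma_y ** 2)           # ~16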