#StackBounty: #machine-learning #correlation #linear-model #canonical-correlation Canonical correlation analysis with a tiny example an…

Bounty: 150

I’ve tried reading many explanations of CCA, and I don’t understand it. For example, on Wikipedia, it refers to two “vectors” $a$ and $b$ such that $\rho = \text{corr}(a^{\top} X, b^{\top} Y)$ is maximal. But if $a$ and $X$ are vectors, isn’t $a^{\top} X$ a scalar? What does it mean for two scalars to be correlated?

Other explanations use matrix notation without any dimensions, e.g.

[image: a matrix-notation statement of CCA with no dimensions given]

Can someone explain CCA with a small example and all the dimensions provided?
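For concreteness, here is one way the dimensions can be written out in R (my own toy example, not taken from any of the explanations above): with $n$ paired observations, $X$ is an $n\times p$ data matrix and $Y$ an $n\times q$ one; $a$ is $p\times 1$ and $b$ is $q\times 1$, so the projections $Xa$ and $Yb$ are $n\times 1$ vectors (one scalar per observation), and it is the sample correlation between these two $n$-vectors that CCA maximizes.

set.seed(1)
n <- 100; p <- 3; q <- 2
X <- matrix(rnorm(n * p), n, p)    # n observations of p variables
Y <- matrix(rnorm(n * q), n, q)    # n observations of q variables

cc <- cancor(X, Y)                 # base-R canonical correlation analysis
a  <- cc$xcoef[, 1]                # p x 1 weight vector for X
b  <- cc$ycoef[, 1]                # q x 1 weight vector for Y

u <- X %*% a                       # n x 1 first canonical variate of X
v <- Y %*% b                       # n x 1 first canonical variate of Y
cor(u, v)                          # matches cc$cor[1] (up to sign)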


Get this bounty!!!

#StackBounty: #hypothesis-testing #correlation #covariance #monte-carlo #measurement-error Assessing significance when measurements are…

Bounty: 50

I am trying to estimate a function $f(x)$ at $x=0.1, 0.2, 0.3, 0.4, 0.5$.
I make this estimate through some complex procedure based on a set of independent but noisy data (with known uncertainties). The outcome is a set of correlated estimates.

The null hypothesis of my study is that $f(x)=0$. I want to assess, based on any/all of these measurements, whether $f(x)\neq 0$ at some/any $x$.

I have run $N$ Monte Carlo simulations of the measurement procedure by repeating the measurement with random realizations of the data (since the uncertainty of the resulting estimates is not known analytically). Now I have $N$ sets of estimates at these five points. I can use the $N$ trials to estimate the mean and variance of each $f(x)$ as well as the covariance between the different measurements.

Taking the mean and standard deviation across all of the $N$ trials, I find that each of the measurements is roughly $f(x)=-0.2 \pm 0.1$. At normal thresholds of significance, this would be a $2\sigma$ result and therefore not significantly different from 0. However, since all of the measurements are below zero, perhaps the result is significant after all. I do not know how to assess this, though, given that all of the measurements are correlated.
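For what it is worth, one way a joint null like this is sometimes assessed, assuming the estimates are roughly multivariate normal, is a Wald-type statistic built from the Monte Carlo covariance. A minimal R sketch (est is a hypothetical $N\times 5$ matrix holding the estimates from the $N$ trials; treating the across-trial covariance as the uncertainty of a single estimate is itself an assumption):

m <- colMeans(est)           # central estimate of f(x) at the five x values
S <- cov(est)                # across-trial covariance, taken as the estimate's uncertainty
stat <- as.numeric(t(m) %*% solve(S) %*% m)         # Wald-type statistic
pchisq(stat, df = length(m), lower.tail = FALSE)    # approximate joint p-value for f(x) = 0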


Get this bounty!!!

#StackBounty: #hypothesis-testing #correlation #multiple-comparisons #spearman-rho How to compare of two Spearman correlations matrices?

Bounty: 50

I have two non-parametric rank correlation matrices emp and sim (for example, based on Spearman’s $\rho$ rank correlation coefficient):

emp <- matrix(c(
1.0000000, 0.7771328, 0.6800540, 0.2741636,
0.7771328, 1.0000000, 0.5818167, 0.2933432,
0.6800540, 0.5818167, 1.0000000, 0.3432396,
0.2741636, 0.2933432, 0.3432396, 1.0000000), ncol=4)

sim <- matrix(c(
1.0000000, 0.7616454, 0.6545774, 0.3081403,
0.7616454, 1.0000000, 0.5360392, 0.3146167,
0.6545774, 0.5360392, 1.0000000, 0.3739758,
0.3081403, 0.3146167, 0.3739758, 1.0000000), ncol=4)

The emp matrix contains the rank correlations between the empirical values (time series); the sim matrix contains the rank correlations between the simulated values.

I have read the Q&A How to compare two or more correlation matrices?; in my case it is known that the empirical values do not come from a normal distribution, so I cannot use Box’s M test.

I need to test the null hypothesis $H_0$: matrices emp and sim are drawn from the same distribution.

Question. What test can I use? Is it possible to use the Wishart statistic?

Edit.
Following Stephan Kolassa's comment, I have run a simulation.

I compared the two Spearman correlation matrices emp and sim with Box’s M test. The test returned

# Chi-squared statistic = 2.6163, p-value = 0.9891

Then I simulated the correlation matrix sim 1000 times and plotted the distribution of the chi-squared statistic $M(1-c)\sim\chi^2(df)$.

[figure: distribution of the simulated chi-squared statistics $M(1-c)$]

After that I computed the 5% quantile of the chi-squared statistic $M(1-c)\sim\chi^2(df)$. The 5% quantile equals

quantile(dfr$stat, probs = 0.05)
#       5% 
# 1.505046

One can see that the 5% quantile is less than the obtained chi-squared statistic: 1.505046 < 2.6163 (blue line on the figure); therefore, my emp statistic $M(1-c)$ does not fall in the left tail of the $(M(1-c))_i$.

Edit 2.
Following Stephan Kolassa's second comment, I calculated the 95% quantile of the chi-squared statistic $M(1-c)\sim\chi^2(df)$ (blue line on the figure). The 95% quantile equals

quantile(dfr$stat, probs = 0.95)
#      95% 
# 7.362071

One can see that the emp statistic $M(1-c)$ does not fall in the right tail of the $(M(1-c))_i$ either.

Edit 3. I have calculated the empirical $p$-value (green line on the figure) through the empirical cumulative distribution function:

ecdf(dfr$stat)(2.6163)
[1] 0.239

One can see that $p$-value=0.239 is greater than $0.05$.
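For reference, the simulation in the edits above can be sketched roughly as follows. Both simulate_sim() (regenerating a sim matrix from the simulation model) and boxm_stat() (returning the $M(1-c)$ statistic for a pair of correlation matrices) are hypothetical placeholders, and the exact pairing used in the 1000 repetitions is an assumption on my part:

# Hypothetical placeholders: simulate_sim() regenerates a 'sim' correlation matrix,
# boxm_stat(A, B) returns Box's M(1 - c) statistic for two correlation matrices.
stat_emp <- 2.6163                                   # statistic for emp vs. sim reported above
stats    <- replicate(1000, boxm_stat(simulate_sim(), sim))
dfr      <- data.frame(stat = stats)

quantile(dfr$stat, probs = c(0.05, 0.95))            # tails used in Edits 1 and 2
ecdf(dfr$stat)(stat_emp)                             # empirical CDF value from Edit 3 (0.239)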

Edit 4.

Dominik Wied (2014). A Nonparametric Test for a Constant Correlation Matrix. Econometric Reviews. DOI: 10.1080/07474938.2014.998152

Joël Bun, Jean-Philippe Bouchaud and Mark Potters (2016). Cleaning correlation matrices. Risk.net, April 2016.

Li, David X. (1999). On Default Correlation: A Copula Function Approach. Available at SSRN: https://ssrn.com/abstract=187289 or http://dx.doi.org/10.2139/ssrn.187289

G. E. P. Box (1949). A General Distribution Theory for a Class of Likelihood Criteria. Biometrika, Vol. 36, No. 3/4, pp. 317-346.

M. S. Bartlett (1937). Properties of Sufficiency and Statistical Tests. Proc. R. Soc. Lond. A, 160, 268-282.

Robert I. Jennrich (1970). An Asymptotic χ2 Test for the Equality of Two Correlation Matrices. Journal of the American Statistical Association, 65:330, 904-912.

Edit 5.

The first paper I have found that does not assume a normal distribution:

Reza Modarres & Robert W. Jernigan (1993). A robust test for comparing correlation matrices. Journal of Statistical Computation and Simulation, 46:3-4, 169-181.


Get this bounty!!!

#StackBounty: #correlation #mathematical-statistics #simulation #copula If I have a vector of $N$ correlated probabilities. How can I t…

Bounty: 300

My ultimate goal is to be able to have a way to generate a vector of size $N$ of correlated Bernoulli random variables. One way I am doing this is to use the Gaussian copula approach. However, the Gaussian copula approach just leaves me with a vector:

$$
(p_1, \ldots, p_N) \in [0,1]^N
$$

Suppose that I have generated $(p_1, \ldots, p_N)$ such that the common correlation between them is $\rho$. Now, how can I transform these into a new vector of $0$s and $1$s? In other words, I would like:

$$
(X_1, \ldots, X_N) \in \{0,1\}^N
$$

but with the same correlation $\rho$.

One approach I thought of was to assign a hard cutoff rule such that if $p_i < 0.5$, then let $X_i = 0$ and if $p_i \geq 0.5$, then let $X_i = 1$.

This seems to work well in simulations in that it retains the correlation structure but it is very arbitrary to me what cutoff value should be chosen aside from $0.5$.
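For illustration, here is a minimal R sketch of that cutoff approach (the latent correlation $\rho$, the vector length, and the number of draws are my own choices). Note that the correlation of the thresholded variables will generally not equal the latent $\rho$ exactly; for a cutoff at the median it is roughly $\frac{2}{\pi}\arcsin\rho$.

library(MASS)
set.seed(1)
N   <- 5                                      # length of the Bernoulli vector
rho <- 0.4                                    # common latent correlation
R   <- matrix(rho, N, N); diag(R) <- 1        # equicorrelation matrix

Z <- mvrnorm(1e5, mu = rep(0, N), Sigma = R)  # draws of the latent Gaussian vector
P <- pnorm(Z)                                 # Gaussian copula: correlated Uniform(0,1) values
X <- (P >= 0.5) * 1                           # hard cutoff at 0.5, as described above
cor(X)[1, 2]                                  # pairwise correlation of the 0/1 variables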

Another way is to treat each $X_i$ as a Bernoulli random variable with success probability $p_i$ and sample from it. However, this approach seems to cause loss of correlation, and instead of $\rho$, I may get $\frac{\rho}{2}$ or $\frac{\rho}{3}$.

Does anyone have any thoughts or inputs into this? Thank you.


Get this bounty!!!

#StackBounty: #correlation #variance #standard-deviation #kurtosis Correlation coefficient with standard deviation

Bounty: 50

I quite often find myself testing hypotheses in which the standard deviation of one (Normally distributed) variable is linked to (the mean of) another variable. I would like to be able to express the strength of this association by means of an index between [-1, 1], similar in spirit to a correlation coefficient. I feel like I can’t be the first one with this problem, so my first question is: does something like this exist? My second question is whether something I’ve come up with myself seems reasonable.

To express the problem more precisely, let $Z$ be a normally distributed variable:
$$
Z \sim N\left(0,\sigma^2\right)
$$
where the standard deviation $\sigma$ is a linear function of some other variables:
$$
\sigma = X\beta + \varepsilon
$$
where $X=\{x_1, x_2, \ldots, x_p\}$ is a set of predictor variables, and $\beta$ is a vector of linear coefficients on these predictors. So compared to the familiar linear model, the difference is that we now have a linear prediction for the second, rather than the first moment of the distribution of $Z$.

Given some observations of $Z$ and $X$, we can find the maximum likelihood estimate of $\beta$, which we’ll denote $\hat{\beta}$. Now the question is, how much of the ‘variance in variance’ of $Z$ is explained by this linear model? This leads me to the idea of using kurtosis. That is, because $Z$ is distributed as a mixture of Normals with different SDs and a common mean, it will be leptokurtic and thus have excess kurtosis w.r.t. a Normal distribution with constant variance. However, if we divided each observation of $Z$ by its SD (i.e. $\dot{Z}_i=\frac{Z_i}{\sigma_i}$, where $\sigma_i=X_i\beta$), we should be able to reduce its kurtosis (to the point where, if the changes in variance of $Z$ are perfectly predicted by our fitted model, we should be able to get rid of 100% of the excess kurtosis).

So the index I’m proposing (analogous to $R^2$) is:
$$
\xi^2 = 1 - \frac{\left|\text{Kurt}[Z/\hat{\sigma}]-3\right|}{\text{Kurt}[Z]-3}
$$
where $\hat{\sigma}=X\hat{\beta}$. If our model explains no “variance in variance” at all, then the kurtosis should be just as high after we transform $Z$ as before, in which case $\xi^2=0$. If we managed to explain away all the changes in variance, then $Z/\hat{\sigma}$ should be perfectly Normally distributed (with kurtosis of 3), and thus $\xi^2=1$.
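As a sanity check, here is a minimal R sketch of the proposed $\xi^2$ on simulated data (the data-generating choices and the plain moment estimator of kurtosis are mine, and $\hat{\beta}$ is simply set to the true $\beta$ rather than estimated by maximum likelihood):

set.seed(1)
n     <- 1e5
x     <- runif(n, 0.5, 2)                # a single positive predictor
beta  <- 1.5
sigma <- x * beta                        # true SD of each observation
z     <- rnorm(n, mean = 0, sd = sigma)

kurt <- function(v) mean((v - mean(v))^4) / mean((v - mean(v))^2)^2   # moment kurtosis

sigma_hat <- x * beta                    # stand-in for X %*% beta-hat
xi2 <- 1 - abs(kurt(z / sigma_hat) - 3) / (kurt(z) - 3)
xi2                                      # close to 1: the variance model explains the heteroskedasticity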

Does that seem reasonable? Did I just re-invent the wheel (or a dumb version of a wheel)?


Get this bounty!!!

#StackBounty: #r #correlation #binomial #simulation #monte-carlo How to generate a variable which values derive from four binary and pa…

Bounty: 50

Suppose I want to simulate a survey variable in R whose values derive from four binary cases $c_i$ with $i \in \{1,2,3,4\}$, each with its probability $\Pr(c_i=1)=p_i$. Let’s assume that the cases should correlate pairwise with each other in this (idealized) structure:

$\begin{matrix}
& c_1 & c_2 & c_3 & c_4\\
c_1 & 1 & -1 & 0 & 0\\
c_2 & -1 & 1 & 0 & 0\\
c_3 & 0 & 0 & 1 & -1\\
c_4 & 0 & 0 & -1 & 1
\end{matrix}$

For the simulation I want to take $n$ draws (i.e. a sample of size $n$) from the correlation structure above, where in each draw the number of cases with $c_i=1$ is counted. This should result in something like $X=(0, 2, 3, 0, 2, 0, 4, 1, 2, 0, \dots)$.

My attempt in R so far looks like the following. (I based it on this tutorial, which unfortunately does not fit my needs all the way to the end.)

set.seed(961)
mu <- rep(0, 4)
Sigma <- matrix(c(1, -1, 0, 0, 
                  -1, 1, 0, 0, 
                  0, 0, 1, -1, 
                  0, 0, -1, 1), nrow = 4, ncol = 4)
Sigma
#      [,1] [,2] [,3] [,4]
# [1,]    1   -1    0    0
# [2,]   -1    1    0    0
# [3,]    0    0    1   -1
# [4,]    0    0   -1    1

library(MASS)
rawvars <- mvrnorm(n = 1e4, mu = mu, Sigma = Sigma)

# cov(rawvars)

cor(rawvars)
#             [,1]         [,2]         [,3]         [,4]
# [1,]  1.000000000 -1.000000000 -0.006839596  0.006839597
# [2,] -1.000000000  1.000000000  0.006839597 -0.006839598
# [3,] -0.006839596  0.006839597  1.000000000 -1.000000000
# [4,]  0.006839597 -0.006839598 -1.000000000  1.000000000

pvars <- pnorm(rawvars)

Up to here I think it looks good and I seem to be on the right track. The tutorial then draws Poisson, exponential, and other data, but of all things its code seems to be flawed in the binary example.

I made an attempt myself, but I could not specify the probabilities $p_i$, so as a workaround I chose a single value $p=.3$, which seems OK as a rule of thumb:

binvars <- qbinom(pvars, 4, .3) 

I then get the following distribution:

summary(as.factor(binvars))
#    0     1     2     3     4 
# 9704 16283 10676  2994   343

hist(binvars)

[figure: histogram of binvars]

Now I’m facing three major problems. First, how should the resulting vector and its distribution look in general (e.g. with respect to the zeroes)? Second, does the attempt so far make sense at all? Third, how can I carry this approach through to the end?
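For the third problem, one possible direction (a sketch only, continuing from pvars above, with invented probabilities $p_i$) would be to binarize each column with its own $p_i$ and then sum across the four cases:

p   <- c(0.2, 0.4, 0.3, 0.5)                   # invented Pr(c_i = 1), for illustration only
bin <- sapply(1:4, function(i) qbinom(pvars[, i], size = 1, prob = p[i]))
colMeans(bin)                                  # should be close to p
cor(bin)                                       # pairwise correlations of the binary cases
X <- rowSums(bin)                              # number of c_i = 1 per draw
table(X)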

Any help is very appreciated.


Get this bounty!!!

#StackBounty: #correlation #estimation #bias #noise What is known about 2nd order estimation biases due to correlated noise between res…

Bounty: 100

I have recently run into a statistical bias in a type of analysis that I believe is somewhat common in my field, and not typically corrected for. I would like to know if this is more well-known in other fields or in statistics in general, and whether it has a standard name or solution (I have a solution but I wonder if this is the standard one, and if it is, I’d like to have a reference for it).

The bias arises when estimating the mean of, or fitting models to, data with noise that is correlated between response variables. I want to stress that I am not talking about correlated noise between measurements of the same variable, which seems to be a more widely known and discussed phenomenon (e.g. when doing a linear regression this would violate the assumption of independence between residuals).

To make it more concrete, suppose we have some data that obeys the following model:
$$
Y = F(X)+\Gamma
$$
where $Y$, $F(X)$ and $\Gamma$ are $N\times M$ matrices, where $N$ is the number of (simultaneously measured) observations and $M$ the number of response variables. $Y$ are the observations of the response variables, $F(X)$ is some function of the independent variables, and $\Gamma$ is Normally distributed noise with covariance $\Sigma\in\Re^{M\times M}$ between response variables (i.e. between columns of $\Gamma$).

The goal of the analysis is to estimate $F(X)$, which we won’t specify any further because I want to keep the statement of the problem quite general, but you could imagine this being a linear regression model, or even just a simple mean. Since $E[\Gamma]=0$, the least-squares estimate of $F(X)$ is unbiased in the first order, i.e. $E[\hat{F(X)}]=F(X)$. However, because noise is correlated between response variables, $E[\frac{1}{N}\Gamma^T\Gamma]=\Sigma\neq0$, and therefore the covariance between the estimates $\hat{F(X)}$ is not unbiased, but will instead be proportional to the noise covariance $\Sigma$:
$$
E[\hat{F(X)}^T\hat{F(X)}]-F(X)^TF(X)\propto\Sigma
$$
This is a problem if you want to test the hypothesis that $E[{F(X)}^T{F(X)}]\propto\Sigma$, which happens to be a common hypothesis in my field, because your analysis is biased to find this hypothesized relationship in the estimates of $F(X)$, even if it doesn’t exist in reality. That is, you are biased to find estimates of $F(X)$ that are similar between response variables with correlated noise, even if the true similarity structure of $F(X)$ does not mimic the noise correlation structure.

The simple solution that I’ve come up with is to divide the data into independent halves $Y^{(1)}$ & $Y^{(2)}$, compute separate estimates $\hat{F(X)}^{(1)}$ & $\hat{F(X)}^{(2)}$ from these data partitions, and then calculate the covariance between those independent estimates. Since noise is independent by construction between $Y^{(1)}$ & $Y^{(2)}$, the bias disappears, i.e. we now have $$E[{\hat{F(X)}^{(1)}}^T{\hat{F(X)}^{(2)}}]=F(X)^TF(X)$$
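To make the bias and the split-half fix concrete, here is a small R sketch (all of the specific choices are mine: $F(X)$ is taken to be a simple per-column mean of zero, and the noise covariance is an arbitrary AR(1)-like matrix):

library(MASS)
set.seed(1)
N <- 50; M <- 6
Sigma  <- 0.8 ^ abs(outer(1:M, 1:M, "-"))     # noise covariance between response variables
F_true <- rep(0, M)                           # true F(X) is zero for every response variable
Y <- mvrnorm(N, mu = F_true, Sigma = Sigma)   # N observations with correlated noise

f_hat <- colMeans(Y)                          # single estimate of F(X)
naive <- outer(f_hat, f_hat)                  # expectation proportional to Sigma, not to F^T F

f1 <- colMeans(Y[1:(N/2), ])                  # estimate from the first half
f2 <- colMeans(Y[(N/2 + 1):N, ])              # estimate from the second half
crossed <- outer(f1, f2)                      # expectation equals F(X)^T F(X) = 0 here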

Does any of this ring familiar to anyone here? I would like to know if what I’m describing is a well known issue with a standard solution (which I feel it ought to be), and would appreciate any references to literature on this.


Get this bounty!!!

#StackBounty: #correlation #covariance #independence #covariance-matrix Correlation matrix of rows, correcting for correlation matrix o…

Bounty: 50

Given a matrix of observations (rows) × variables (columns), can we compute the correlation matrix of the rows, but corrected for the correlation matrix of the columns? The goal would be to avoid inflation of the correlation $p$-values, since the t-test for Pearson correlation assumes that the columns/variables are independent.

Intuitively, this could be accomplished with a weighted correlation, where e.g. if a pair of variables are nearly perfectly correlated, they would each be down-weighted by a factor of 2.
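In code form, one possible reading of that intuition (a sketch only, not a claim that this is the standard correction, and it does not by itself fix the $p$-value issue): whiten the columns with the inverse symmetric square root of their correlation matrix before correlating the rows, so that nearly redundant variables no longer count twice.

whiten_cols_then_cor_rows <- function(A) {
  C <- cor(A)                                  # correlation matrix of the columns
  e <- eigen(C, symmetric = TRUE)
  C_inv_sqrt <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
  A_w <- scale(A) %*% C_inv_sqrt               # columns are now (approximately) uncorrelated
  cor(t(A_w))                                  # correlation matrix of the rows
}

set.seed(1)
A <- matrix(rnorm(20 * 8), nrow = 20, ncol = 8)
A[, 8] <- A[, 7] + rnorm(20, sd = 0.1)         # two nearly redundant columns
R_rows <- whiten_cols_then_cor_rows(A)         # a nearly singular cor(A) would need regularization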


Get this bounty!!!