#StackBounty: #hypothesis-testing #correlation #multiple-comparisons #spearman-rho How to compare two Spearman correlation matrices?

Bounty: 50

I have two non-parametric rank correlation matrices, emp and sim (for example, based on Spearman's $\rho$ rank correlation coefficient):

emp <- matrix(c(
1.0000000, 0.7771328, 0.6800540, 0.2741636,
0.7771328, 1.0000000, 0.5818167, 0.2933432,
0.6800540, 0.5818167, 1.0000000, 0.3432396,
0.2741636, 0.2933432, 0.3432396, 1.0000000), ncol=4)

sim <- matrix(c(
1.0000000, 0.7616454, 0.6545774, 0.3081403,
0.7616454, 1.0000000, 0.5360392, 0.3146167,
0.6545774, 0.5360392, 1.0000000, 0.3739758,
0.3081403, 0.3146167, 0.3739758, 1.0000000), ncol=4)

The emp matrix contains the correlations between the empirical values (time series); the sim matrix contains the correlations between the simulated values.

I have read the Q&A How to compare two or more correlation matrices?; in my case it is known that the empirical values do not come from a normal distribution, so I can't use Box's M test.

I need to test the null hypothesis $H_0$: matrices emp and sim are drawn from the same distribution.

Question. What test can I use? Is it possible to use the Wishart statistic?
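(For reference, the Jennrich (1970) chi-squared test listed in Edit 4 below is implemented in the psych package, although, like Box's M test, it is derived under normality, which is exactly my concern. A minimal sketch, where n1 and n2 are placeholders for the actual numbers of observations behind emp and sim:)

library(psych)

n1 <- 200; n2 <- 200   # placeholders for the real sample sizes
cortest.jennrich(emp, sim, n1 = n1, n2 = n2)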

Edit.
Following Stephan Kolassa's comment, I have run a simulation.

I have compared the two Spearman correlation matrices emp and sim with Box's M test. The test returned

# Chi-squared statistic = 2.6163, p-value = 0.9891

Then I simulated the correlation matrix sim 1000 times and plotted the distribution of the chi-squared statistic $M(1-c)\sim\chi^2(df)$.

[Figure: histogram of the 1000 simulated chi-squared statistics $(M(1-c))_i$]
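In outline, the simulation looks like this (simulate_sim_matrix() and boxM_chisq() are placeholders for my actual simulation model and for the Box's M chi-squared computation, which I have not reproduced here):

# placeholders: simulate_sim_matrix() regenerates the simulated series and returns their
# Spearman correlation matrix; boxM_chisq() returns the M(1 - c) chi-squared statistic
n.sim <- 1000
dfr <- data.frame(stat = replicate(n.sim, boxM_chisq(emp, simulate_sim_matrix())))

hist(dfr$stat, breaks = 30, xlab = "M(1 - c)", main = "Simulated chi-squared statistics")
abline(v = 2.6163, col = "green")   # the observed statistic for emp vs sim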

After that I computed the 5% quantile of the chi-squared statistic $M(1-c)\sim\chi^2(df)$. The 5% quantile equals

quantile(dfr$stat, probs = 0.05)
#       5% 
# 1.505046

One can see that the 5% quantile is less than the observed chi-squared statistic: 1.505046 < 2.6163 (blue line on the figure); therefore, my emp statistic $M(1-c)$ does not fall in the left tail of the $(M(1-c))_i$.

Edit 2.
Following Stephan Kolassa's second comment, I have calculated the 95% quantile of the chi-squared statistic $M(1-c)\sim\chi^2(df)$ (blue line on the figure). The 95% quantile equals

quantile(dfr$stat, probs = 0.95)
#      95% 
# 7.362071

One can see that the emp statistic $M(1-c)$ does not fall in the right tail of the $(M(1-c))_i$ either.

Edit 3. I have calculated the $p$-value (green line on the figure) from the empirical cumulative distribution function:

ecdf(dfr$stat)(2.6163)
[1] 0.239

One can see that $p$-value=0.239 is greater than $0.05$.

Edit 4.

Dominik Wied (2014). A Nonparametric Test for a Constant Correlation Matrix. Econometric Reviews. DOI: 10.1080/07474938.2014.998152

Joël Bun, Jean-Philippe Bouchaud and Mark Potters (2016). Cleaning correlation matrices. Risk.net, April 2016.

Li, David X. (1999). On Default Correlation: A Copula Function Approach. Available at SSRN: https://ssrn.com/abstract=187289 or http://dx.doi.org/10.2139/ssrn.187289

G. E. P. Box (1949). A General Distribution Theory for a Class of Likelihood Criteria. Biometrika, 36(3/4), 317-346.

M. S. Bartlett (1937). Properties of Sufficiency and Statistical Tests. Proc. R. Soc. Lond. A, 160, 268-282.

Robert I. Jennrich (1970). An Asymptotic χ2 Test for the Equality of Two Correlation Matrices. Journal of the American Statistical Association, 65(330), 904-912.

Edit 5.

The first paper I have found that does not assume a normal distribution:

Reza Modarres & Robert W. Jernigan (1993). A robust test for comparing correlation matrices. Journal of Statistical Computation and Simulation, 46(3-4), 169-181.


Get this bounty!!!


#StackBounty: #correlation #mathematical-statistics #simulation #copula If I have a vector of $N$ correlated probabilities. How can I t…

Bounty: 300

My ultimate goal is to have a way to generate a vector of $N$ correlated Bernoulli random variables. One way I am doing this is to use the Gaussian copula approach. However, the Gaussian copula approach just leaves me with a vector:

$$
(p_1, \ldots, p_N) \in [0,1]^N
$$

Suppose that I have generated $(p_1, \ldots, p_N)$ such that the common correlation between them is $\rho$. Now, how can I transform these into a new vector of $0$s and $1$s? In other words, I would like:

$$
(X_1, \ldots, X_N) \in \{0,1\}^N
$$

but with the same correlation $\rho$.

One approach I thought of was to assign a hard cutoff rule such that if $p_i < 0.5$, then let $X_i = 0$, and if $p_i \geq 0.5$, then let $X_i = 1$.

This seems to work well in simulations, in that it retains the correlation structure, but the choice of $0.5$ as the cutoff value seems very arbitrary to me.

Another way is to treat each $X_i$ as a Bernoulli random variable with success probability $p_i$ and sample from it. However, this approach seems to lose correlation: instead of $\rho$, I may get $\frac{\rho}{2}$ or $\frac{\rho}{3}$.
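For concreteness, here is a minimal sketch of the two approaches described above (the target correlation of 0.5, the vector length and the number of replications are my own illustrative choices):

library(MASS)

set.seed(1)
N <- 5                                    # length of the Bernoulli vector
rho <- 0.5                                # common target correlation (illustrative)
Sigma <- matrix(rho, N, N); diag(Sigma) <- 1

# Gaussian copula step: correlated normals -> correlated uniforms p_i
n.rep <- 1e5
Z <- mvrnorm(n.rep, mu = rep(0, N), Sigma = Sigma)
P <- pnorm(Z)                             # each column ~ U(0,1), columns are correlated

# Approach 1: hard cutoff at 0.5
X.cut <- (P >= 0.5) * 1
round(cor(X.cut), 2)                      # compare the achieved pairwise correlations with rho

# Approach 2: treat each p_i as a success probability and sample
X.samp <- matrix(rbinom(length(P), 1, P), nrow = n.rep)
round(cor(X.samp), 2)                     # the extra randomness attenuates the correlation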

Does anyone have any thoughts or inputs into this? Thank you.


Get this bounty!!!

#StackBounty: #correlation #variance #standard-deviation #kurtosis Correlation coefficient with standard deviation

Bounty: 50

I quite often find myself testing hypotheses in which the standard deviation of one (Normally distributed) variable is linked to (the mean of) another variable. I would like to be able to express the strength of this association by means of an index between [-1, 1], similar in spirit to a correlation coefficient. I feel like I can’t be the first one with this problem, so my first question is: does something like this exist? My second question is whether something I’ve come up with myself seems reasonable.

To express the problem more precisely, let $Z$ be a normally distributed variable:
$$
Z \sim N\left(0,\sigma^2\right)
$$
where the standard deviation $\sigma$ is a linear function of some other variables:
$$
\sigma = X\beta + \varepsilon
$$
where $X=\{x_1, x_2, \ldots, x_p\}$ is a set of predictor variables, and $\beta$ is a vector of linear coefficients on these predictors. So compared to the familiar linear model, the difference is that we now have a linear prediction for the second, rather than the first, moment of the distribution of $Z$.

Given some observations of $Z$ and $X$, we can find the maximum likelihood estimate of $\beta$, which we'll denote $\hat{\beta}$. Now the question is, how much of the 'variance in variance' of $Z$ is explained by this linear model? This leads me to the idea of using kurtosis. That is, because $Z$ is distributed as a mixture of Normals with different SDs and a common mean, it will be leptokurtic and thus have excess kurtosis w.r.t. a Normal distribution with constant variance. However, if we divided each observation of $Z$ by its SD (i.e. $\dot{Z}_i=\frac{Z_i}{\sigma_i}$, where $\sigma_i=X_i\beta$), we should be able to reduce its kurtosis (to the point where, if the changes in variance of $Z$ are perfectly predicted by our fitted model, we should be able to get rid of 100% of the excess kurtosis).

So the index I’m proposing (analogous to $R^2$) is:
$$
\xi^2 = 1 - \frac{\left|\text{Kurt}[Z/\hat{\sigma}]-3\right|}{\text{Kurt}[Z]-3}
$$
where $\hat{\sigma}=X\hat{\beta}$. If our model explains no "variance in variance" at all, then the kurtosis should be just as high after we transform $Z$ as before, in which case $\xi^2=0$. If we managed to explain away all the changes in variance, then $Z/\hat{\sigma}$ should be perfectly Normally distributed (with kurtosis of 3), and thus $\xi^2=1$.
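To illustrate on simulated data, here is a minimal sketch of the proposed index (the single-predictor setup, the ML fit via optim, and the plain moment-based kurtosis are my own choices for illustration):

set.seed(42)
n <- 5e4
x <- runif(n, 1, 3)                        # a single predictor (illustrative)
sigma.true <- 0.5 + 1.5 * x                # true SD is linear in x
z <- rnorm(n, 0, sigma.true)

# ML fit of sigma = b0 + b1 * x (no error term, for simplicity)
negll <- function(b) -sum(dnorm(z, 0, pmax(b[1] + b[2] * x, 1e-6), log = TRUE))
b.hat <- optim(c(1, 1), negll)$par
sigma.hat <- b.hat[1] + b.hat[2] * x

# plain moment-based kurtosis (not excess kurtosis)
kurt <- function(v) mean((v - mean(v))^4) / mean((v - mean(v))^2)^2

xi2 <- 1 - abs(kurt(z / sigma.hat) - 3) / (kurt(z) - 3)
xi2   # close to 1 here, since the model for sigma is correctly specified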

Does that seem reasonable? Did I just re-invent the wheel (or a dumb version of a wheel)?


Get this bounty!!!

#StackBounty: #r #correlation #binomial #simulation #monte-carlo How to generate a variable which values derive from four binary and pa…

Bounty: 50

Suppose I want to simulate a survey variable in R whose values derive from four binary cases $c_i$ with $i \in \{1,2,3,4\}$, each with its probability $\Pr(c_i=1)=p_i$. Let's assume that the cases should correlate pairwise with each other in this (idealized) structure:

$\begin{matrix}
 & c_1 & c_2 & c_3 & c_4\\
c_1 & 1 & -1 & 0 & 0\\
c_2 & -1 & 1 & 0 & 0\\
c_3 & 0 & 0 & 1 & -1\\
c_4 & 0 & 0 & -1 & 1
\end{matrix}$

For the simulation I want to take $n$ draws (i.e. a sample of size $n$) from the correlation structure above, where in each draw the number of cases with $c_i=1$ is added up. This should result in something like $X=(0, 2, 3, 0, 2, 0, 4, 1, 2, 0, \dots)$.

My attempt in R so far looks like the following. (I based it on this tutorial; unfortunately it does not fit my needs all the way to the end.)

set.seed(961)
mu <- rep(0, 4)
Sigma <- matrix(c(1, -1, 0, 0, 
                  -1, 1, 0, 0, 
                  0, 0, 1, -1, 
                  0, 0, -1, 1), nrow = 4, ncol = 4)
Sigma
#      [,1] [,2] [,3] [,4]
# [1,]    1   -1    0    0
# [2,]   -1    1    0    0
# [3,]    0    0    1   -1
# [4,]    0    0   -1    1

library(MASS)
rawvars <- mvrnorm(n = 1e4, mu = mu, Sigma = Sigma)

# cov(rawvars)

cor(rawvars)
#             [,1]         [,2]         [,3]         [,4]
# [1,]  1.000000000 -1.000000000 -0.006839596  0.006839597
# [2,] -1.000000000  1.000000000  0.006839597 -0.006839598
# [3,] -0.006839596  0.006839597  1.000000000 -1.000000000
# [4,]  0.006839597 -0.006839598 -1.000000000  1.000000000

pvars <- pnorm(rawvars)

Up to this point it looks good and I seem to be on the right track. From here on the author draws Poisson, exponential, and other data, and of all things his code seems to be flawed in the binary example.

I made an attempt myself, but I could not specify the probabilities $p_i$, so as a workaround I chose a single value $p=.3$, which seems OK as a rule of thumb. It looks like this:

binvars <- qbinom(pvars, 4, .3) 

I then get the following distribution:

summary(as.factor(binvars))
#    0     1     2     3     4 
# 9704 16283 10676  2994   343

hist(binvars)

[Figure: histogram of binvars]
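For what it's worth, here is a minimal sketch of how I imagine the individual probabilities $p_i$ could be applied column by column instead of the single $p=.3$ (the values of $p_i$ are made up for illustration, and I am not sure this is the right construction):

p <- c(0.3, 0.5, 0.2, 0.4)   # assumed case probabilities p_i (made up)

# turn each column of correlated uniforms into a Bernoulli draw with its own p_i
binvars2 <- sapply(seq_along(p), function(i) qbinom(pvars[, i], 1, p[i]))

colMeans(binvars2)           # should be close to p
round(cor(binvars2), 2)      # induced pairwise correlations of the binary cases

X <- rowSums(binvars2)       # number of c_i = 1 per draw, as described above
table(X)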

Now I'm facing three major problems. First, how should the resulting vector, or rather its distribution, look in general (e.g. with respect to the zeroes)? Second, does the attempt so far make sense at all? Third, how can I carry this through to the end?

Any help is very much appreciated.


Get this bounty!!!

#StackBounty: #correlation #estimation #bias #noise What is known about 2nd order estimation biases due to correlated noise between res…

Bounty: 100

I have recently run into a statistical bias in a type of analysis that I believe is somewhat common in my field, and not typically corrected for. I would like to know if this is more well-known in other fields or in statistics in general, and whether it has a standard name or solution (I have a solution but I wonder if this is the standard one, and if it is, I’d like to have a reference for it).

The bias arises when estimating the mean of, or fitting models to, data with noise that is correlated between response variables. I want to stress that I am not talking about correlated noise between measurements of the same variable, which seems to be a more widely known and discussed phenomenon (e.g. when doing a linear regression this would violate the assumption of independence between residuals).

To make it more concrete, suppose we have some data that obeys the following model:
$$
Y = F(X)+\Gamma
$$
where $Y$, $F(X)$ and $\Gamma$ are $N\times M$ matrices, $N$ is the number of (simultaneously measured) observations and $M$ the number of response variables. $Y$ contains the observations of the response variables, $F(X)$ is some function of the independent variables, and $\Gamma$ is Normally distributed noise with covariance $\Sigma\in\Re^{M\times M}$ between response variables (i.e. between columns of $\Gamma$).

The goal of the analysis is to estimate $F(X)$, which we won't specify any further because I want to keep the statement of the problem quite general, but you could imagine this being a linear regression model, or even just a simple mean. Since $E[\Gamma]=0$, the least-squares estimate of $F(X)$ is unbiased to first order, i.e. $E[\hat{F(X)}]=F(X)$. However, because the noise is correlated between response variables, $E[\frac{1}{N}\Gamma^T\Gamma]=\Sigma\neq0$, and therefore the covariance between the estimates $\hat{F(X)}$ is not unbiased, but will instead be proportional to the noise covariance $\Sigma$:
$$
E[\hat{F(X)}^T\hat{F(X)}]-F(X)^TF(X)\propto\Sigma
$$
This is a problem if you want to test the hypothesis that $E[{F(X)}^T{F(X)}]\propto\Sigma$, which happens to be a common hypothesis in my field, because your analysis is biased to find this hypothesized relationship in the estimates of $F(X)$, even if it doesn't exist in reality. That is, you are biased to find estimates of $F(X)$ that are similar between response variables with correlated noise, even if the true similarity structure of $F(X)$ does not mimic the noise correlation structure.

The simple solution that I've come up with is to divide the data into independent halves $Y^{(1)}$ and $Y^{(2)}$, compute separate estimates $\hat{F(X)}^{(1)}$ and $\hat{F(X)}^{(2)}$ from these data partitions, and then calculate the covariance between those independent estimates. Since the noise is independent by construction between $Y^{(1)}$ and $Y^{(2)}$, the bias disappears, i.e. we now have $$E[{\hat{F(X)}^{(1)}}^T\hat{F(X)}^{(2)}]=F(X)^TF(X)$$
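As a sanity check of the argument above, here is a minimal simulation sketch for the simplest case in which $F(X)$ is just a vector of column means (the dimensions, the noise covariance, and the use of column means are illustrative choices of mine):

library(MASS)

set.seed(7)
N <- 20; M <- 3; n.rep <- 5000
Sigma <- matrix(0.8, M, M); diag(Sigma) <- 1   # noise covariance between response variables
FX <- rep(0, M)                                # the true "signal" is zero for every response

est.same <- matrix(NA, n.rep, M)               # estimates from the full data
est.h1   <- matrix(NA, n.rep, M)               # estimates from the two halves
est.h2   <- matrix(NA, n.rep, M)

for (r in 1:n.rep) {
  Y <- matrix(FX, N, M, byrow = TRUE) + mvrnorm(N, rep(0, M), Sigma)
  est.same[r, ] <- colMeans(Y)
  est.h1[r, ]   <- colMeans(Y[1:(N/2), ])
  est.h2[r, ]   <- colMeans(Y[(N/2 + 1):N, ])
}

round(cov(est.same) * N, 2)          # biased: roughly proportional to Sigma although FX = 0
round(cov(est.h1, est.h2) * N, 2)    # split-half cross-covariance: approximately zero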

Does any of this sound familiar to anyone here? I would like to know whether what I'm describing is a well-known issue with a standard solution (which I feel it ought to be), and would appreciate any references to literature on this.


Get this bounty!!!

#StackBounty: #correlation #covariance #independence #covariance-matrix Correlation matrix of rows, correcting for correlation matrix o…

Bounty: 50

Given a matrix of observations (rows) × variables (columns), can we compute the correlation matrix of the rows, but corrected for the correlation matrix of the columns? The goal would be to avoid inflation of the correlation p-values, since the t-test for Pearson correlation assumes that the columns/variables are independent.

Intuitively, this could be accomplished with a weighted correlation, where e.g. if a pair of variables are nearly perfectly correlated, they would each be down-weighted by a factor of 2.
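For illustration, here is a minimal sketch of that down-weighting idea using a weighted correlation (the redundancy-counting rule and the 0.95 threshold are my own ad hoc choices, not an established correction):

set.seed(3)
mat <- matrix(rnorm(10 * 6), nrow = 10)   # 10 observations x 6 variables (toy data)
mat[, 6] <- mat[, 5]                      # make two variables identical

# weight each variable by 1 / (number of variables it is almost a copy of, itself included)
col.cor <- cor(mat)
w <- 1 / rowSums(abs(col.cor) > 0.95)     # the two duplicated variables each get weight 1/2
w <- w / sum(w)

# weighted correlation of the ROWS: transpose so that variables act as weighted observations
row.cor.weighted <- cov.wt(t(mat), wt = w, cor = TRUE)$cor
row.cor.plain    <- cor(t(mat))           # unweighted version, for comparison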


Get this bounty!!!

#StackBounty: #r #time-series #correlation #wavelet #rollapply Wavelet correlation using rolling window in R

Bounty: 200

I have 3 time series to which I can apply the wavelet transform using a rolling window. The rolling window takes a single time series of length 200 and applies the waveslim::modwt function to it over the first 30 samples. This outputs 5 lists, of which I am only interested in four (d1, d2, d3, d4), each of length 30. A simple example can be found here:

library(waveslim)
J <- 4 #no. of levels in decomposition
data(ar1)
ar1.modwt <- modwt(ar1, "la8", J)

@G. Grothendieck has kindly provided a neat piece of code for the rolling window approach for a single time series here.

The rolling window increments by 1 and we go again, producing another 5 lists (of which I only care about d1 to d4), and so on until the full length of the time series has been rolled over.

The next step is to apply the waveslim::brick.wall function to the output of the rolling window lists. The brick.wall function looks at the output of modwt for the first window across the 4 levels and replaces some of the values with NAs.

I believe I have covered this by modifying @G. Grothendieck's answer as follows; I hope I am right:

library(zoo)   # rollapplyr comes from zoo

modwt2 <- function(x, wf, ...) unlist(head(brick.wall(modwt(x, wf, ...), wf), 4))
rollr <- rollapplyr(ar1, 30, FUN = modwt2, wf = "la8", n.levels = 4, boundary = "periodic")
L <- lapply(1:nrow(rollr), function(i) matrix(rollr[i, ], ncol = 4))

The final piece is to construct correlation matrices, over the 4 levels of interest, for the outputs of the brick.wall function (L above).

There is a function called waveslim::wave.correlation, which takes two brick.wall outputs X and Y and computes the wavelet correlation over the various levels.

library(waveslim)
data(exchange)
returns <- diff(log(as.matrix(exchange)))
returns <- ts(returns, start=1970, freq=12)
wf <- "la8"
J <- 4
demusd.modwt <- modwt(returns[,"DEM.USD"], wf, J)
demusd.modwt.bw <- brick.wall(demusd.modwt, wf)
jpyusd.modwt <- modwt(returns[,"JPY.USD"], wf, J)
jpyusd.modwt.bw <- brick.wall(jpyusd.modwt, wf)
returns.modwt.cor <- wave.correlation(demusd.modwt.bw, jpyusd.modwt.bw,
                                      N = dim(returns)[1])

I wish to expand on this and compute the full correlation matrix for my 3 time series. Please note that the exchange-rate example above does not use the rolling window approach, as it uses the full length of the time series (which is what I would now like to do), and it only produces a single value for the correlation between two time series at each level. It does not construct the full correlation matrix, which I need because I am interested in the eigenvalues of these correlation matrices over time.
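To make the goal concrete, here is a minimal sketch of what I mean by the full 3×3 matrix per level, using three toy series and the full-length (non-rolling) transform; plain cor() on the boundary-corrected coefficients is my own stand-in here, not necessarily the proper wavelet correlation estimator:

library(waveslim)

set.seed(11)
wf <- "la8"; J <- 4
series <- replicate(3, cumsum(rnorm(200)), simplify = FALSE)  # three toy time series

# modwt + brick.wall for each series
bw <- lapply(series, function(x) brick.wall(modwt(x, wf, J), wf))

# 3x3 correlation matrix at each level d1..d4
cor.by.level <- lapply(paste0("d", 1:J), function(lev) {
  coefs <- sapply(bw, `[[`, lev)                 # 200 x 3 matrix of level coefficients
  cor(coefs, use = "pairwise.complete.obs")      # NAs from brick.wall are dropped pairwise
})
names(cor.by.level) <- paste0("d", 1:J)

eigen(cor.by.level$d1)$values                    # e.g. eigenvalues at level d1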

I hope this makes sense because it has caused me a lot of heartache over the last few weeks and days.

So in summary:

  1. Take 3 time series
  2. Apply modwt function using rolling window
  3. Apply brick.wall function to each output of the rolling window in 2 above
  4. Create full 3×3 correlation matrix for the 4 levels using outputs of 3 above over time

Many thanks.


Get this bounty!!!

#StackBounty: #regression #correlation #p-value #assumptions Difference between the assumptions underlying a correlation and a regressi…

Bounty: 50

My question grew out of a discussion with @whuber in the comments of a different question.

Specifically, @whuber's comment was as follows:

One reason it might surprise you is that the assumptions underlying a correlation test and a regression slope test are different–so even when we understand that the correlation and slope are really measuring the same thing, why should their p-values be the same? That shows how these issues go deeper than simply whether $r$ and $\beta$ should be numerically equal.

This got me thinking about it, and I came across a variety of interesting answers. For example, I found the question "Assumptions of correlation coefficient", but I can't see how it would clarify the comment above.

I found more interesting answers about the relationship of Pearson's $r$ and the slope $\beta$ in a simple linear regression (see here and here for example), but none of them seem to answer what @whuber was referring to in his comment (at least it is not apparent to me).

Question 1: What are the assumptions underlying a correlation test and a regression slope test?

For my 2nd question consider the following outputs in R:

model <- lm(Employed ~ Population, data = longley)
summary(model)

Call:
lm(formula = Employed ~ Population, data = longley)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.4362 -0.9740  0.2021  0.5531  1.9048 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   8.3807     4.4224   1.895   0.0789 .  
Population    0.4849     0.0376  12.896 3.69e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.013 on 14 degrees of freedom
Multiple R-squared:  0.9224,    Adjusted R-squared:  0.9168 
F-statistic: 166.3 on 1 and 14 DF,  p-value: 3.693e-09

And the output of the cor.test() function:

with(longley, cor.test(Population, Employed))

    Pearson's product-moment correlation

data:  Population and Employed
t = 12.8956, df = 14, p-value = 3.693e-09
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8869236 0.9864676
sample estimates:
      cor 
0.9603906 

As can be seen from the lm() and cor.test() output, the Pearson correlation coefficient $r$ and the slope estimate ($\beta_1$) are quite different, 0.96 vs. 0.485 respectively, but the t-value and the p-value are the same.

Then I also tried to see whether I could calculate the t-value for $r$ and $\beta_1$ myself, since they are the same despite $r$ and $\beta_1$ being different. And that's where I get stuck, at least for $r$:

Calculate the slope ($\beta_1$) in a simple linear regression using the total sums of squares of $x$ and $y$:

x <- longley$Population; y <- longley$Employed
xbar <- mean(x); ybar <- mean(y)
ss.x <- sum((x-xbar)^2)
ss.y <- sum((y-ybar)^2)
ss.xy <- sum((x-xbar)*(y-ybar))

Calculate the least-squares estimate of the regression slope, $\beta_{1}$ (there is a proof of this in Crawley's R Book, 1st edition, page 393):

b1 <- ss.xy/ss.x                        
b1
# [1] 0.4848781

Calculate the standard error for $\beta_1$:

ss.residual <- sum((y-model$fitted)^2)
n <- length(x) # SAMPLE SIZE
k <- length(model$coef) # NUMBER OF MODEL PARAMETER (i.e. b0 and b1)
df.residual <- n-k
ms.residual <- ss.residual/df.residual # RESIDUAL MEAN SQUARE
se.b1 <- sqrt(ms.residual/ss.x)
se.b1
# [1] 0.03760029

And the t-value and p-value for $\beta_1$:

t.b1 <- b1/se.b1
p.b1 <- 2*pt(-abs(t.b1), df=n-2)
t.b1
# [1] 12.89559
p.b1
# [1] 3.693245e-09

What I don't know at this point (and this is Question 2) is how to calculate the same t-value using $r$ instead of $\beta_1$ (perhaps in baby steps).

Since cor.test()'s alternative hypothesis is that the true correlation is not equal to 0 (see the cor.test() output above), I would expect something like the Pearson correlation coefficient $r$ divided by the "standard error of the Pearson correlation coefficient" (similar to b1/se.b1 above). But what would that standard error be, and why?

Maybe this has something to do with the aforementioned assumptions underlying a correlation test and a regression slope test?!
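For reference, here is a quick numerical check of the kind of expression I have in mind, using the commonly quoted standard error $\sqrt{(1-r^2)/(n-2)}$ (which I have not derived myself; the 'why' is part of the question):

r <- with(longley, cor(Population, Employed))
n <- nrow(longley)

se.r <- sqrt((1 - r^2) / (n - 2))   # commonly quoted standard error of r under H0: rho = 0
t.r  <- r / se.r
t.r                                 # should reproduce 12.89559, i.e. t.b1 and the cor.test() statistic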


Get this bounty!!!

#StackBounty: #hypothesis-testing #correlation #statistical-significance #experiment-design #binary-data How to determine a 'strong…

Bounty: 50

I have a set of drivers that are binary, and a concept to measure that takes integer values between 1 and 10.

I'm currently using Kruskal's key driver analysis to determine the relative contribution of each of the drivers. It is discussed as being more robust than Pearson's correlation because it takes into consideration the complete set of drivers and their relative contributions.

However, is Kruskal's approach still valid when the drivers are binary and the concept to measure takes integer values between 1 and 10? I thought about switching to the point-biserial correlation, but this is identical to Pearson's $r$.

My question is: where do I set the threshold between a 'good' driver and a 'not so good' driver? It depends on the size of the data and also on its properties. Calculating significance using t-tests (ignoring the fact that the data may not meet the t-test's assumptions; the test is bundled with scipy's pearsonr) marks all of the drivers as significant, as they usually will be, because even weak drivers have some correlation and aren't 'random'. So do I require the 'strong' drivers to have a very low p-value, which seems kind of arbitrary? Or is there a better algorithm that can distinguish between strong and weak drivers?

Or is it that no algorithm can really determine what a strong driver is? Does it depend on other factors relating to the context of the data being analysed?


Get this bounty!!!