#StackBounty: #r #time-series #correlation #wavelet #rollapply Wavelet correlation using rolling window in R

Bounty: 200

I have 3 time series to which I can apply the wavelet transform using a rolling window. The rolling window takes a single time series of length 200 and applies the waveslim::modwt function to its first 30 samples. This outputs a list with 5 components, of which I am only interested in the first four (d1, d2, d3, d4); each has length 30. A simple example:

library(waveslim)
J <- 4 #no. of levels in decomposition
data(ar1)
ar1.modwt <- modwt(ar1, "la8", J)

@G. Grothendieck has kindly provided a neat piece of code for the rolling-window approach on a single time series here.

The rolling window then advances by 1 sample and the decomposition is repeated, producing another 5 lists of which I only care about d1–d4, and so on until the full length of the time series has been rolled over.

The next step is to apply the waveslim::brick.wall function to each of the rolling-window outputs. For a given window, brick.wall takes the modwt output across the 4 levels and replaces the boundary-affected coefficients with NAs.
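To see the effect concretely, here is a small toy illustration of my own (using the ar1 data from above and a single 30-sample window, as in my setup):

library(waveslim)
data(ar1)

# decompose one 30-sample window to 4 levels, then blank out the
# boundary-affected coefficients at the start of each level
w    <- modwt(ar1[1:30], "la8", n.levels = 4)
w.bw <- brick.wall(w, "la8")

# how many coefficients per level were replaced by NA
sapply(w.bw[1:4], function(d) sum(is.na(d)))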

I believe I have covered this by modifying @G. Grothendieck's answer as follows (I hope I am right):

library(zoo)  # for rollapplyr

# apply modwt to each window, blank the boundary coefficients, keep only d1-d4
modwt2 <- function(x, wf, ...) unlist(head(brick.wall(modwt(x, wf, ...), wf), 4))
rollr <- rollapplyr(ar1, 30, FUN = modwt2, wf = "la8", n.levels = 4, boundary = "periodic")
L <- lapply(1:nrow(rollr), function(i) matrix(rollr[i, ], , 4))  # one 30 x 4 matrix per window

The final piece is to construct, at each of the 4 levels of interest, correlation matrices from the brick.wall outputs (L above).

There is a function, waveslim::wave.correlation, which takes two brick.wall outputs X and Y and computes the wavelet correlation at each level.

library(waveslim)
data(exchange)
returns <- diff(log(as.matrix(exchange)))
returns <- ts(returns, start=1970, freq=12)
wf <- "la8"
J <- 4
demusd.modwt <- modwt(returns[,"DEM.USD"], wf, J)
demusd.modwt.bw <- brick.wall(demusd.modwt, wf)
jpyusd.modwt <- modwt(returns[,"JPY.USD"], wf, J)
jpyusd.modwt.bw <- brick.wall(jpyusd.modwt, wf)
returns.modwt.cor <- wave.correlation(demusd.modwt.bw, jpyusd.modwt.bw,
                                      N = dim(returns)[1])

I wish to expand on this and compute the full correlation matrix for my 3 time series. Note that the exchange-rate example above does not use the rolling-window approach (it uses the full length of the time series), whereas applying the rolling window is exactly what I now want to do. It also produces only a single correlation value per level for a pair of series, not the full correlation matrix that I need; I am interested in the eigenvalues of these correlation matrices over time.
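To make the end goal concrete, here is a rough sketch of the kind of computation I have in mind (my own attempt, not necessarily correct; it uses three made-up series and a longer 128-sample window purely so that, after brick.wall, every level still has some non-NA coefficients to correlate):

library(waveslim)
library(zoo)

set.seed(1)
x <- zoo(replicate(3, cumsum(rnorm(200))))   # three toy series of length 200
colnames(x) <- c("x1", "x2", "x3")

win <- 128
J   <- 4
wf  <- "la8"

# modwt + brick.wall for one series in one window, keeping only d1-d4
bw.modwt <- function(v) brick.wall(modwt(v, wf, n.levels = J), wf)[1:J]

# for one window (a win x 3 matrix): build a 3 x 3 correlation matrix
# per level from the non-NA coefficients and return its eigenvalues
eig.by.level <- function(w) {
  coeffs <- lapply(1:ncol(w), function(k) bw.modwt(as.numeric(w[, k])))
  sapply(1:J, function(j) {
    cj <- sapply(coeffs, function(s) s[[j]])       # win x 3 level-j coefficients
    cm <- cor(cj, use = "pairwise.complete.obs")   # 3 x 3 correlation matrix
    eigen(cm, symmetric = TRUE, only.values = TRUE)$values
  })
}

# roll the window over all three series; each row holds the 3
# eigenvalues for each of the 4 levels (12 values per window)
res <- rollapplyr(x, win, by.column = FALSE,
                  FUN = function(w) c(eig.by.level(w)))
dim(res)   # (no. of windows) x 12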

I hope this makes sense, because it has caused me a lot of heartache over the last few weeks.

So in summary:

  1. Take 3 time series
  2. Apply modwt function using rolling window
  3. Apply brick.wall function to each output of the rolling window in 2 above
  4. Construct the full 3×3 correlation matrix at each of the 4 levels, over time, from the outputs of step 3 above

Many thanks.


Get this bounty!!!

#StackBounty: #regression #correlation #p-value #assumptions Difference between the assumptions underlying a correlation and a regressi…

Bounty: 50

My question grew out of a discussion with @whuber in the comments of a different question.

Specifically, @whuber’s comment was as follows:

One reason it might surprise you is that the assumptions underlying a correlation test and a regression slope test are different–so even when we understand that the correlation and slope are really measuring the same thing, why should their p-values be the same? That shows how these issues go deeper than simply whether $r$ and $\beta$ should be numerically equal.

This got me thinking about it, and I came across a variety of interesting answers. For example, I found the question “Assumptions of correlation coefficient” but can’t see how it clarifies the comment above.

I found more interesting answers about the relationship of Pearson’s $r$ and the slope $\beta$ in a simple linear regression (see here and here, for example), but none of them seem to answer what @whuber was referring to in his comment (at least, it is not apparent to me).

Question 1: What are the assumptions underlying a correlation test and a regression slope test?

For my 2nd question, consider the following outputs in R:

model <- lm(Employed ~ Population, data = longley)
summary(model)

Call:
lm(formula = Employed ~ Population, data = longley)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.4362 -0.9740  0.2021  0.5531  1.9048 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   8.3807     4.4224   1.895   0.0789 .  
Population    0.4849     0.0376  12.896 3.69e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.013 on 14 degrees of freedom
Multiple R-squared:  0.9224,    Adjusted R-squared:  0.9168 
F-statistic: 166.3 on 1 and 14 DF,  p-value: 3.693e-09

And the output of the cor.test() function:

with(longley, cor.test(Population, Employed))

    Pearson's product-moment correlation

data:  Population and Employed
t = 12.8956, df = 14, p-value = 3.693e-09
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8869236 0.9864676
sample estimates:
      cor 
0.9603906 

As can be seen from the lm() and cor.test() output, Pearson’s correlation coefficient $r$ and the slope estimate ($\beta_1$) are quite different, 0.96 vs. 0.485, respectively, but the t-values and the p-values are the same.

Then I also tried to see whether I could calculate the t-value for $r$ and for $\beta_1$ myself, since it is the same for both despite $r$ and $\beta_1$ being different. And that’s where I get stuck, at least for $r$:

Calculate the slope ($\beta_1$) in a simple linear regression using the total sums of squares of $x$ and $y$:

x <- longley$Population; y <- longley$Employed
xbar <- mean(x); ybar <- mean(y)
ss.x <- sum((x-xbar)^2)
ss.y <- sum((y-ybar)^2)
ss.xy <- sum((x-xbar)*(y-ybar))

Calculate the least-squares estimate of the regression slope, $\beta_{1}$ (there is a proof of this in Crawley’s R Book 1st edition, page 393):

b1 <- ss.xy/ss.x                        
b1
# [1] 0.4848781

Calculate the standard error for $\beta_1$:

ss.residual <- sum((y-model$fitted)^2)
n <- length(x) # SAMPLE SIZE
k <- length(model$coef) # NUMBER OF MODEL PARAMETERS (i.e. b0 and b1)
df.residual <- n-k
ms.residual <- ss.residual/df.residual # RESIDUAL MEAN SQUARE
se.b1 <- sqrt(ms.residual/ss.x)
se.b1
# [1] 0.03760029

And the t-value and p-value for $\beta_1$:

t.b1 <- b1/se.b1
p.b1 <- 2*pt(-abs(t.b1), df=n-2)
t.b1
# [1] 12.89559
p.b1
# [1] 3.693245e-09

What I don’t know at this point, and this is Question 2, is how to calculate the same t-value using $r$ instead of $\beta_1$ (perhaps in baby steps)?

Since cor.test()’s alternative hypothesis is that the true correlation is not equal to 0 (see the cor.test() output above), I would expect something like the Pearson correlation coefficient $r$ divided by the “standard error of the Pearson correlation coefficient” (similar to b1/se.b1 above). But what would that standard error be, and why?
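For instance, I tried a quick numerical check of one candidate standard error (this is just a guess at a formula; I cannot justify it, which is really the heart of my question):

r <- with(longley, cor(Population, Employed))
n <- nrow(longley)

# candidate standard error for r under the null of zero correlation
se.r <- sqrt((1 - r^2) / (n - 2))
t.r  <- r / se.r
t.r
# should match t.b1 (12.89559) and the cor.test() t-value above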

Maybe this has something to do with the aforementioned assumptions underlying a correlation test and a regression slope test?!


Get this bounty!!!

#StackBounty: #hypothesis-testing #correlation #statistical-significance #experiment-design #binary-data How to determine a 'strong…

Bounty: 50

I have a set of drivers that are binary and a concept to measure that takes natural numbers between 1 and 10.

I’m currently using Kruskal’s key driver analysis to determine the relative contribution of each of the drivers. It’s discussed as being more robust than Pearson’s correlation because it takes into consideration the complete set of drivers and their relative contributions.

However, is Kruskal’s approach still valid when the drivers are binary and the concept to measure takes natural numbers between 1 and 10? I thought about switching to the point-biserial correlation; however, this is identical to Pearson’s r.
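For example, with some made-up data in R (just to illustrate the significance issue I describe below, not my real data):

set.seed(42)
n       <- 5000
driver  <- rbinom(n, 1, 0.5)                                              # binary driver
concept <- pmin(pmax(round(5 + 0.5 * driver + rnorm(n, sd = 2)), 1), 10)  # 1-10 score

# Pearson's r on the 0/1 coding is the point-biserial correlation
cor(driver, concept)
# with a large sample, even this weak driver comes out highly "significant"
cor.test(driver, concept)$p.value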

My question is: where do I set the threshold between a ‘good’ driver and a ‘not so good’ driver? It depends on the size of the data and also on its properties. Calculating significance with the t-test that is bundled into scipy’s pearsonr (ignoring the fact that the data may not meet the t-test’s assumptions) marks all of the drivers as significant, as they usually will be, because even weak drivers have some correlation and aren’t ‘random’. Do I therefore require ‘strong’ drivers to have a very low p-value, which seems kind of arbitrary? Or is there a better algorithm that can distinguish between strong and weak drivers?

Or is it that no algorithm can really determine what a strong driver is? Is it dependent upon other factors relating to the context of the data that is being analysed?


Get this bounty!!!
