## #StackBounty: #time-series #forecasting #p-value #model-evaluation #diagnostic Result of a diagnostic test of a predictive model lookin…

### Bounty: 50

I have created a predictive model that outputs a predictive density. I used 1000 rolling windows to estimate the model and predict one step ahead in each window. I collected the 1000 predictions and compared them to the actual realizations. I used several diagnostic tests, among them Kolmogorov-Smirnov. I saved the $$p$$-value of the test.

I did the same for multiple time series. Then I looked at all of the $$p$$-values from the different series. I found that they are `0.440, 0.579, 0.848, 0.476, 0.753, 0.955, 0.919, 0.498, 0.997`. At first I was quite happy that they are much larger than `0.010`, `0.050` or `0.100` (to use the standard cut-off values). But then a colleague of mine pointed out that the $$p$$-values should be distributed as $$\text{Uniform}[0,1]$$ under the null hypothesis of a correct predictive distribution, and so I should perhaps not be so happy.

On the one hand, the colleague must be right; the $$p$$-values should ideally be uniformly distributed. On the other hand, I have found that my model predicts "better" than the true model normally would; the discrepancy between the predicted density and the realized density is less than one would normally expect between the true density and the realized density. This could be an indication of overfitting if I were evaluating my model in-sample, but the model has been evaluated out of sample. What does this tell me? Should I be concerned with a diagnostic test’s $$p$$-values being too high?

You could say this is just a small set of $$p$$-values (just 9 of them) so anything could happen, and you might be right. However, suppose I have a larger set of $$p$$-values that are closer to 1 than a uniform distribution would produce; is that a problem? What does that tell me?
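
One way to make the colleague's point concrete is to test the collected $$p$$-values themselves against $$\text{Uniform}[0,1]$$. The sketch below is only illustrative: it assumes the series (and hence the $$p$$-values) are independent, and the helper names are made up for this example. It computes the Kolmogorov-Smirnov distance of the reported values from the uniform CDF, and the exact probability that the smallest of $$n$$ independent uniform draws is at least as large as the smallest value observed.

```python
# Illustrative check of whether a small set of p-values looks Uniform(0, 1).
# Assumption: the p-values come from independent series.

p_values = [0.440, 0.579, 0.848, 0.476, 0.753, 0.955, 0.919, 0.498, 0.997]


def ks_distance_from_uniform(values):
    """Largest gap between the empirical CDF and the Uniform(0, 1) CDF."""
    xs = sorted(values)
    n = len(xs)
    above = max((i + 1) / n - x for i, x in enumerate(xs))
    below = max(x - i / n for i, x in enumerate(xs))
    return max(above, below)


def prob_min_at_least(values):
    """P(min of n iid Uniform(0, 1) draws >= observed minimum) = (1 - min)^n."""
    return (1.0 - min(values)) ** len(values)


ks_d = ks_distance_from_uniform(p_values)
tail_prob = prob_min_at_least(p_values)
print(f"KS distance from uniform: {ks_d:.3f}")
print(f"P(all {len(p_values)} p-values >= {min(p_values)}): {tail_prob:.4f}")
```

A small tail probability would say that the $$p$$-values are collectively larger than uniformity predicts; it would not say why, and dependence between the series would invalidate the calculation.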

## #StackBounty: #probability #hypothesis-testing #distributions #statistical-significance #p-value How to determine the 'significance…

### Bounty: 100

Problem: I have carried out a series of biological experiments where the output of each experiment is an N x N matrix of counts. I then created a custom distance metric that takes two rows of counts and calculates the ‘difference’ between them (I will call this difference metric D). I calculated D for all pairwise comparisons and now have an array of D values called D_array.

My assumption, based on the biology, is that the majority of the D values in D_array reflect no significant difference between the two rows of counts, and that only the D values at or above the 95th percentile represent real differences. Let us assume that this is true, even if it doesn’t make sense.

So if D_array = [0, 1, 2, 3, 4 … 99] (100 values), then only D scores of 95-99 represent a real difference between two rows of counts.

Note: this toy D_array is not representative of my data. My actual data has a distribution of values like this (black line represents the mean): https://imgur.com/usvvIgB

Given D_array I want to be able to determine whether a newly calculated distance value D’ is "significant" based on my previous data: the distribution of my D_array. Ideally, I would like to provide some sort of metric of ‘significance’ such as a p-value. By significance I mean the probability / significance of having gotten a result as extreme as D’.

After a bit of reading, I found that I can use bootstrapping to calculate a 95% confidence interval for D_array, and then essentially ask if D’ is outside of the 95% CI range. However, I am unsure if there is a way to determine how significant having obtained a value of D’ is based on D_array.

My questions are:

1. Does asking if D’ is outside of the 95% CI of bootstrapped D_array in order to determine whether D’ represents a ‘real’ difference between two rows of counts make sense?

2. Given D’ and D_array, how can I determine the significance of having gotten a value as extreme as D’? I have seen bootstrapping used to calculate p-values, but this usually requires the means of two different distributions, which I do not have in this case.

3. Is there a better way to determine whether a new observation is ‘significantly’ different from my prior distribution of ‘null’ (D_array) data? If so, how?
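
For reference, one common way to attach a p-value to a single new value, given a sample taken to represent the null, is an empirical (rank-based) p-value. The sketch below is a minimal illustration and not a recommendation for this particular data: it treats every entry of D_array as a null draw (which sits uneasily with the assumption that the top 5% are real differences), and the function name `empirical_p_value` is invented for the example.

```python
# Sketch: empirical one-sided p-value of a new distance against null distances.
# Assumption: every value in d_array is treated as a draw from the null.

def empirical_p_value(d_array, d_new):
    """Share of null values at least as large as d_new.

    The +1 in numerator and denominator counts d_new itself, so the
    result is never exactly zero.
    """
    n_extreme = sum(1 for d in d_array if d >= d_new)
    return (n_extreme + 1) / (len(d_array) + 1)


d_array = list(range(100))  # the toy example from the question: 0, 1, ..., 99
p = empirical_p_value(d_array, 97)
print(f"empirical p-value for D' = 97: {p:.4f}")
```

With this definition the smallest attainable p-value is 1 / (len(d_array) + 1), so the resolution is limited by how many null distances are available.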

## #StackBounty: #correlation #multiple-regression #p-value #partial-correlation \$p\$-Values: Standardized coefficients vs. partial \$R^2\$ v…

### Bounty: 50

In this excellent answer, the definitions of, and differences between, the three quantities in the title of this question are laid out.

My question concerns the relationship between their $$p$$-values. This answer states that the $$p$$-values of a standardized $$\beta$$ and the corresponding partial $$R^2$$ are the same. My question is two-fold.

1. Why is this true?
2. If it is true, what is the relationship between this $$p$$-value and that of the corresponding semi-partial correlation coefficient?

In response to the comment from @ttnphns below, I ran an example on the `duncan_prestige` dataset. One does in fact see that the $$p$$-values for the standardized $$\beta$$s are the same as those for the partial correlation coefficients, so question 1 above has been clarified.

But notice now that the $$p$$-values for the semi-partial correlation coefficients are in fact considerably larger than those for the partial correlations. (I was able to reproduce this behaviour in other datasets as well.) Why does this happen?

My intuition concurs with what @ttnphns claims below, but consider that the `ppcor` documentation (on which the Python package that I used for my computation is based) lists the exact same formula (2.8) for the $$t$$-statistic of the partial and semi-partial correlation coefficients; therefore, since these coefficients are de facto different due to their differing scalings, they will have different $$t$$-statistics and $$p$$-values (since $$\text{df}$$ is the same in both cases). Is this an error in `ppcor` or is something else going on?
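
The discrepancy can be reproduced numerically without any package. The sketch below is an illustration under simplifying assumptions: a single control variable, made-up correlations and sample size, and invented helper names. It applies the usual partial-correlation formula $$t = r\sqrt{\text{df}/(1-r^2)}$$ to both coefficients, and also rescales the semi-partial coefficient by the full-model $$R^2$$ instead. Whether `ppcor` itself computes its statistics this way is not checked here.

```python
import math

# Sketch for one predictor x, one control z and outcome y, starting from
# pairwise correlations. All numbers are made up for illustration.


def partial_and_semipartial(r_xy, r_xz, r_yz):
    """Return (partial r of x and y given z, semi-partial r of y with x residualised on z)."""
    num = r_xy - r_xz * r_yz
    partial = num / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))
    semipartial = num / math.sqrt(1 - r_xz**2)
    return partial, semipartial


def t_from_r(r, df):
    """t-statistic formula usually quoted for a partial correlation."""
    return r * math.sqrt(df / (1 - r**2))


n = 45                      # hypothetical sample size
df = n - 2 - 1              # one control variable
r_xy, r_xz, r_yz = 0.6, 0.5, 0.7
pr, sr = partial_and_semipartial(r_xy, r_xz, r_yz)

r2_full = r_yz**2 + sr**2   # R^2 of the regression y ~ x + z
t_partial = t_from_r(pr, df)
t_semi_same_formula = t_from_r(sr, df)                   # smaller, since |sr| <= |pr|
t_semi_full_model = sr * math.sqrt(df / (1 - r2_full))   # equals t_partial

print(f"t (partial)                      = {t_partial:.4f}")
print(f"t (semi-partial, same formula)   = {t_semi_same_formula:.4f}")
print(f"t (semi-partial, full-model R^2) = {t_semi_full_model:.4f}")
```

In this setup the first and third statistics coincide, while plugging the semi-partial coefficient into the partial-correlation formula gives a smaller $$t$$ and hence a larger $$p$$-value.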

## #StackBounty: #r #statistical-significance #anova #p-value #mice Differences in p.values for MICE imputed dataset between summary(pool(…

### Bounty: 50

I brought a reproducible example with me!

I was trying to figure out the best way to do parameter significance tests with MICE-imputed data. I came across different options, but they seem to lack consistency (commands that I thought should do the same thing give different results).

In the reprex, there is first an illustration in the case of complete data, where I show two equivalent commands. I then try to translate that to the MICE imputed data case, without much success.

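``````
# Helper: load the given packages, installing any that are missing
using<-function(...) {
  libs<-unlist(list(...))
  req<-unlist(lapply(libs,require,character.only=TRUE))
  need<-libs[req==FALSE]
  if(length(need)>0){
    install.packages(need)
    lapply(need,require,character.only=TRUE)
  }
}

# mice: mice(), pool(), D2() and the nhanes data; car: Anova(); miceadds: mi.anova()
using("mice", "car", "miceadds")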

######################################################
# Classic linear regression with complete cases
######################################################

nhanes_complete=nhanes[complete.cases(nhanes), ]

big_model_lm=lm("bmi ~ age + chl", data=nhanes_complete)
small_model_lm=lm("bmi ~ age", data=nhanes_complete)

print(summary(big_model_lm))
#This is type II Anova
print(Anova(big_model_lm, type=2))
#These 2 result in the same thing

######################################################
# MICE imputation and then do regression
######################################################

#Imputation
set.seed(1)
imp <- mice(nhanes, m=10)

with_test_big_1=with(imp, exp=lm(as.formula("bmi ~ chl + age")))
with_test_big_2=with(imp, exp=lm(as.formula("bmi ~ age + chl")))
with_test_small_1=with(imp, exp=lm(as.formula("bmi ~ age")))
with_test_small_2=with(imp, exp=lm(as.formula("bmi ~ chl")))

#These 4 are consistent
print(D2(with_test_big_1,with_test_small_1))
print(D2(with_test_big_2,with_test_small_1))
print(D2(with_test_big_1,with_test_small_2))
print(D2(with_test_big_2,with_test_small_2))
#This is the same as D2 from the mice package but not consistent depending on order of the predictors
mi.anova(imp, formula="bmi ~ age + chl", type=2)
mi.anova(imp, formula="bmi ~ chl + age ", type=2)
#This is completely different from all the previous ones
print(summary(pool(with_test_big_1)))

# What is the difference between mi.anova()/D2() and summary(pool())?
# Why does the order matter for mi.anova?
``````

Does anyone have an explanation for these observations? Which is the recommended procedure?
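
For orientation: `summary(pool(...))` combines the per-imputation coefficient estimates with Rubin's rules, whereas `D2()` pools the test statistics from the individual imputed data sets, so the two are not the same computation. The sketch below illustrates only the Rubin's-rules part for a single coefficient. It is written in Python purely for illustration, the numbers are invented, `pool_rubin` is a hypothetical helper, and the degrees-of-freedom adjustment that `mice` applies is left out.

```python
from statistics import mean, variance

# Sketch of Rubin's rules for one coefficient across m imputed data sets.
# The estimates and standard errors below are made up.


def pool_rubin(estimates, std_errors):
    """Return the pooled estimate and its pooled standard error."""
    m = len(estimates)
    q_bar = mean(estimates)                          # pooled point estimate
    within = mean([se**2 for se in std_errors])      # average within-imputation variance
    between = variance(estimates)                    # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between           # total variance
    return q_bar, total ** 0.5


estimates = [0.050, 0.058, 0.047, 0.061, 0.054]
std_errors = [0.020, 0.021, 0.019, 0.022, 0.020]
pooled_est, pooled_se = pool_rubin(estimates, std_errors)
print(f"pooled estimate = {pooled_est:.4f}, pooled SE = {pooled_se:.4f}")
```

The pooled standard error exceeds the average per-imputation standard error because the between-imputation spread is added on top.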

## #StackBounty: #modeling #p-value #aic Does automatic model selection via AIC bias the p-values of the selected model?

### Bounty: 50

Let’s say I run a procedure where I fit every possible model given some set of covariates and I select the model with the minimum AIC. I know that if my selection criterion were based on minimizing p-values, the p-values of the selected model would be misleading. But what if my selection criterion were AIC alone? To what extent would this bias the p-values?

I had assumed the effect on p-values would be negligible, but came across this paper, which proves the following:

> P values are intimately linked to confidence intervals and to
> differences in Akaike’s information criterion (ΔAIC), two metrics that
> have been advocated as replacements for the P value.

If this is true, does it imply that p-values are misleading after automatic selection based on AIC? To what extent will they be biased, and what determines this?
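
The link between ΔAIC and p-values can be written down in the simplest case. The sketch below is a hedged illustration, not taken from the paper: it assumes two nested models that differ by one parameter and the usual asymptotic chi-square distribution of the likelihood-ratio statistic. Since AIC = 2k − 2 ln L, the larger model has the lower AIC exactly when the likelihood-ratio statistic exceeds 2. The log-likelihood values are invented.

```python
import math

# Sketch: p-value threshold implied by choosing between two nested models
# (one extra parameter) on AIC alone. Assumes the asymptotic chi-square(1)
# distribution of the likelihood-ratio statistic.


def chi2_sf_df1(x):
    """Survival function of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def delta_aic(loglik_small, loglik_big, extra_params=1):
    """AIC(big) - AIC(small) for nested models; negative favours the big model."""
    return 2 * extra_params - 2 * (loglik_big - loglik_small)


implied_alpha = chi2_sf_df1(2.0)   # big model wins on AIC when LR > 2
print(f"implied significance level: {implied_alpha:.4f}")

# Made-up log-likelihoods: LR = 3, so delta AIC = -1 and the LR-test p is about 0.083
example_delta = delta_aic(loglik_small=-120.0, loglik_big=-118.5)
example_p = chi2_sf_df1(3.0)
print(f"delta AIC = {example_delta:.1f}, LR-test p = {example_p:.4f}")
```

Under these assumptions, keeping a single extra term by AIC behaves like a likelihood-ratio test at a level of roughly 0.157, which is one way to see that selection on AIC and selection on p-values are not independent of each other.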

## #StackBounty: #distributions #p-value #inference #data-transformation #controlling-for-a-variable Generate null distribution from pvalues

### Bounty: 50

I have a set of experiments to which I apply Fisher’s exact test to statistically infer changes in cellular populations.
Some of the data are dummy experiments that model our control experiments, which describe the null model.

However, due to some experimental variation, most of the control experiments reject the null hypothesis at $$p_{val} <0.05$$. Some of the null hypotheses of the actual experimental conditions are also rejected at $$p_{val} <0.05$$. However, these p-values are orders of magnitude lower than those of my control conditions. This indicates a stronger effect of these experimental conditions. However, I am not aware of a proper method to quantify these changes and statistically infer them.

An example of what the data looks like:

``````
ID      Pval            Condition
B0_W1   2.890032e-16    CTRL
B0_W10  7.969311e-38    CTRL
B0_W11  8.078795e-25    CTRL
B0_W12  2.430554e-80    TEST1
B0_W2   3.149525e-30    TEST2
B1_W1   3.767914e-287   TEST3
B1_W10  3.489684e-56    TEST4
B1_W10  3.489684e-56    TEST5
``````

1. Select the CTRL conditions and let $$X = -\ln(p_{val})$$, so that the transformed data follow an exponential distribution.
2. Use MLE to find the $$\lambda$$ parameter of the exponential distribution. This will be my null distribution.
3. Apply the same transformation to the rest of the $$p_{val}$$ that correspond to the test conditions.
4. Use the cdf of the null distribution to get the new "adjusted p-values".

This essentially gives a new $$\alpha$$ threshold for the original p-values and transforms the results accordingly using the cdf of the null distribution. Are these steps correct? Is using MLE to find the rate correct, or does it violate some of the assumptions needed to achieve my end goal? Are there any other approaches I could try?
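
The four steps can be sketched as follows, using the example rows above. This is only a literal rendering of the proposed procedure, not an endorsement of it: it assumes that $$-\ln(p_{val})$$ of the controls really is exponential, it fits the rate on just three control values, and it takes the upper tail (one minus the cdf) as the adjusted p-value, because a larger $$X$$ corresponds to a smaller original p-value. The variable names are invented.

```python
import math

# Sketch of the proposed recalibration: fit an exponential null to
# -ln(p) of the controls, then read test conditions off its upper tail.

ctrl_p = [2.890032e-16, 7.969311e-38, 8.078795e-25]
test_p = {
    "TEST1": 2.430554e-80,
    "TEST2": 3.149525e-30,
    "TEST3": 3.767914e-287,
    "TEST4": 3.489684e-56,
    "TEST5": 3.489684e-56,
}

# Steps 1-2: transform the controls and estimate the rate by maximum likelihood
x_ctrl = [-math.log(p) for p in ctrl_p]
lam = len(x_ctrl) / sum(x_ctrl)   # MLE of an exponential rate is 1 / sample mean

# Steps 3-4: transform the test p-values and take the exponential upper tail
adjusted = {}
for name, p in test_p.items():
    x = -math.log(p)
    adjusted[name] = math.exp(-lam * x)   # equals p ** lam

print(f"estimated rate: {lam:.5f}")
for name, value in adjusted.items():
    print(f"{name}: adjusted p = {value:.3g}")
```

Note that the adjusted value is simply the original p-value raised to the power $$\lambda$$, so the ordering of the conditions is unchanged; only the scale is recalibrated against the controls.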

## #StackBounty: #confidence-interval #p-value #model #model-comparison p value for difference in model outcomes

### Bounty: 50

I’ve run two different linear mixed effects models on the same data and got two different estimates for the gradient of the longitudinal variable. For example:

model 1 has estimate 30 with standard error 5.
model 2 has estimate 40 with standard error 4.

I’m interested in calculating, from the estimates and standard errors, a p value for whether the two models’ estimates differ. How do I do this? I’m aware that checking for overlap in the 95% confidence intervals is a bad idea, and that overlapping 83% CIs are a better test, but I would like to be able to quantify this with a p value.
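
As a point of reference, the textbook calculation for two estimates with known standard errors is a z-test on their difference. The sketch below applies it to the numbers above under a strong assumption that is unlikely to hold here: that the two estimates are independent. Both models were fitted to the same data, so their estimates are probably correlated and the covariance term is missing from the denominator. The function name is invented for the example.

```python
import math

# Sketch: two-sided z-test for the difference between two estimates.
# Assumption (probably violated here): the estimates are independent.


def p_value_difference(est1, se1, est2, se2):
    """Return (z, two-sided p) for est2 - est1 using a normal approximation."""
    z = (est2 - est1) / math.sqrt(se1**2 + se2**2)
    p = math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))
    return z, p


z, p = p_value_difference(30, 5, 40, 4)
print(f"z = {z:.3f}, two-sided p = {p:.3f}")
```

If the two estimates are positively correlated, the true standard error of the difference is smaller than the one used here, so this p value would be too large.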
