## #StackBounty: #modeling #p-value #aic Does automatic model selection via AIC bias the p-values of the selected model?

### Bounty: 50

Let’s say I run a procedure where I fit every possible model given some set of covariates and I select the model with the minimum AIC. I know that if my selection criterion were based on minimizing p-values, the p-values of the selected model would be misleading. But what if my selection criterion were AIC alone? To what extent would this bias the p-values?

I had assumed the effect on p-values would be negligible, but came across this paper, which proves the following:

P values are intimately linked to confidence intervals and to
differences in Akaike’s information criterion (ΔAIC), two metrics that
have been advocated as replacements for the P value.

If this is true, does it imply that p-values are misleading after automatic selection based on AIC? To what extent will they be biased, and what determines this?
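One way to see the effect directly is a small simulation (a sketch only; all names and settings below are illustrative, and AIC is computed from the Gaussian log-likelihood up to a constant): generate pure-noise covariates, fit every candidate OLS model, keep the minimum-AIC model, and look at the p-values of the coefficients it retained. Under the null, well-calibrated p-values should fall below 0.05 about 5% of the time.

```python
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def ols_fit(X, y):
    """OLS via least squares; returns coefficient p-values and AIC."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    rss = resid @ resid
    sigma2 = rss / (n - k)
    se = np.sqrt(np.diag(np.linalg.inv(X.T @ X)) * sigma2)
    pvals = 2 * stats.t.sf(np.abs(beta / se), df=n - k)
    # Gaussian log-likelihood based AIC (constants dropped)
    aic = n * np.log(rss / n) + 2 * (k + 1)
    return pvals, aic

n, p = 50, 5
selected_pvals = []
for _ in range(500):
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)           # null: y is unrelated to X
    best = None
    for m in range(1, p + 1):            # every non-empty covariate subset
        for cols in itertools.combinations(range(p), m):
            Xm = np.column_stack([np.ones(n), X[:, cols]])
            pv, aic = ols_fit(Xm, y)
            if best is None or aic < best[0]:
                best = (aic, pv[1:])     # keep slope p-values, drop intercept
    selected_pvals.extend(best[1])

# If selection did not bias the p-values, this would be close to 0.05
frac = np.mean(np.array(selected_pvals) < 0.05)
print(f"fraction of selected-model p-values < 0.05: {frac:.3f}")
```

The fraction comes out well above 0.05, because AIC only retains covariates with sufficiently large t-statistics, so the surviving p-values are biased downward.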

Get this bounty!!!

## #StackBounty: #distributions #p-value #inference #data-transformation #controlling-for-a-variable Generate null distribution from pvalues

### Bounty: 50

I have a set of experiments to which I apply Fisher’s exact test to statistically infer changes in cellular populations.
Some of the data are dummy experiments that model our control experiments which describe the null model.

However, due to some experimental variation, most of the control experiments reject the null hypothesis at $$p_{val} < 0.05$$. Some of the null hypotheses of the actual experimental conditions are also rejected at $$p_{val} < 0.05$$; however, these p-values are orders of magnitude lower than those of my control conditions, which indicates a stronger effect of these experimental conditions. I am not aware of a proper method to quantify these changes and statistically infer them.

An example of what the data looks like:

``````
ID      Pval            Condition
B0_W1   2.890032e-16    CTRL
B0_W10  7.969311e-38    CTRL
B0_W11  8.078795e-25    CTRL
B0_W12  2.430554e-80    TEST1
B0_W2   3.149525e-30    TEST2
B1_W1   3.767914e-287   TEST3
B1_W10  3.489684e-56    TEST4
B1_W10  3.489684e-56    TEST5
``````

My idea is the following:

1. Select the CTRL conditions and let $$X = -\ln(p_{val})$$, which distributes the transformed data as an exponential distribution.
2. Use MLE to find the $$\lambda$$ parameter of the exponential distribution. This will be my null distribution.
3. Apply the same transformation to the rest of the $$p_{val}$$ that correspond to the test conditions.
4. Use the CDF of the null distribution to get the new "adjusted p-values".

This will essentially give a new $$\alpha$$ threshold for the original p-values and transform the results accordingly using the null distribution’s CDF. Are these steps correct? Is using MLE to find the rate correct, or does it violate some of the assumptions needed to achieve my end goal? Any other approaches I could try?
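The four steps can be sketched in a few lines (a sketch only, using hypothetical values in the spirit of the table above; whether the approach is statistically sound is exactly what the question asks). For an exponential distribution, the MLE of the rate is one over the sample mean, and the adjusted p-value is the exponential survival function:

```python
import numpy as np

# Hypothetical p-values in the spirit of the example table
ctrl_p = np.array([2.89e-16, 7.97e-38, 8.08e-25])
test_p = np.array([2.43e-80, 3.15e-30, 3.77e-287])

# Steps 1-2: transform controls and fit the exponential rate by MLE
x_ctrl = -np.log(ctrl_p)
lam = 1.0 / x_ctrl.mean()           # MLE of the exponential rate

# Steps 3-4: transform test p-values and apply the null survival function
x_test = -np.log(test_p)
adjusted_p = np.exp(-lam * x_test)  # P(X >= x) under Exp(lam)

print(adjusted_p)
```

Note that under a uniform null the rate would be exactly 1; fitting $$\lambda$$ from the controls is what absorbs the experimental variation.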


## #StackBounty: #confidence-interval #p-value #model #model-comparison p value for difference in model outcomes

### Bounty: 50

I’ve run two different linear mixed-effects models on the same data and got two different estimates for the gradient of the longitudinal variable, e.g.:

Model 1 has an estimate of 30 with standard error 5.
Model 2 has an estimate of 40 with standard error 4.

I’m interested in calculating a p-value for the difference between the models, using the estimates and standard errors. How do I do this? I’m aware that checking for overlap in the 95% confidence intervals is a bad idea, and that overlapping 83% CIs are a better test, but I would like to be able to quantify this with a p-value.
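For reference, if the two estimates could be treated as independent, the usual Wald-style calculation would be the one below. This is only a sketch: independence is a strong assumption here, since both models are fit to the same data, so the resulting p-value should not be taken at face value.

```python
import math
from scipy import stats

est1, se1 = 30.0, 5.0
est2, se2 = 40.0, 4.0

# Standard error of the difference, assuming independence (likely violated
# when both models are fit to the same data -- see the caveat above)
se_diff = math.sqrt(se1**2 + se2**2)
z = (est2 - est1) / se_diff
p = 2 * stats.norm.sf(abs(z))       # two-sided p-value

print(f"z = {z:.3f}, p = {p:.3f}")
```

With these numbers the difference is not significant at the 5% level, which illustrates why eyeballing CI overlap can mislead.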


## Context

This is somewhat similar to this question, but I do not think it is an exact duplicate.

When you look for instructions on how to perform a bootstrap hypothesis test, it is usually stated that it is fine to use the empirical distribution for confidence intervals, but that you need to bootstrap from the distribution under the null hypothesis to get a correct p-value. As an example, see the accepted answer to this question. A general search on the internet mostly turns up similar answers.

The reason for not using a p-value based on the empirical distribution is that most of the time we do not have translation invariance.

## Example

Let me give a short example. We have a coin and we want to do a one-sided test to see if the frequency of heads is larger than 0.5.

We perform $$n = 20$$ trials and get $$k = 14$$ heads. The true p-value for this test would be $$p = 0.058$$.

On the other hand, if we bootstrap our 14 out of 20 heads, we effectively sample from the binomial distribution with $$n = 20$$ and $$p = \frac{14}{20} = 0.7$$. Shifting this distribution by subtracting 0.2, we get a barely significant result when testing our observed value of 0.7 against the obtained empirical distribution.

In this case the discrepancy is very small, but it gets larger when the success rate we test against gets close to 1.
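The numbers in this example can be checked directly (a minimal sketch; the "shifted" bootstrap p-value reduces to a binomial tail under $$p = 0.7$$):

```python
from scipy import stats

n, k = 20, 14

# Exact one-sided p-value under the null p = 0.5: P(X >= 14)
p_true = stats.binom.sf(k - 1, n, 0.5)

# Bootstrapping from Binomial(20, 0.7) and shifting by -0.2 means asking
# how often (X/n - 0.2) >= 0.7, i.e. X >= 18 under p = 0.7
p_shifted = stats.binom.sf(17, n, 0.7)

print(f"true p = {p_true:.4f}, shifted-bootstrap p = {p_shifted:.4f}")
```

The exact p-value is about 0.058 (not significant at 5%), while the shifted empirical version is about 0.035 (barely significant), matching the discrepancy described above.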

## Question

Now let me come to the real point of my question: the very same defect also holds for confidence intervals. In fact, if a confidence interval has the stated confidence level $$\alpha$$, then the confidence interval not containing the parameter under the null hypothesis is equivalent to rejecting the null hypothesis at a significance level of $$1 - \alpha$$.

Why is it that confidence intervals based upon the empirical distribution are widely accepted while the p-values are not?

Is there a deeper reason or are people just not as conservative with confidence intervals?

In this answer Peter Dalgaard seems to agree with my argument. He says:

…least not (much) worse than the calculation of CI.

Where does the "(much)" come from? It implies that generating p-values that way is slightly worse, but he does not elaborate on the point.

## Final thoughts

Also, in An Introduction to the Bootstrap, Efron and Tibshirani dedicate a lot of space to confidence intervals but little to p-values that are not generated under a proper null hypothesis distribution, with the exception of one throwaway line about the general equivalence of confidence intervals and p-values in the chapter on permutation testing.

Let us also come back to the first question I linked. I agree with the answer by Michael Chernick, but he, too, argues that both confidence intervals and p-values based on the empirical bootstrap distribution are equally unreliable in some scenarios. That does not explain why so many people tell you that the intervals are OK but the p-values are not.


## #StackBounty: #hypothesis-testing #statistical-significance #anova #p-value Determine which performance intervention is best?

### Bounty: 100

Suppose I have numerical data describing the total process time for a given software simulation. This data is broken up into 5 groups (Base, AD1, AD2, AD3, AD4), each detailing a different performance intervention, with approximately the same number of observations per group.

My goal is to determine whether the performance interventions result in significantly different process times than the base case, and to determine which intervention is “best”, with “best” defined as the least amount of process time.

To clarify, my data comprises all the “regression tests” in our code framework. At this point I am looking at a high level at what the interventions do to overall process time, but I will eventually create sub-categories within each intervention to determine inter-group effects on process time.

My data has some extreme outliers as can be seen from this graphic:

My hypothesis is as follows:

$$H_{0}: \mu_{\text{base}} = \mu_{\text{AD1}} = \mu_{\text{AD2}} = \mu_{\text{AD3}} = \mu_{\text{AD4}}$$

$$H_{A}: \text{Not all means are equal}$$

I am unsure what my hypothesis would be in determining the best “metric”. I am also unsure if using the mean is appropriate in this circumstance given the outliers in my data.

My idea is to use some form of ANOVA or a Kruskal-Wallis test, and then maybe Tukey’s test to determine which one is best. I am open to Bayesian or frequentist approaches to this. I might be overthinking this as well.
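That plan can be sketched on made-up data (group names and values below are purely illustrative): Kruskal-Wallis is rank-based, so it serves as an outlier-robust omnibus test, and the median is a natural "best" metric when the means are distorted by extreme values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical heavy-tailed process times for each intervention
groups = {
    "Base": rng.lognormal(3.0, 0.4, 100),
    "AD1": rng.lognormal(2.8, 0.4, 100),
    "AD2": rng.lognormal(2.9, 0.4, 100),
    "AD3": rng.lognormal(2.6, 0.4, 100),
    "AD4": rng.lognormal(3.0, 0.4, 100),
}

# Kruskal-Wallis: rank-based omnibus test, robust to extreme outliers
h, p = stats.kruskal(*groups.values())
print(f"Kruskal-Wallis H = {h:.2f}, p = {p:.2e}")

# If the omnibus test rejects, compare medians rather than outlier-driven means
medians = {name: np.median(v) for name, v in groups.items()}
best = min(medians, key=medians.get)
print("lowest median process time:", best)
```

After a rejection, pairwise follow-up would typically use Dunn’s test (for the rank-based route) rather than Tukey’s, which assumes normal residuals.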


## #StackBounty: #machine-learning #statistical-significance #t-test #p-value How to decide if means of two sets are statistically signifi…

### Bounty: 50

I have a data set consisting of some number of pairs of real numbers. For example:

``````
(1.2, 3.4), (3.2, 2.7), ..., (4.2, 1.0)
``````

or

``````
(x1, y1), (x2, y2), ..., (xn, yn)
``````

I want to know if the second variable depends on the first one (it is known in advance that if there is a dependency, it is very weak, so it is hard to detect).

I split the data set into two parts using the first number (the Xs). Then I use the mean of the Ys in each of the two sub-sets as “predictions”. I find the split such that the squared deviation between the predictions and the real values of the Ys is minimal. Basically, I do what is done by decision trees.

Now I want to know if the found split and the corresponding difference between the two means are significant. I could use some standard test to check if the means of two sets are statistically significantly different, but I think that would be incorrect because we chose the split that maximises this difference. What would be the way to account for that?
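One standard way to account for the search is a permutation test (a sketch on hypothetical data): recompute the best split on shuffled Ys each time, so the null distribution reflects the same maximization over splits that produced the observed statistic.

```python
import numpy as np

rng = np.random.default_rng(2)

def best_split_sse(x, y):
    """Smallest total within-group SSE over all splits on sorted x
    (the quantity the decision-tree-style search minimizes)."""
    ys = y[np.argsort(x)]
    n = len(ys)
    sses = []
    for i in range(1, n):
        left, right = ys[:i], ys[i:]
        sses.append(((left - left.mean()) ** 2).sum()
                    + ((right - right.mean()) ** 2).sum())
    return min(sses)

# Hypothetical data with a weak dependence of y on x
x = rng.uniform(0, 1, 200)
y = 0.3 * (x > 0.5) + rng.standard_normal(200)

observed = best_split_sse(x, y)

# Permutation null: shuffle y and redo the same best-split search, so the
# null distribution accounts for the maximization over splits
n_perm = 200
perm = [best_split_sse(x, rng.permutation(y)) for _ in range(n_perm)]
p = (1 + sum(s <= observed for s in perm)) / (1 + n_perm)
print(f"permutation p-value for the best split: {p:.3f}")
```

Because every permutation re-runs the full split search, the resulting p-value is valid despite the data-driven choice of split point.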


## #StackBounty: #distributions #p-value #goodness-of-fit #kolmogorov-smirnov Goodness-of-fit test on arbitrary parametric distributions w…

### Bounty: 100

There have been many questions regarding this topic already addressed on CV. However, I was still unsure if this question was addressed directly.

1. Is it possible, for any arbitrary parametric distribution, to properly calculate the p-value for a Kolmogorov-Smirnov test where the parameters of the null distribution are estimated from the data?
2. Or does the choice of parametric distribution determine if this can be achieved?
3. What about the Anderson-Darling and Cramér-von Mises tests?
4. What is the general procedure for estimating the correct p-values?

My general understanding of the procedure would be the following. Assume we have data $$X$$ and a parametric distribution $$F(x;\theta)$$. Then I would:

• Estimate parameters $$\hat{\theta}_{0}$$ for $$F(x;\theta)$$.
• Calculate the Kolmogorov-Smirnov, Anderson-Darling, and Cramér-von Mises test statistics: KS$$_{0}$$, AD$$_{0}$$ and CVM$$_{0}$$.
• For $$i=1,2,\ldots,n$$:
1. Simulate data $$y$$ from $$F(\cdot;\hat{\theta}_{0})$$
2. Estimate $$\hat{\theta}_{i}$$ for $$F(y;\theta)$$
3. Calculate the KS$$_{i}$$, AD$$_{i}$$ and CVM$$_{i}$$ statistics for $$F(y;\hat{\theta}_{i})$$
• Calculate $$p$$-values as the proportion of these statistics that are more extreme than KS$$_{0}$$, AD$$_{0}$$ and CVM$$_{0}$$, respectively.

Is this correct?
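As a concrete instance of the procedure above (a sketch using a normal null and only the KS statistic; the same loop extends to AD and CvM), the key point is re-estimating the parameters inside each bootstrap iteration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.standard_normal(200) * 2 + 1      # data whose distribution we test

# Step 1: estimate parameters from the data
mu0, sd0 = x.mean(), x.std(ddof=1)

# Step 2: KS statistic against the *fitted* normal
ks0 = stats.kstest(x, "norm", args=(mu0, sd0)).statistic

# Step 3: parametric bootstrap, re-estimating parameters each time
n_boot = 500
ks_boot = np.empty(n_boot)
for i in range(n_boot):
    y = rng.normal(mu0, sd0, size=len(x))
    mu_i, sd_i = y.mean(), y.std(ddof=1)  # re-estimate (the crucial step)
    ks_boot[i] = stats.kstest(y, "norm", args=(mu_i, sd_i)).statistic

# Step 4: p-value as the proportion of bootstrap statistics >= observed
p = (1 + np.sum(ks_boot >= ks0)) / (1 + n_boot)
print(f"parametric-bootstrap KS p-value: {p:.3f}")
```

Skipping the re-estimation in step 3 and comparing against the standard KS tables would give badly conservative p-values, which is the well-known Lilliefors problem.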


## #StackBounty: #p-value #intuition #application #communication #climate Evidence for man-made global warming hits 'gold standard'

### How should we interpret the $$5\sigma$$ threshold in this research on climate change?

This message in a Reuters article from 25 February is currently all over the news:

They said confidence that human activities were raising the heat at the Earth’s surface had reached a “five-sigma” level, a statistical gauge meaning there is only a one-in-a-million chance that the signal would appear if there was no warming.

I believe that this refers to the article “Celebrating the anniversary of three key events in climate change science”, which contains a plot, shown schematically below (it is a sketch because I could not find an open-source image of the original; similar free images are found here). Another article from the same research group, which seems to be a more original source, is here (but it uses a 1% significance level instead of $$5\sigma$$).

The plot presents measurements from three different research groups: (1) Remote Sensing Systems, (2) the Center for Satellite Applications and Research, and (3) the University of Alabama in Huntsville.

The plot displays three rising curves of signal-to-noise ratio as a function of trend length.

So somehow scientists have measured an anthropogenic signal of global warming (or climate change?) at a $$5\sigma$$ level, which is apparently some scientific standard of evidence.

For me such a graph, with its high level of abstraction, raises many questions$$^{\dagger}$$, and in general I wonder: how did they do this? How do we explain this experiment in simple (but not so abstract) words, and also explain the meaning of the $$5\sigma$$ level?

I ask this question here because I do not want a discussion about climate. Instead I want answers regarding the statistical content, and especially clarification of the meaning of a statement that uses/claims $$5\sigma$$.

$$^{\dagger}$$: What is the null hypothesis? How did they set up the experiment to get an anthropogenic signal? What is the effect size of the signal? Is it just a small signal that we only measure now because the noise is decreasing, or is the signal increasing? What kind of assumptions are made in the statistical model by which they determine the crossing of a $$5\sigma$$ threshold (independence, random effects, etc.)? Why are the three curves for the different research groups different? Do they have different noise, or do they have different signals, and in the case of the latter, what does that mean for the interpretation of probability and external validity?
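As a small aside on the arithmetic alone (a sketch that says nothing about how the climate signal itself was constructed): the one-sided tail probability beyond a $$5\sigma$$ normal threshold is about $$2.9 \times 10^{-7}$$, roughly one in 3.5 million, so the article's "one-in-a-million" phrasing is a loose rounding.

```python
from scipy import stats

# One-sided tail probability beyond 5 standard deviations of a normal
p_one_sided = stats.norm.sf(5)
print(f"P(Z > 5) = {p_one_sided:.3e}")   # about 2.87e-07
print(f"about one in {1 / p_one_sided:,.0f}")
```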

