## Context

This is somewhat similar to this question, but I do not think it is an exact duplicate.

When you look for instructions on how to perform a bootstrap hypothesis test, it is usually stated that it is fine to use the empirical distribution for confidence intervals, but that you need to bootstrap from the distribution under the null hypothesis to get a correct p-value. As an example, see the accepted answer to this question. A general search on the internet mostly turns up similar answers.

The reason for not using a p-value based on the empirical distribution is that most of the time we do not have translation invariance.

## Example

Let me give a short example. We have a coin and we want to do a one-sided test to see whether the frequency of heads is larger than 0.5.

We perform $n = 20$ trials and get $k = 14$ heads. The true p-value for this test would be $p = 0.058$.

On the other hand, if we bootstrap our 14 out of 20 heads, we effectively sample from the binomial distribution with $n = 20$ and $p = \frac{14}{20} = 0.7$. Shifting this distribution by subtracting 0.2, so that it is centered at the null value of 0.5, we will get a barely significant result when testing our observed value of 0.7 against the obtained empirical distribution.

In this case the discrepancy is very small, but it gets larger when the success rate we test against gets close to 1.
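For concreteness, the calculation in the example above can be sketched in Python (a minimal sketch; the seed and the number of replicates `B` are arbitrary choices, not part of the original example):

```python
import numpy as np
from math import comb

rng = np.random.default_rng(42)
n, k = 20, 14
data = np.array([1] * k + [0] * (n - k))  # 14 heads, 6 tails
obs = k / n                               # observed frequency 0.7

# Exact one-sided p-value under H0: P(X >= 14) for X ~ Binomial(20, 0.5)
p_exact = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n   # ~ 0.058

# Bootstrap from the empirical distribution, i.e. Binomial(20, 0.7) / 20,
# then shift by 0.2 so the distribution is centered at the null value 0.5
B = 200_000
boot = rng.choice(data, size=(B, n), replace=True).mean(axis=1)
shifted = boot - (obs - 0.5)

# Fraction of shifted replicates at least as extreme as the observed 0.7
# (small tolerance guards against float rounding at the boundary value 0.9)
p_shifted = np.mean(shifted >= obs - 1e-12)

print(p_exact, p_shifted)
```

The shifted-bootstrap p-value comes out around 0.035, the "barely significant" result mentioned above, while the exact binomial p-value is about 0.058.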

## Question

Now let me come to the real point of my question: the very same defect also holds for confidence intervals. In fact, if a confidence interval has the stated confidence level $\alpha$, then the confidence interval not containing the parameter value under the null hypothesis is equivalent to rejecting the null hypothesis at a significance level of $1-\alpha$.
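To make this equivalence concrete for the coin example: a one-sided 95% percentile interval from the same (unshifted) empirical bootstrap distribution excludes 0.5 exactly when the fraction of bootstrap replicates at or below 0.5 is under 5%. A minimal Python sketch (seed and replicate count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 20, 14
data = np.array([1] * k + [0] * (n - k))  # 14 heads, 6 tails

# Unshifted bootstrap of the observed frequency, i.e. Binomial(20, 0.7) / 20
B = 200_000
boot = rng.choice(data, size=(B, n), replace=True).mean(axis=1)

# One-sided 95% percentile interval: [5th percentile, 1]
lower = np.quantile(boot, 0.05)

# Duality: 0.5 falling below the interval corresponds to the empirical
# distribution putting less than 5% of its mass at or below 0.5
tail = np.mean(boot <= 0.5)

print(lower, tail)
```

With these data the lower bound lands around 0.55, so the interval excludes 0.5 even though the exact binomial p-value is about 0.058, mirroring the too-small shifted-bootstrap p-value from the example.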

**Why is it that the confidence intervals based upon the empirical distribution are widely accepted and the p-value not?**

Is there a deeper reason or are people just not as conservative with confidence intervals?

In this answer Peter Dalgaard seems to agree with my argument. He says:

> There's nothing particularly wrong about this line of reasoning, or at least not (much) worse than the calculation of CI.

Where is the "(much)" coming from? It implies that generating p-values that way is slightly worse, but he does not elaborate on the point.

## Final thoughts

In *An Introduction to the Bootstrap*, Efron and Tibshirani likewise dedicate a lot of space to confidence intervals but not to p-values, unless these are generated under a proper null hypothesis distribution, with the exception of one throwaway line about the general equivalence of confidence intervals and p-values in the chapter on permutation testing.

Let us also come back to the first question I linked. I agree with the answer by Michael Chernick, but he too argues that both confidence intervals and p-values based on the empirical bootstrap distribution are equally unreliable in some scenarios. This still does not explain why so many people tell you that the intervals are OK but the p-values are not.
