## #StackBounty: #confidence-interval #cross-validation #modeling #standard-error #uncertainty K-fold cross validation based standard erro…

### Bounty: 50

I have an expensive model (or class of models). My baseline approach to quantifying uncertainty in the model parameters is Hessian-based standard errors, and I use k-fold cross validation for model comparison / validation. While a full bootstrap would be pleasant as a more robust uncertainty quantification, it is quite expensive. I think I should also be able to develop expectations for the variance of the leave-k-out estimates, to at least get a rough sense of where the Hessian-based standard error estimates are not performing well. Does someone know how to do this, or can point to work that does this? Something like an approximate jackknife?
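One cheap option, sketched here under the assumption that you already retain the k refitted parameter vectors from cross-validation, is a delete-d-jackknife-style variance estimate over the fold refits (the function name and interface are mine, not from any particular paper):

```python
import numpy as np

def kfold_jackknife_se(fold_estimates):
    """Rough jackknife-style standard errors from the k parameter
    vectors refit during k-fold CV (one row per leave-one-fold-out fit).

    Each fold refit is treated as a delete-d jackknife replicate, so
    this is only a coarse cross-check on Hessian-based standard errors.
    """
    est = np.asarray(fold_estimates, dtype=float)
    k = est.shape[0]
    mean = est.mean(axis=0)
    # delete-d jackknife variance: ((k - 1) / k) * sum of squared deviations
    var = (k - 1) / k * ((est - mean) ** 2).sum(axis=0)
    return np.sqrt(var)
```

Comparing these per-parameter values against the Hessian-based standard errors would flag parameters where the two disagree badly, at no refitting cost beyond the CV you are already doing.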

Get this bounty!!!

## #StackBounty: #confidence-interval #python #standard-error #cohens-kappa Calculating the Standard Error and Confidence Interval for Coh…

### Bounty: 50

I need to evaluate the performance of a machine learning application. One of the evaluation metrics chosen is Cohen’s Quadratic Kappa. I found this Python tutorial on how to calculate Cohen’s Quadratic Kappa. What is missing, however, is how to calculate the confidence interval.

Let’s walk through my example (I use a smaller data set for the sake of simplicity). I use NumPy and Scipy Stats for this purpose:

from math import sqrt
import numpy as np
from scipy.stats import norm


This is my confusion matrix:

# x: actuals, y: predictions
confusion_matrix = np.array([
    [9, 5, 2, 0, 0, 0],
    [4, 7, 1, 0, 0, 0],
    [1, 2, 4, 0, 1, 0],
    [0, 1, 1, 5, 1, 0],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 0, 0, 1],
], dtype=int)
rows = confusion_matrix.shape[0]
cols = confusion_matrix.shape[1]


I calculate a weight matrix and histograms:

weights = np.zeros((rows, cols))
for r in range(rows):
    for c in range(cols):
        weights[r, c] = float((r - c) ** 2) / (rows * cols)
hist_actual = np.sum(confusion_matrix, axis=0)
hist_prediction = np.sum(confusion_matrix, axis=1)


The expected prediction quality by mere chance is calculated as follows:

expected = np.outer(hist_actual, hist_prediction)


This matrix, and the actual confusion matrix, are normalized:

expected_norm = expected / expected.sum()
confusion_matrix_norm = confusion_matrix / confusion_matrix.sum()


Now I calculate the numerator (actual observed agreement) and the denominator (expected agreement by chance):

numerator = 0.0
denominator = 0.0
for r in range(rows):
    for c in range(cols):
        numerator += weights[r, c] * confusion_matrix_norm[r, c]
        denominator += weights[r, c] * expected_norm[r, c]


Cohen’s Kappa can now be calculated as:

weighted_kappa = (1 - (numerator/denominator))


Which gives me a result of 0.817.

Now to my question: I need to calculate the standard error, in order to calculate the confidence interval. Here’s my approach:

#            p(1-p)
# sek = sqrt -------
#            n(1-e)²
#
# p: numerator (actual observed agreement)
# e: denominator (expected agreement by chance)
# n: total number of predictions
total = hist_actual.sum()
sek = sqrt((numerator * (1 - numerator)) / (total * (1 - denominator) ** 2))


Can I use the total number of predictions, even though I calculate with a normalized numerator and denominator? This would result in a standard error of kappa of 0.023.

The 95% confidence interval then is just straightforward:

alpha = 0.95
margin = (1 - alpha) / 2  # two-tailed test
x = norm.ppf(1 - margin)
lower = weighted_kappa - x * sek
upper = weighted_kappa + x * sek


Which gives an interval of [0.772;0.861].
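As a sanity check on the analytic standard error, one could bootstrap the confusion matrix itself by multinomial resampling of its cells. This is a sketch of my own, with the kappa computation above wrapped in a self-contained function:

```python
import numpy as np

def quadratic_weighted_kappa(cm):
    """Quadratic weighted kappa, same computation as the snippets above."""
    cm = np.asarray(cm, dtype=float)
    n_cat = cm.shape[0]
    r, c = np.indices(cm.shape)
    weights = (r - c) ** 2 / (n_cat * n_cat)
    observed = cm / cm.sum()
    expected = np.outer(cm.sum(axis=0), cm.sum(axis=1))
    expected = expected / expected.sum()
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

cm = np.array([
    [9, 5, 2, 0, 0, 0],
    [4, 7, 1, 0, 0, 0],
    [1, 2, 4, 0, 1, 0],
    [0, 1, 1, 5, 1, 0],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 0, 0, 1],
])
kappa_hat = quadratic_weighted_kappa(cm)  # ~0.817, as above

# multinomial bootstrap over the 36 cells of the confusion matrix
rng = np.random.default_rng(0)
n_total = int(cm.sum())
probs = (cm / cm.sum()).ravel()
boot = [quadratic_weighted_kappa(rng.multinomial(n_total, probs).reshape(cm.shape))
        for _ in range(2000)]
se_boot = np.std(boot, ddof=1)
```

If `se_boot` lands close to the analytic 0.023, that would support plugging the total number of predictions into the formula as done above; a large discrepancy would suggest the normalization question matters.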


## #StackBounty: #mathematical-statistics #confidence-interval #maximum-likelihood #asymptotics How can I obtain an asymptotic $1-alpha$ …

### Bounty: 50

Let $$X\sim \text{Gamma}(\alpha,1)$$ and $$Y|X=x \sim \text{Exp}\left(\frac{1}{\theta x}\right)$$, where $$\alpha >1$$ and $$\theta >0$$ are unknown. Let $$\tau=E(Y)$$. Suppose that based on the random sample $$Y_1,\ldots,Y_n$$, we have MLEs $$\hat{\alpha}$$ and $$\hat{\theta}$$. Use these MLEs to develop an asymptotic $$1-\alpha$$ confidence interval for $$\tau$$.

my work:

First, I need to find $$\tau=E(Y)=E\left(\frac{1}{\theta x}\right)=\frac{1}{\theta}E\left(\frac{1}{x}\right)$$. We use the transformation $$T=\frac{1}{X}$$, where $$f_T(t)=\frac{1}{\Gamma(\alpha)t^{\alpha+1}}e^{-1/t},\quad t>0$$. However, I am having trouble evaluating $$E(T)=\int^\infty_0\frac{1}{\Gamma(\alpha)t^{\alpha}}e^{-1/t}\,dt$$.

Assuming we have $$\tau$$, we can get the asymptotic $$1-\alpha$$ CI by using the asymptotic property of the MLE. We know that $$\hat{\alpha}\sim AN\left(\alpha,\frac{1}{n\,i(\alpha)}\right)$$ and $$\hat{\theta} \sim AN\left(\theta,\frac{1}{n\,i(\theta)}\right)$$. However, I am failing to see how I can obtain the asymptotic CI for $$\tau$$.
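A hedged sketch of the two missing steps, assuming (as the question does) that the Exp parameter is the conditional mean. First, substituting $$u=1/t$$ in the troublesome integral gives

$$E(T)=\int_0^\infty \frac{u^{\alpha-2}}{\Gamma(\alpha)}e^{-u}\,du=\frac{\Gamma(\alpha-1)}{\Gamma(\alpha)}=\frac{1}{\alpha-1},\qquad \alpha>1,$$

so $$\tau=g(\alpha,\theta)=\frac{1}{\theta(\alpha-1)}$$. Second, the multivariate delta method turns the joint asymptotic normality of the MLEs into an interval for $$\tau$$:

$$\hat{\tau}=g(\hat{\alpha},\hat{\theta}),\qquad \widehat{\operatorname{Var}}(\hat{\tau})\approx \nabla g(\hat{\alpha},\hat{\theta})'\,\hat{I}_n^{-1}\,\nabla g(\hat{\alpha},\hat{\theta}),$$

where $$\hat{I}_n$$ is the Fisher information matrix for $$(\alpha,\theta)$$ from the sample, giving the asymptotic $$1-\alpha$$ interval

$$\hat{\tau}\pm z_{\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\hat{\tau})}.$$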


## #StackBounty: #confidence-interval #binomial Wikipedia's text about the Clopper-Pearson interval for binomial proportions

### Bounty: 50

I’m trying to understand the following text currently (i.e., 2019-09-25) in Wikipedia about the Clopper-Pearson interval:

> The Clopper–Pearson interval is an early and very common method for calculating binomial confidence intervals.[8] This is often called an ‘exact’ method, because it is based on the cumulative probabilities of the binomial distribution (i.e., exactly the correct distribution rather than an approximation). However, in cases where we know the population size, the intervals may not be the smallest possible, because they include impossible proportions: for instance, for a population of size 10, an interval of [0.35, 0.65] would be too large as the true proportion cannot lie between 0.35 and 0.4, or between 0.6 and 0.65.

I do understand that in the given example it would be impossible to get an outcome that would represent a binomial proportion of 0.35 (as this would require 3.5 successes, which is not a possible outcome).

However, I believe the CP-interval is meant to represent the range of underlying probabilities of success (the ‘true proportions’) that have some minimum probability to produce the observed (integer) outcome. As far as I can see, these ‘true proportions’ can take values between 0.35 and 0.4, or between 0.6 and 0.65.

Am I seeing this wrong, or is the cited text incorrect?



## #StackBounty: #confidence-interval #density-estimation Density estimation for big feature space

### Bounty: 50

Let’s say I have a data set with 100 features and a couple million samples. Whenever I get a new sample, I would like to estimate how many samples would have been around it in the original set (say, within an L1 distance of $$\varepsilon$$). How can I do this in an efficient way? To me it sounds like I want to estimate the (joint) density function at a particular point. Perhaps there’s a way to train a neural net that outputs such a density estimate based on the feature values.

Motivation: I would like to use the density value at a particular point to understand how confident I should be in my prediction at that point (the higher the density, the higher my confidence would be).
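For moderate dimensionality, an exact neighbor count (rather than a fitted density) may already be enough: a k-d tree supports range counting under the L1 metric. A minimal sketch with synthetic data standing in for the real feature matrix, using scipy's `cKDTree.query_ball_point` with `p=1` and `return_length=True`:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 5))  # stand-in for the real feature matrix
tree = cKDTree(X)

def neighbor_count(x_new, eps):
    # number of training points within L1 distance eps of x_new
    return tree.query_ball_point(x_new, r=eps, p=1, return_length=True)
```

One caveat: with 100 features a k-d tree degrades toward brute force, so for the full problem an approximate nearest-neighbor index or a dimensionality-reduction step in front of the tree may be needed.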


## #StackBounty: #bayesian #confidence-interval #frequentist #credible-interval When does a confidence interval "make sense" but…

### Bounty: 50

It is often the case that a confidence interval with 95% coverage is very similar to a credible interval that contains 95% of the posterior density. This happens when the prior used for the credible interval is uniform or nearly uniform. Thus a confidence interval can often be used to approximate a credible interval and vice versa. Importantly, we can conclude from this that the much-maligned misinterpretation of a confidence interval as a credible interval has little to no practical importance for many simple use cases.

There are a number of examples out there of cases where this does not happen; however, they all seem to be cherry-picked by proponents of Bayesian statistics in an attempt to prove there is something wrong with the frequentist approach. In these examples, we see the confidence interval contain impossible values, etc., which is supposed to show that such intervals are nonsense.

I don’t want to go back over those examples, or a philosophical discussion of Bayesian vs Frequentist.

I am just looking for examples of the opposite. Are there any cases where the confidence and credible intervals are substantially different, and the interval provided by the confidence procedure is clearly superior?


## Context

This is somewhat similar to this question, but I do not think it is an exact duplicate.

When you look for instructions on how to perform a bootstrap hypothesis test, it is usually stated that it is fine to use the empirical distribution for confidence intervals, but that you need to bootstrap from the distribution under the null hypothesis to get a correct p-value. As an example, see the accepted answer to this question. A general search on the internet mostly turns up similar answers.

The reason for not using a p-value based on the empirical distribution is that most of the time we do not have translation invariance.

## Example

Let me give a short example. We have a coin and we want to do a one-sided test to see whether the frequency of heads is larger than 0.5.

We perform $$n = 20$$ trials and get $$k = 14$$ heads. The true p-value for this test would be $$p = 0.058$$.

On the other hand, if we bootstrap our 14 out of 20 heads, we effectively sample from the binomial distribution with $$n = 20$$ and $$p = \frac{14}{20}=0.7$$. Shifting this distribution by subtracting 0.2, we get a barely significant result when testing our observed value of 0.7 against the obtained empirical distribution.

In this case the discrepancy is very small, but it gets larger when the success rate we test against gets close to 1.
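The example above is easy to check numerically. This sketch (mine, using scipy for the exact tail probability) reproduces both the exact p-value and the shifted-bootstrap one:

```python
import numpy as np
from scipy.stats import binom

n, k = 20, 14
p_hat = k / n

# exact one-sided p-value under H0: p = 0.5
p_exact = binom.sf(k - 1, n, 0.5)  # P(X >= 14 | p = 0.5) ≈ 0.058

# naive bootstrap from the empirical distribution, shifted to the null
rng = np.random.default_rng(0)
boot = rng.binomial(n, p_hat, size=100_000) / n  # resample at p̂ = 0.7
shifted = boot - (p_hat - 0.5)                   # recenter at 0.5
p_boot = np.mean(shifted >= p_hat)               # ≈ 0.035, "barely significant"
```

So the naive shifted bootstrap rejects at the 5% level while the exact test does not, which is exactly the discrepancy described above.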

## Question

Now let me come to the real point of my question: the very same defect also holds for confidence intervals. In fact, if a confidence interval has the stated confidence level $$\alpha$$, then the confidence interval not containing the parameter under the null hypothesis is equivalent to rejecting the null hypothesis at a significance level of $$1-\alpha$$.

Why is it that the confidence intervals based upon the empirical distribution are widely accepted and the p-value not?

Is there a deeper reason or are people just not as conservative with confidence intervals?

In this answer, Peter Dalgaard seems to agree with my argument. He says:

least not (much) worse than the calculation of CI.

Where is the (much) coming from? It implies that generating p-values that way is slightly worse, but does not elaborate on the point.

## Final thoughts

Also, in *An Introduction to the Bootstrap*, Efron and Tibshirani dedicate a lot of space to confidence intervals but not to p-values unless they are generated under a proper null hypothesis distribution, with the exception of one throwaway line about the general equivalence of confidence intervals and p-values in the chapter on permutation testing.

Let us also come back to the first question I linked. I agree with the answer by Michael Chernick, but he also argues that both confidence intervals and p-values based on the empirical bootstrap distribution are equally unreliable in some scenarios. That still does not explain why many people tell you that the intervals are OK but the p-values are not.


### Bounty: 50

Suppose $$X\sim N_3(0,\Sigma)$$, where $$\Sigma=\begin{pmatrix}1&\rho&\rho^2\\ \rho&1&\rho\\ \rho^2&\rho&1\end{pmatrix}$$.

On the basis of one observation $$x=(x_1,x_2,x_3)'$$, I have to obtain a confidence interval for $$\rho$$ with confidence coefficient $$1-\alpha$$.

We know that $$X'\Sigma^{-1}X\sim \chi^2_3$$.

So expanding the quadratic form, I get

$$x'\Sigma^{-1}x=\frac{1}{1-\rho^2}\left[x_1^2+(1+\rho^2)x_2^2+x_3^2-2\rho(x_1x_2+x_2x_3)\right]$$

To use this as a pivot for a two-sided C.I. with confidence level $$1-\alpha$$, I set up $$\chi^2_{1-\alpha/2,3}\le x'\Sigma^{-1}x\le \chi^2_{\alpha/2,3}$$

I get two inequalities of the form $$g_1(\rho)\le 0$$ and $$g_2(\rho)\ge 0$$, where

$$g_1(\rho)=(x_2^2+\chi^2_{\alpha/2,3})\rho^2-2(x_1x_2+x_2x_3)\rho+x_1^2+x_2^2+x_3^2-\chi^2_{\alpha/2,3}$$

and $$g_2(\rho)=(x_2^2+\chi^2_{1-\alpha/2,3})\rho^2-2(x_1x_2+x_2x_3)\rho+x_1^2+x_2^2+x_3^2-\chi^2_{1-\alpha/2,3}$$

Am I right in considering a two-sided C.I.? After solving the quadratics in $$\rho$$, I am guessing that the resulting C.I. would be quite complicated.

Another suitable pivot is $$\frac{\mathbf{1}'x}{\sqrt{\mathbf{1}'\Sigma\mathbf{1}}}\sim N(0,1),\quad \mathbf{1}=(1,1,1)'$$

With $$\bar x=\frac{1}{3}\sum x_i$$, this is the same as saying $$\frac{3\bar x}{\sqrt{3+4\rho+2\rho^2}}\sim N(0,1)$$

Using this, I start with the inequality $$\left|\frac{3\bar x}{\sqrt{3+4\rho+2\rho^2}}\right|\le z_{\alpha/2}$$

Therefore, $$\frac{9\bar x^2}{3+4\rho+2\rho^2}\le z^2_{\alpha/2}\implies 2(\rho+1)^2+1\ge \frac{9\bar x^2}{z^2_{\alpha/2}}$$

That is, $$\rho\ge \sqrt{\frac{9\bar x^2}{2z^2_{\alpha/2}}-\frac{1}{2}}-1$$

Since the question asks for any confidence interval, there are a number of options available here. I could also have squared the standard normal pivot to get a similar answer in terms of $$\chi^2_1$$ fractiles. I am quite sure that both methods I used are valid, but I am not certain whether the resulting C.I. is a valid one. I am also interested in other ways to find a confidence interval here.
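Both algebraic steps above are easy to verify numerically; this small check (my own) confirms the expanded quadratic form for $$x'\Sigma^{-1}x$$ and the value of $$\mathbf{1}'\Sigma\mathbf{1}$$ at an arbitrary $$\rho$$ and $$x$$:

```python
import numpy as np

rho = 0.4
Sigma = np.array([[1.0,    rho, rho**2],
                  [rho,    1.0, rho   ],
                  [rho**2, rho, 1.0   ]])
x = np.array([0.5, -1.2, 0.3])

# expanded closed form of x' Σ⁻¹ x derived above
q_closed = (x[0]**2 + (1 + rho**2) * x[1]**2 + x[2]**2
            - 2 * rho * (x[0] * x[1] + x[1] * x[2])) / (1 - rho**2)
q_direct = x @ np.linalg.solve(Sigma, x)

# 1' Σ 1 = 3 + 4ρ + 2ρ², the denominator of the second pivot
ones = np.ones(3)
v_closed = 3 + 4 * rho + 2 * rho**2
v_direct = ones @ Sigma @ ones
```

Both pairs agree to machine precision, which at least rules out algebra slips before tackling the validity question.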
