
I need to evaluate the performance of a machine learning application. One of the evaluation metrics chosen is Cohen’s Quadratic Kappa. I found this Python tutorial on how to calculate Cohen’s Quadratic Kappa. What is missing, however, is how to calculate the confidence interval.

Let’s walk through my example (I use a smaller data set for the sake of simplicity). I use NumPy and `scipy.stats` for this purpose:

```
from math import sqrt
import numpy as np
from scipy.stats import norm
```

This is my confusion matrix:

```
# x: actuals, y: predictions
confusion_matrix = np.array([
    [9, 5, 2, 0, 0, 0],
    [4, 7, 1, 0, 0, 0],
    [1, 2, 4, 0, 1, 0],
    [0, 1, 1, 5, 1, 0],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 0, 0, 1],
], dtype=int)  # np.int was removed in NumPy 1.24; the builtin int is equivalent
rows, cols = confusion_matrix.shape
```

I calculate a weight matrix and histograms:

```
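# Quadratic weights: zero on the diagonal, growing with the squared class distance
weights = np.zeros((rows, cols))
for r in range(rows):
    for c in range(cols):
        weights[r, c] = ((r - c) ** 2) / (rows * cols)

# Marginal totals: column sums (actuals) and row sums (predictions)
hist_actual = np.sum(confusion_matrix, axis=0)
hist_prediction = np.sum(confusion_matrix, axis=1)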
```
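As a side note, the nested loop for the weight matrix can also be written with NumPy broadcasting. The following is only a minimal sketch under the assumption of a 6×6 matrix as above; the names `weights_loop`, `weights_vec`, `r_idx` and `c_idx` are introduced here for illustration and are not part of the original code:

```python
import numpy as np

rows, cols = 6, 6  # shape of the confusion matrix used in this post

# Loop version, equivalent to the code above
weights_loop = np.zeros((rows, cols))
for r in range(rows):
    for c in range(cols):
        weights_loop[r, c] = ((r - c) ** 2) / (rows * cols)

# Broadcasting version: a column of row indices minus a row of column indices
r_idx = np.arange(rows).reshape(-1, 1)
c_idx = np.arange(cols).reshape(1, -1)
weights_vec = (r_idx - c_idx) ** 2 / (rows * cols)

# Both constructions should give the same matrix
print(np.allclose(weights_loop, weights_vec))
```

Both versions produce a symmetric matrix with zeros on the diagonal, so exact matches carry no penalty.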

The expected prediction quality by mere chance is calculated as follows:

```
expected = np.outer(hist_actual, hist_prediction)
```

This matrix, and the actual confusion matrix, are normalized:

```
expected_norm = expected / expected.sum()
confusion_matrix_norm = confusion_matrix / confusion_matrix.sum()
```
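To make the normalization step easier to check, here is a small self-contained sketch. It assumes the same confusion matrix as above and shows that normalizing the outer product of the marginal counts is the same as taking the outer product of the marginal proportions; `expected_from_props` is a name made up for this illustration:

```python
import numpy as np

confusion_matrix = np.array([
    [9, 5, 2, 0, 0, 0],
    [4, 7, 1, 0, 0, 0],
    [1, 2, 4, 0, 1, 0],
    [0, 1, 1, 5, 1, 0],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 0, 0, 1],
], dtype=int)

total = confusion_matrix.sum()
hist_actual = confusion_matrix.sum(axis=0)
hist_prediction = confusion_matrix.sum(axis=1)

# As in the post: outer product of the counts, then normalize
expected = np.outer(hist_actual, hist_prediction)
expected_norm = expected / expected.sum()

# Equivalent view: outer product of the marginal proportions
expected_from_props = np.outer(hist_actual / total, hist_prediction / total)

print(np.allclose(expected_norm, expected_from_props))
print(np.isclose(expected_norm.sum(), 1.0))
```

Both normalized matrices sum to 1, so they can be read as joint probability tables.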

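Now I calculate the numerator (actual observed agreement) and the denominator (expected agreement by chance). Note that, because the weights are zero on the diagonal, both weighted sums grow with disagreement, which is why kappa is computed below as one minus their ratio: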

```
numerator = 0.0
denominator = 0.0
for r in range(rows):
    for c in range(cols):
        numerator += weights[r, c] * confusion_matrix_norm[r, c]
        denominator += weights[r, c] * expected_norm[r, c]
```

Cohen’s Kappa can now be calculated as:

```
weighted_kappa = (1 - (numerator/denominator))
```
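For cross-checking, the steps so far can be collected into one vectorized function. This is a sketch, not a reference implementation: `quadratic_weighted_kappa` is a hypothetical helper name, and it uses unscaled weights `(r - c) ** 2`, on the assumption that any constant scaling of the weights cancels in the ratio:

```python
import numpy as np

def quadratic_weighted_kappa(cm):
    """Quadratically weighted kappa from a square confusion matrix (hypothetical helper)."""
    cm = np.asarray(cm, dtype=float)
    n_classes = cm.shape[0]
    idx = np.arange(n_classes)
    # Squared class distance; a constant factor such as 1/(rows*cols) cancels in the ratio
    weights = (idx[:, None] - idx[None, :]) ** 2
    observed = cm / cm.sum()
    # Chance table from the marginals. The post builds the transpose of this,
    # which gives the same weighted sum because the weights are symmetric.
    expected = np.outer(cm.sum(axis=1), cm.sum(axis=0)) / cm.sum() ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

confusion_matrix = np.array([
    [9, 5, 2, 0, 0, 0],
    [4, 7, 1, 0, 0, 0],
    [1, 2, 4, 0, 1, 0],
    [0, 1, 1, 5, 1, 0],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 0, 0, 1],
])

kappa = quadratic_weighted_kappa(confusion_matrix)
print(round(kappa, 3))
```

A purely diagonal confusion matrix gives a kappa of 1 with this function, which is a quick sanity check on the sign convention.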

Which gives me a result of **0.817**.

Now to my question: I need to calculate the standard error, in order to calculate the confidence interval. Here’s my approach:

```
#               p(1-p)
# sek = sqrt( --------- )
#              n(1-e)²
#
# p: numerator (actual observed agreement)
# e: denominator (expected agreement by chance)
# n: total number of predictions
total = hist_actual.sum()
sek = sqrt((numerator * (1 - numerator)) / (total * (1 - denominator) ** 2))
```
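For readability, the quantities used so far can be typeset as follows. This only restates the code above and makes no claim that this standard error is the appropriate one for a weighted kappa. The symbols $w_{rc}$, $O_{rc}$ and $E_{rc}$ are notation assumed here for `weights`, `confusion_matrix_norm` and `expected_norm`:

```latex
\kappa_w = 1 - \frac{p}{e}, \qquad
p = \sum_{r,c} w_{rc}\, O_{rc}, \qquad
e = \sum_{r,c} w_{rc}\, E_{rc}, \qquad
\mathrm{sek} = \sqrt{\frac{p\,(1-p)}{n\,(1-e)^{2}}}
```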

Can I use the total number of predictions, even though I calculate with a normalized numerator and denominator? This would result in a standard error of kappa of **0.023**.

The 95% confidence interval then follows directly:

```
confidence = 0.95
alpha = 1 - confidence
x = norm.ppf(1 - alpha / 2)  # two-sided critical value (about 1.96)
lower = weighted_kappa - x * sek
upper = weighted_kappa + x * sek
```
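The interval step can also be wrapped in a small reusable function. The sketch below assumes a normal approximation, exactly as in the snippet above; `normal_confidence_interval` is a hypothetical helper name, and the inputs are the rounded kappa and standard error reported in this post:

```python
from scipy.stats import norm

def normal_confidence_interval(estimate, std_err, confidence=0.95):
    """Two-sided normal-approximation interval (hypothetical helper)."""
    # Critical value for the requested confidence level, e.g. about 1.96 for 95%
    z = norm.ppf(1 - (1 - confidence) / 2)
    return estimate - z * std_err, estimate + z * std_err

# Rounded values from this post; using rounded inputs can shift the last digit
lower, upper = normal_confidence_interval(0.817, 0.023)
print(f"[{lower:.3f}; {upper:.3f}]")
```

Passing a different `confidence`, for example `0.99`, widens the interval without touching the rest of the calculation.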

Which gives an interval of **[0.772;0.861]**.
