I need to evaluate the performance of a machine learning application. One of the evaluation metrics chosen is Cohen’s Quadratic Kappa. I found this Python tutorial on how to calculate Cohen’s Quadratic Kappa. What is missing, however, is how to calculate the confidence interval.
Let’s walk through my example (I use a smaller data set for the sake of simplicity). I use NumPy and SciPy’s stats module for this purpose:
from math import sqrt
import numpy as np
from scipy.stats import norm
This is my confusion matrix:
# rows: actuals, columns: predictions
confusion_matrix = np.array([
[9, 5, 2, 0, 0, 0],
[4, 7, 1, 0, 0, 0],
[1, 2, 4, 0, 1, 0],
[0, 1, 1, 5, 1, 0],
[0, 0, 0, 1, 2, 1],
[0, 0, 0, 0, 0, 1],
], dtype=int)
rows = confusion_matrix.shape[0]
cols = confusion_matrix.shape[1]
I calculate a weight matrix and histograms:
weights = np.zeros((rows, cols))
for r in range(rows):
    for c in range(cols):
        weights[r, c] = ((r - c) ** 2) / (rows * cols)
hist_actual = np.sum(confusion_matrix, axis=1)      # row sums: actual ratings
hist_prediction = np.sum(confusion_matrix, axis=0)  # column sums: predicted ratings
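As an aside, the same weight matrix can also be built without the explicit loops via NumPy broadcasting (just a sketch; the names row_idx, col_idx and weights_broadcast are mine):
# squared distance from the diagonal, scaled by the constant rows*cols,
# equivalent to the nested loop above
row_idx = np.arange(rows).reshape(-1, 1)   # column vector of row indices
col_idx = np.arange(cols).reshape(1, -1)   # row vector of column indices
weights_broadcast = (row_idx - col_idx) ** 2 / (rows * cols)
assert np.allclose(weights_broadcast, weights)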
The confusion matrix expected by mere chance is the outer product of the two histograms:
expected = np.outer(hist_actual, hist_prediction)
This matrix, and the actual confusion matrix, are normalized:
expected_norm = expected / expected.sum()
confusion_matrix_norm = confusion_matrix / confusion_matrix.sum()
Now I calculate the numerator (observed weighted disagreement) and the denominator (weighted disagreement expected by chance):
numerator = 0.0
denominator = 0.0
for r in range(rows):
    for c in range(cols):
        numerator += weights[r, c] * confusion_matrix_norm[r, c]
        denominator += weights[r, c] * expected_norm[r, c]
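(For what it’s worth, the double loop is equivalent to an element-wise product followed by a sum; a quick sketch to confirm, with the _check names being mine:)
# element-wise product and sum, same numbers as the loop above
numerator_check = np.sum(weights * confusion_matrix_norm)
denominator_check = np.sum(weights * expected_norm)
assert np.isclose(numerator_check, numerator)
assert np.isclose(denominator_check, denominator)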
Cohen’s quadratic weighted kappa can now be calculated as:
weighted_kappa = (1 - (numerator/denominator))
Which gives me a result of 0.817.
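To double-check that value, the confusion matrix can be expanded back into individual (actual, prediction) pairs and passed to scikit-learn’s cohen_kappa_score with quadratic weights (a sketch, assuming scikit-learn is available; the reconstruction step is mine):
from sklearn.metrics import cohen_kappa_score

# expand the confusion matrix into one (actual, prediction) pair per count
actuals, predictions = [], []
for r in range(rows):
    for c in range(cols):
        actuals.extend([r] * int(confusion_matrix[r, c]))
        predictions.extend([c] * int(confusion_matrix[r, c]))

# should agree with the manual calculation (~0.817)
print(cohen_kappa_score(actuals, predictions, weights="quadratic"))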
Now to my question: I need to calculate the standard error, in order to calculate the confidence interval. Here’s my approach:
# Standard error of kappa (normal approximation):
#
#   sek = sqrt( p * (1 - p) / (n * (1 - e)^2) )
#
# p: numerator (observed weighted disagreement)
# e: denominator (weighted disagreement expected by chance)
# n: total number of predictions
total = hist_actual.sum()
sek = sqrt((numerator * (1 - numerator)) / (total * (1 - denominator) ** 2))
Can I use the total number of predictions, even though I calculate with a normalized numerator and denominator? This would result in a standard error of kappa of 0.023.
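As a sanity check on this analytic standard error, one could also bootstrap over the 49 rating pairs (just a sketch, reusing the actuals/predictions lists from the scikit-learn check above; the number of replicates and the seed are arbitrary choices of mine):
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(42)            # fixed seed, arbitrary
actuals_arr = np.array(actuals)
predictions_arr = np.array(predictions)
n_pairs = len(actuals_arr)

# resample the (actual, prediction) pairs with replacement and recompute kappa
boot_kappas = []
for _ in range(2000):
    idx = rng.integers(0, n_pairs, size=n_pairs)
    boot_kappas.append(
        cohen_kappa_score(actuals_arr[idx], predictions_arr[idx], weights="quadratic")
    )

print(np.std(boot_kappas))                      # bootstrap estimate of the standard error
print(np.percentile(boot_kappas, [2.5, 97.5]))  # percentile confidence interval
If the bootstrap spread and percentile interval roughly agree with the analytic numbers, that would support the approach.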
The 95% confidence interval is then straightforward:
alpha = 0.95
margin = (1 - alpha) / 2   # two-sided interval
z = norm.ppf(1 - margin)   # critical value, ~1.96
lower = weighted_kappa - z * sek
upper = weighted_kappa + z * sek
Which gives an interval of [0.772, 0.861].
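For reference, SciPy can produce the same two-sided interval in one call (same normal approximation as above):
# first argument is the confidence level (0.95); equivalent to the manual ppf-based calculation
print(norm.interval(alpha, loc=weighted_kappa, scale=sek))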