I need to evaluate the performance of a machine learning application. One of the evaluation metrics chosen is Cohen’s Quadratic Kappa. I found this Python tutorial on how to calculate Cohen’s Quadratic Kappa. What is missing, however, is how to calculate the confidence interval.
Let’s walk through my example (I use a smaller data set for the sake of simplicity). I use NumPy and SciPy’s `stats` module for this purpose:
from math import sqrt

import numpy as np
from scipy.stats import norm
This is my confusion matrix:
# x: actuals, y: predictions
confusion_matrix = np.array([
    [9, 5, 2, 0, 0, 0],
    [4, 7, 1, 0, 0, 0],
    [1, 2, 4, 0, 1, 0],
    [0, 1, 1, 5, 1, 0],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 0, 0, 1],
], dtype=int)  # the np.int alias was removed in NumPy 1.24; plain int is equivalent

# shape is a (rows, cols) tuple, so unpack it into two integers
rows, cols = confusion_matrix.shape
I calculate a weight matrix and histograms:
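# quadratic disagreement weights: 0 on the diagonal, growing with distance
weights = np.zeros((rows, cols))
for r in range(rows):
    for c in range(cols):
        weights[r, c] = float(((r-c)**2)/(rows*cols))

# marginal totals of the confusion matrix
hist_actual = np.sum(confusion_matrix, axis=0)
hist_prediction = np.sum(confusion_matrix, axis=1)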
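The double loop for the weight matrix can also be written with NumPy broadcasting. This is only a sketch, assuming six rating categories as in the example; `n_classes`, `idx`, `weights_vectorized` and `weights_loop` are names introduced here for illustration and do not appear in the tutorial:

```python
import numpy as np

n_classes = 6  # assumption: same number of categories as the example matrix

# A column vector minus a row vector broadcasts to an (n, n) grid of r - c.
idx = np.arange(n_classes)
weights_vectorized = (idx[:, None] - idx[None, :]) ** 2 / (n_classes * n_classes)

# Reference: the explicit double loop, written out again for comparison.
weights_loop = np.zeros((n_classes, n_classes))
for r in range(n_classes):
    for c in range(n_classes):
        weights_loop[r, c] = ((r - c) ** 2) / (n_classes * n_classes)

print(np.allclose(weights_vectorized, weights_loop))
```

The same constant `rows*cols` divides every weight, so it cancels in the ratio of the two weighted sums and does not change the kappa value itself.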
The expected prediction quality by mere chance is calculated as follows:
expected = np.outer(hist_actual, hist_prediction)
This matrix, and the actual confusion matrix, are normalized:
expected_norm = expected / expected.sum()
confusion_matrix_norm = confusion_matrix / confusion_matrix.sum()
Now I calculate the numerator (actual observed agreement) and the denominator (expected agreement by chance):
# both accumulators must start at zero before the loop
numerator = 0.0
denominator = 0.0
for r in range(rows):
    for c in range(cols):
        numerator += weights[r, c] * confusion_matrix_norm[r, c]
        denominator += weights[r, c] * expected_norm[r, c]
Cohen’s Kappa can now be calculated as:
weighted_kappa = (1 - (numerator/denominator))
Which gives me a result of 0.817.
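As a cross-check, the whole kappa calculation can be condensed into one function. This is a minimal sketch rather than the tutorial's code: `quadratic_weighted_kappa` and `example_cm` are hypothetical names, and the weights are left unnormalized because the constant cancels in the ratio. It is meant to reproduce the 0.817 reported above.

```python
import numpy as np

def quadratic_weighted_kappa(cm):
    """Quadratic weighted kappa from a square confusion matrix (sketch)."""
    cm = np.asarray(cm, dtype=float)
    n = cm.shape[0]
    idx = np.arange(n)
    # Unnormalized quadratic weights; any constant factor cancels below.
    weights = (idx[:, None] - idx[None, :]) ** 2
    total = cm.sum()
    observed = cm / total
    # Outer product of the row and column marginals, as proportions.
    expected = np.outer(cm.sum(axis=1), cm.sum(axis=0)) / total ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

example_cm = [
    [9, 5, 2, 0, 0, 0],
    [4, 7, 1, 0, 0, 0],
    [1, 2, 4, 0, 1, 0],
    [0, 1, 1, 5, 1, 0],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 0, 0, 1],
]
kappa = quadratic_weighted_kappa(example_cm)
print(round(kappa, 3))
```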
Now to my question: I need to calculate the standard error, in order to calculate the confidence interval. Here’s my approach:
#              p(1-p)
# sek = sqrt( -------- )
#             n(1-e)²
#
# p: numerator (actual observed agreement)
# e: denominator (expected agreement by chance)
# n: total number of predictions
total = hist_actual.sum()
sek = sqrt((numerator * (1 - numerator)) / (total * (1 - denominator) ** 2))
Can I use the total number of predictions, even though I calculate with a normalized numerator and denominator? This would result in a standard error of kappa of 0.023.
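For reference, here is the formula from the code comment, written out with the symbols defined there ($p$ for the numerator, $e$ for the denominator, $n$ for the total number of predictions):

```latex
\mathrm{SE}_\kappa = \sqrt{\frac{p\,(1 - p)}{n\,(1 - e)^{2}}}
```

In this form $p$ and $e$ are normalized quantities while $n$ is a raw count, which is exactly how the code above combines them.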
The 95% confidence interval is then straightforward:
alpha = 0.95  # confidence level
margin = (1 - alpha) / 2  # two-tailed test
x = norm.ppf(1 - margin)  # critical value of the standard normal
lower = weighted_kappa - x * sek
upper = weighted_kappa + x * sek
Which gives an interval of [0.772;0.861].
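The interval step can be checked without SciPy, using `statistics.NormalDist` from the standard library. This is a sketch under the same normal approximation; `normal_interval` is a hypothetical helper, and the inputs are the rounded values quoted in this post, so the last digit may differ slightly from the interval above.

```python
from statistics import NormalDist

def normal_interval(estimate, std_error, confidence=0.95):
    """Two-sided normal-approximation interval around an estimate (sketch)."""
    tail = (1 - confidence) / 2         # probability mass in each tail
    z = NormalDist().inv_cdf(1 - tail)  # standard normal critical value
    return estimate - z * std_error, estimate + z * std_error

# Rounded kappa and standard error as reported in the text.
lower, upper = normal_interval(0.817, 0.023)
print(f"[{lower:.3f}; {upper:.3f}]")
```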