I have some noisy high dimensional data, and each data point has a “score”. The scores are roughly normally distributed. Some scores are known and some are unknown; I want to separate the unknown points into two groups, based on whether I think the score is positive or not.
I have a black box which, given some data points and their scores, gives me a hyperplane correctly separating the points (if one exists).
I separate the points with known score into two disjoint sets for training and validation respectively.
Then, repeatedly (say k times), I do the following:
- Randomly select m data points with positive score and n points with negative score from the training set (for some fixed positive values of m and n).
- Use the black box to (try to) get a separating hyperplane for these sampled points.
- If I get a hyperplane back, save it.
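The loop above can be sketched roughly as follows. This is only an illustration: `collect_hyperplanes` and `perceptron_black_box` are hypothetical names, and the perceptron is a toy stand-in for the real black box (unlike the real one, it may give up on a separable sample if it runs out of epochs).

```python
import random

def collect_hyperplanes(points, scores, black_box, m, n, k, seed=0):
    """Repeat the sample-and-fit step k times and keep every hyperplane found."""
    rng = random.Random(seed)
    pos = [i for i, s in enumerate(scores) if s > 0]
    neg = [i for i, s in enumerate(scores) if s <= 0]
    planes = []
    for _ in range(k):
        # m positives and n negatives, drawn without replacement
        idx = rng.sample(pos, m) + rng.sample(neg, n)
        sample = [points[i] for i in idx]
        labels = [1 if scores[i] > 0 else -1 for i in idx]
        plane = black_box(sample, labels)  # (w, b) or None
        if plane is not None:
            planes.append(plane)
    return planes

def perceptron_black_box(sample, labels, max_epochs=1000):
    """Toy stand-in for the black box (an assumption, not the real solver)."""
    w = [0.0] * len(sample[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(sample, labels):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            return w, b  # every sampled point is on the correct side
    return None  # did not converge within max_epochs
```

The number of hyperplanes returned is the k’ used below.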
Now I have some hyperplanes (say I have 0 < k’ <= k of them).
I use these hyperplanes to separate the validation set. I select the hyperplane which correctly classifies the most points as having positive or negative score (number of correct positives + number of correct negatives).
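As a sketch of that selection step, assuming each hyperplane is stored as a pair `(w, b)` and a point x is called positive when w·x + b > 0 (the function names are mine, for illustration only):

```python
def count_correct(plane, points, scores):
    """Number of points whose predicted sign matches the sign of the score."""
    w, b = plane
    correct = 0
    for x, s in zip(points, scores):
        predicted_positive = sum(wi * xi for wi, xi in zip(w, x)) + b > 0
        if predicted_positive == (s > 0):
            correct += 1
    return correct

def select_best(planes, val_points, val_scores):
    """Return the hyperplane with the most correct validation points, and that count."""
    counts = [count_correct(p, val_points, val_scores) for p in planes]
    best = max(range(len(planes)), key=counts.__getitem__)  # first one wins ties
    return planes[best], counts[best]
```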
My question is: How can I measure the statistical confidence that the finally selected hyperplane is better than random?
Here’s what I’ve done so far:
Say there are n points in the validation set (this n is unrelated to the sample size n used above). If a hyperplane correctly classifies a point with probability p, and this is independent for all the points, we can use a binomial distribution.
Let F be the cdf of the binomial distribution. Let X be the number of correctly classified points in the validation set (so we are assuming X ~ B(n, p)). Then P(X <= x) = F(x).
Now, we have k’ hyperplanes. Let’s assume these can be represented as k’ IID variables X1, X2, …, Xk’.
Now P(max(X1, X2, …, Xk’) <= x) = F(x) ^ k’.
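Spelled out, the step uses the fact that the maximum is at most x exactly when every count is at most x, and then the IID assumption. The last line is my reading of how this becomes a test, with x_obs (my notation) the count achieved by the selected hyperplane and alpha = 0.05 for 0.95 confidence:

```latex
\begin{aligned}
P\bigl(\max(X_1,\dots,X_{k'}) \le x\bigr)
  &= P(X_1 \le x,\ \dots,\ X_{k'} \le x) \\
  &= \prod_{i=1}^{k'} P(X_i \le x) \\
  &= F(x)^{k'}, \\[4pt]
\text{reject ``no better than random'' if}\quad
  & 1 - F(x_{\mathrm{obs}} - 1)^{k'} \le \alpha .
\end{aligned}
```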
Let’s say a random hyperplane is one as above where p equals the proportion of positive scores in the total (so if it’s three quarters positive, p = 0.75).
Plugging in some numbers: let p = 0.5 for simplicity, and suppose I want confidence greater than 0.95 that the selected hyperplane is better than random.
If n = 2000, I need to classify 1080 correctly to have confidence greater than 0.95 that this classifier is better than random (I think, unless I did the calculation wrong).
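For anyone who wants to check the calculation, here is a minimal sketch of it using only the standard library. The function names are mine, and k’ has to be supplied by the caller; the value below is a placeholder, not the k’ behind the figure above.

```python
import math

def binom_pmf(i, n, p):
    """P(X = i) for X ~ B(n, p), computed in log space to avoid underflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
               + i * math.log(p) + (n - i) * math.log1p(-p))
    return math.exp(log_pmf)

def min_correct_needed(n, p, k_prime, confidence=0.95):
    """Smallest x such that P(max of k' IID B(n, p) counts < x) >= confidence."""
    cdf = 0.0  # P(X <= x - 1) at the start of each iteration
    for x in range(n + 1):
        if cdf ** k_prime >= confidence:
            return x
        cdf += binom_pmf(x, n, p)
    return n + 1  # even a perfect score would not reach the confidence level

# Placeholder k' = 100, chosen only for illustration
print(min_correct_needed(20, 0.5, 100))    # prints 18
print(min_correct_needed(2000, 0.5, 100))
```

The threshold grows with k’, because taking the best of more hyperplanes makes a high count by luck more likely.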
However, if the points themselves are not independent, this doesn’t work. Suppose many of the points are identical so the effective size of the set is much smaller than n. If n = 20, you need to get 18 correct for 0.95 confidence; extrapolating that suggests you’d need 1800/2000.
I am sure that the points are not independent, but I’m not sure in what way, or how to go about measuring that and accounting for it in a calculation similar to the above.