I have some noisy high dimensional data, and each data point has a “score”. The scores are roughly normally distributed. Some scores are known and some are unknown; I want to separate the unknown points into two groups, based on whether I think the score is positive or not.

I have a black box which, given some data points and their scores, gives me a hyperplane correctly separating the points (if one exists).

I separate the points with known score into two disjoint sets for training and validation respectively.

Then, repeatedly (say *k* times), I do the following:

- Randomly select *m* data points with positive score and *n* data points with negative score from the training set (for some fixed positive integers *m* and *n*).
- Use the black box to (try to) get a separating hyperplane for these sampled points.
- If I get a hyperplane back, save it.

Now I have some hyperplanes (say *k’* of them, where 0 < *k’* <= *k*).

I use these hyperplanes to classify the validation set, and select the hyperplane that correctly classifies the most points as having positive or negative score (number of correct positives + number of correct negatives).
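For concreteness, the procedure above can be sketched as follows. The `black_box(points, labels)` interface is my hypothetical stand-in for the separating-hyperplane oracle: it returns a classifier function if the labelled points are separable, and `None` otherwise.

```python
import random

def select_best_hyperplane(train_pos, train_neg, validation, black_box,
                           m, n, k, seed=0):
    """Sketch of the sampling-and-selection procedure described above.

    `black_box(points, labels)` is a hypothetical interface: it returns a
    classifier (a function mapping a point to 0/1) if the labelled points
    are linearly separable, else None.  `validation` is a list of
    (point, label) pairs with known scores.
    """
    rng = random.Random(seed)
    hyperplanes = []
    for _ in range(k):
        # sample m positive and n negative training points
        sample = rng.sample(train_pos, m) + rng.sample(train_neg, n)
        labels = [1] * m + [0] * n
        h = black_box(sample, labels)
        if h is not None:          # separable: keep this hyperplane
            hyperplanes.append(h)
    if not hyperplanes:            # no sample was separable
        return None
    # keep the hyperplane that classifies most validation points correctly
    def n_correct(h):
        return sum(h(x) == y for x, y in validation)
    return max(hyperplanes, key=n_correct)
```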

My question is: **How can I measure the statistical confidence that the finally selected hyperplane is better than random?**

Here’s what I’ve done so far:

Say there are *n* points in the validation set (overloading *n* from the sampling step above; here it means the validation-set size). If a hyperplane correctly classifies a point with probability *p*, and this is independent for all the points, we can use a binomial distribution.

Let *F* be the CDF of this binomial distribution, and let *X* be the number of correctly classified points in the validation set (so we are assuming *X ~ B(n, p)*). Then *P(X <= x) = F(x)*.

Now, we have *k’* hyperplanes. Let’s assume their correct-classification counts can be represented as *k’* IID variables *X1, X2, …, Xk’*.

Then *P(max(X1, X2, …, Xk’) <= x) = F(x)^k’*.

Let’s say a random hyperplane is one as above where *p* equals the proportion of positive scores in the whole data set (so if three quarters of the scores are positive, *p = 0.75*).

Sticking some numbers in: let *p = 0.5* for simplicity, and suppose I want confidence > 0.95 that the selected hyperplane is better than random.

If *n = 2000*, I need to classify 1080 points correctly to have confidence greater than 0.95 that the selected classifier is better than random (I think, unless I did the calculation wrong).
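Here is a stdlib-only sketch of that calculation. The required count depends on *k’* (the number of saved hyperplanes), which I’ve left as a parameter; `scipy.stats.binom.ppf` would do the same job as the hand-rolled CDF scan below.

```python
import math

def smallest_x_with_cdf_at_least(target, n, p):
    """Smallest x with F(x) >= target, where F is the CDF of B(n, p).

    Computes the binomial PMF in log-space (stdlib only) and accumulates
    it until the CDF reaches the target.
    """
    cdf = 0.0
    for x in range(n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(x + 1)
                   - math.lgamma(n - x + 1)
                   + x * math.log(p) + (n - x) * math.log(1 - p))
        cdf += math.exp(log_pmf)
        if cdf >= target:
            return x
    return n

def required_correct(n, p, k_prime, alpha=0.05):
    """Smallest count c such that, under the null hypothesis of k'
    independent random hyperplanes, the best of them classifies >= c
    validation points correctly with probability <= alpha.

    Uses P(max(X1, ..., Xk') <= x) = F(x)**k', so we need the smallest
    c with F(c - 1) >= (1 - alpha)**(1 / k').
    """
    target = (1 - alpha) ** (1.0 / k_prime)
    return smallest_x_with_cdf_at_least(target, n, p) + 1
```

Note that the threshold grows with *k’*: selecting the best of many hyperplanes inflates the count you would expect under the null, so a single-hyperplane cutoff is not enough.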

**However**, if the points themselves are not independent, this doesn’t work. Suppose many of the points are identical, so that the effective size of the validation set is much smaller than *n*. If *n = 20*, you need to get 18 correct for 0.95 confidence; extrapolating, that suggests you’d need 1800 out of 2000.
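For the exact-duplicates scenario specifically, one crude heuristic I can imagine is to count distinct validation points and treat that as an upper bound on the effective sample size. This is only a hypothetical sketch, not an answer to the general dependence question: near-duplicates in noisy data would need a real clustering step or a dependence-aware test.

```python
def effective_size(points, decimals=6):
    """Hypothetical heuristic: effective validation-set size under the
    'many identical points' scenario, i.e. the number of distinct points
    after rounding each coordinate.  A crude upper bound only; it does
    nothing for near-duplicates or other forms of dependence.
    """
    return len({tuple(round(c, decimals) for c in pt) for pt in points})
```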

**I am sure that the points are not independent, but I’m not sure in what way, or how to go about measuring that and accounting for it in a calculation similar to the above.**
