#StackBounty: #classification #subsampling Choosing subsample size (helping a friend analysing a smaller data set)

Bounty: 100

A friend of mine analyses 2000 tweets per day and categorizes each one as positive, negative, or neutral.
This is a really boring task, but the algorithms that do this classification are not very good because they can't detect sarcasm.
A simple way to make the task easier is to take a subsample of the original $N = 2000$ data points.

In some tests we saw that with $30\%$ of the data the normalized histograms of the subsample and of the original data points look very similar, but we need a better estimate of the error introduced by subsampling.
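To put a rough number on that error, here is a minimal sketch (not a definitive answer) that uses the standard finite-population formula for a proportion sampled without replacement: if a category has proportion $p$ in the full $N$ points, its proportion in a subsample of size $n$ has standard deviation $\sqrt{\frac{p(1-p)}{n}\cdot\frac{N-n}{N-1}}$. The function name `subsample_std_error` is made up for illustration, and $p = 0.5$ is used only as the worst case, not as a value from the actual data.

```python
import math


def subsample_std_error(p, n, N):
    """Standard deviation of a category's proportion in a subsample of size n
    drawn uniformly without replacement from N points, given that the category
    has proportion p in the full N points (hypergeometric sampling)."""
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion in [0, 1]")
    if N == 1:
        return 0.0
    # Finite-population correction: shrinks to 0 as n approaches N.
    fpc = (N - n) / (N - 1)
    return math.sqrt(p * (1.0 - p) / n * fpc)


# Worst case p = 0.5 (assumed, not measured) with 30% of N = 2000.
se = subsample_std_error(0.5, 600, 2000)
print(f"worst-case standard error with n=600: {se:.4f}")
```

Under these assumptions the worst-case standard error for $n = 600$ is a bit under two percentage points per category, which is consistent with the histograms looking very similar.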

Theoretically the data points are an i.i.d. sequence $(X_i)_{i=1}^N$ (big assumption) in the space $A = \{0,1,2\}$ (positive, negative, neutral). Let $(X_{(i)})_{i=1}^n$ be a subsample of size $n \leq N$ (draw $n$ elements uniformly without replacement).
In some sense I want to characterize the distribution of $(X_{(i)})_{i=1}^n$ in order to choose an $n$ such that the empirical distribution of $(X_{(i)})_{i=1}^n$ is close to the empirical distribution of $(X_i)_{i=1}^N$.
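One way to explore this numerically is a small Monte Carlo sketch: repeatedly draw subsamples without replacement from one fully labelled day and look at the largest gap between the subsample and full-data class proportions. Everything below is an assumption for illustration: the helper names (`max_deviation`, `deviation_quantile`, `smallest_n`), the 95% level, and the 500/700/800 label mix, which is invented and not taken from the real tweets.

```python
import random
from collections import Counter


def max_deviation(labels, n, rng):
    """Largest absolute gap between class proportions in one subsample of
    size n (drawn without replacement) and in the full list of labels."""
    N = len(labels)
    full = Counter(labels)
    sub = Counter(rng.sample(labels, n))
    return max(abs(sub[c] / n - full[c] / N) for c in full)


def deviation_quantile(labels, n, q=0.95, trials=1000, seed=0):
    """Empirical q-quantile of max_deviation over repeated subsamples."""
    rng = random.Random(seed)
    devs = sorted(max_deviation(labels, n, rng) for _ in range(trials))
    return devs[min(trials - 1, int(q * trials))]


def smallest_n(labels, tol, q=0.95, step=50, trials=500, seed=0):
    """First n on a grid of multiples of step whose q-quantile deviation
    is at most tol; falls back to the full size N."""
    N = len(labels)
    for n in range(step, N + 1, step):
        if deviation_quantile(labels, n, q, trials, seed) <= tol:
            return n
    return N


# Hypothetical labelled day: 0 = positive, 1 = negative, 2 = neutral.
labels = [0] * 500 + [1] * 700 + [2] * 800
q95 = deviation_quantile(labels, 600)
print(f"95% of subsamples of size 600 deviate by at most {q95:.3f}")
```

Calling `smallest_n(labels, tol=0.03)` would then suggest a subsample size for a three-percentage-point tolerance on this made-up mix. Note that this only measures the error relative to the full day's labels, so it says nothing about whether the i.i.d. assumption holds.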

Any help will be appreciated.
