*Bounty: 100*

A friend of mine works on analysing 2000 tweets per day and categorizing them as positive, negative or neutral.

This is a really boring task, but the algorithms that do this classification automatically are not very good because they can’t detect sarcasm.

A simple way to make this task easier is to take a subsample of the original $N = 2000$ data points.

In some tests we saw that with $30\%$ of the data the normalized histograms of the subsample and of the original data points look very similar, but we need a better estimate of the error introduced by subsampling.
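
To make the comparison concrete, here is a minimal sketch of the kind of test described above. It is not the real data: the labels are synthetic, the class weights `[0.3, 0.2, 0.5]` are made up, and the helper names `normalized_histogram` and `total_variation` are only illustrative. Total variation distance is one possible way to put a number on "look very similar".

```python
import random
from collections import Counter

LABELS = (0, 1, 2)  # 0 = positive, 1 = negative, 2 = neutral


def normalized_histogram(labels):
    """Return the fraction of each label in LABELS."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[a] / total for a in LABELS]


def total_variation(p, q):
    """Total variation distance between two distributions on LABELS."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


rng = random.Random(0)  # fixed seed so the run is reproducible

# Synthetic stand-in for one day of labelled tweets (assumed class weights).
data = rng.choices(LABELS, weights=[0.3, 0.2, 0.5], k=2000)

# 30% subsample, drawn uniformly without replacement.
subsample = rng.sample(data, k=600)

full_hist = normalized_histogram(data)
sub_hist = normalized_histogram(subsample)
tv = total_variation(full_hist, sub_hist)

print("full data:", full_hist)
print("subsample:", sub_hist)
print("total variation distance:", tv)
```

Repeating the draw with different seeds gives a rough empirical picture of how much the subsample histogram fluctuates around the full one.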

Theoretically the data points are an i.i.d. sequence $(X_i)_{i=1}^N$ (big assumption) in the space $A = \{0,1,2\}$ (positive, negative, neutral). Let $(X_{(i)})_{i=1}^n$ be a subsample of size $n \leq N$ (draw $n$ elements uniformly without replacement).
In some sense I want to characterize the distribution of $(X_{(i)})_{i=1}^n$ in order to choose an $n$ such that the empirical distribution of $(X_{(i)})_{i=1}^n$ is close to the empirical distribution of $(X_i)_{i=1}^N$.
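
One possible starting point, sketched under assumptions: if the full day of data is treated as fixed, the number of subsampled points in a given class follows a hypergeometric distribution, so the fraction of that class in the subsample has variance $\frac{p(1-p)}{n}\cdot\frac{N-n}{N-1}$, where $p$ is the class fraction in the full data. The code below only evaluates this formula. The function names and the target error of `0.02` are hypothetical choices, and it bounds the error class by class rather than for the whole distribution at once.

```python
import math


def subsample_std_error(p, n, N):
    """Standard deviation of a class fraction in a size-n subsample drawn
    without replacement from N points, where p is that class's fraction
    in the full data (treated as fixed)."""
    if not 1 <= n <= N:
        raise ValueError("need 1 <= n <= N")
    if N == 1:
        return 0.0
    fpc = (N - n) / (N - 1)  # finite population correction
    return math.sqrt(p * (1 - p) / n * fpc)


def smallest_n(target, N, p=0.5):
    """Smallest n whose standard error is at most `target`.
    p = 0.5 is the worst case, since it maximizes p * (1 - p)."""
    for n in range(1, N + 1):
        if subsample_std_error(p, n, N) <= target:
            return n
    return N


N = 2000
print(subsample_std_error(0.5, 600, N))  # worst-case error of a 30% subsample
print(smallest_n(0.02, N))               # n needed for an illustrative target
```

The worst-case choice `p=0.5` avoids needing to know the class fractions in advance; if they are known approximately, passing them in gives a smaller required $n$.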

Any help will be appreciated.