# #StackBounty: #classification #subsampling Choosing subsample size (helping a friend analysing a smaller data set)

### Bounty: 100

A friend of mine analyses 2000 tweets per day and categorizes each one as positive, negative, or neutral.
This is a really boring task, but the algorithms that do this classification are not very good because they can’t detect sarcasm.
A simple way to make the task easier is to label only a subsample of the original \$N = 2000\$ data points.

In some tests we saw that with \$30\%\$ of the data the normalized histograms of the subsample and of the original data points look very similar, but we need a better estimate of the error introduced by subsampling.
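
One way to put a number on "look very similar" is to repeat the subsampling many times and record the total variation distance between the subsample histogram and the full histogram. The sketch below is only an illustration: the labels are synthetic (the class proportions `[0.3, 0.2, 0.5]` are made up), and the names `tv_distance` and `simulate_subsample_error` are hypothetical helpers, not something from the original tests.

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

def simulate_subsample_error(labels, n, n_trials=2000, seed=0, n_classes=3):
    """Draw n labels without replacement n_trials times and return the
    TV distance of each subsample histogram to the full histogram."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    full = np.bincount(labels, minlength=n_classes) / labels.size
    dists = np.empty(n_trials)
    for t in range(n_trials):
        sub = rng.choice(labels, size=n, replace=False)
        sub_hist = np.bincount(sub, minlength=n_classes) / n
        dists[t] = tv_distance(sub_hist, full)
    return dists

# Synthetic stand-in for one day of labelled tweets (assumed proportions).
data_rng = np.random.default_rng(1)
labels = data_rng.choice(3, size=2000, p=[0.3, 0.2, 0.5])

# 30% subsample, as in the tests described above.
dists = simulate_subsample_error(labels, n=600)
print("mean TV distance:", dists.mean())
print("95% quantile:", np.quantile(dists, 0.95))
```

Running this on past days that were labelled in full would give an empirical error estimate for each candidate \$n\$.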

Theoretically, the data points are an i.i.d. sequence \$(X_i)_{i=1}^N\$ (a big assumption) taking values in the space \$A = \{0,1,2\}\$ (positive, negative, neutral). Let \$(X_{(i)})_{i=1}^n\$ be a subsample of size \$n \leq N\$ (draw \$n\$ elements uniformly without replacement).
In some sense I want to characterize the distribution of \$(X_{(i)})_{i=1}^n\$ in order to choose an \$n\$ such that the empirical distribution of \$(X_{(i)})_{i=1}^n\$ is close to the empirical distribution of \$(X_i)_{i=1}^N\$.
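
As a possible starting point (a sketch, not a full answer): if the \$N\$ labels are treated as fixed, the class counts in a subsample drawn without replacement follow a multivariate hypergeometric distribution. For a class with full-data proportion \$p\$, the subsample proportion then has mean \$p\$ and variance \$\frac{p(1-p)}{n} \cdot \frac{N-n}{N-1}\$. The code below uses this to get a standard error per class and the smallest \$n\$ that meets a target standard error. The function names and the target value 0.02 are assumptions chosen for illustration; \$p = 0.5\$ is the worst case.

```python
import math

def se_proportion(p, n, N):
    """Standard error of a class proportion in a subsample of size n
    drawn without replacement from N items (finite population correction)."""
    fpc = (N - n) / (N - 1)
    return math.sqrt(p * (1 - p) / n * fpc)

def min_subsample_size(N, max_se, p=0.5):
    """Smallest n whose standard error is at most max_se.
    Obtained by solving p(1-p)/n * (N-n)/(N-1) <= max_se**2 for n."""
    pq = p * (1 - p)
    return math.ceil(pq * N / (max_se ** 2 * (N - 1) + pq))

N = 2000
# Standard error of a worst-case class proportion with a 30% subsample.
print(se_proportion(0.5, 600, N))
# Smallest subsample with standard error at most 0.02 (hypothetical target).
print(min_subsample_size(N, max_se=0.02))
```

This controls each class proportion separately; a bound on the whole histogram at once (for example on the total variation distance) would need an extra step such as a union bound over the three classes.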

Any help will be appreciated.
