#StackBounty: #r #hypothesis-testing #distributions #t-test #sampling How do I know whether my sample is fair?

Bounty: 100

I have a time series dataset with 500 million rows, twenty-six columns and 400 thousand unique actors.

It’s too much data for me to process all at once, so I want to take a fair sample of my data.

Here is a description of the columns:

  1. Four of the columns are IDs; the 400K figure refers to the most granular ID
  2. Two of the columns are dates: one for the day we recorded the row and another for the actor’s creation date
  3. Eight of the columns are numeric variables that cluster between 0 and 5 and have a long tail reaching into the hundreds
  4. Eight of the columns are factors with five or fewer levels. For most of these factors, 80% of unique actors fall into one or two levels and the remaining 20% are split across the other one to three levels
  5. Four of the columns are factors with many levels. The largest has hundreds of levels and the other three have about seventy each. 80% of the rows fall in the top fifty levels of the largest factor and in the top twenty levels of each of the other three

My plan is to take a simple random sample of 120 thousand unique actors and keep only the rows that belong to those actors.
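A minimal sketch of that plan in base R, on toy data. The column names `actor_id` and `value` and the table sizes are placeholders invented for illustration, not my real schema; the 30% fraction mirrors 120K out of 400K actors.

```r
set.seed(42)  # fixed seed so the sample is reproducible

# Toy stand-in for the real data: 1,000 actors with several rows each.
# `actor_id` is a hypothetical name for the most granular ID column.
n_actors <- 1000
events <- data.frame(
  actor_id = sample(seq_len(n_actors), size = 20000, replace = TRUE),
  value    = rexp(20000, rate = 0.5)
)

# Draw a simple random sample of actors (30%, mirroring 120K of 400K).
actor_ids   <- unique(events$actor_id)
sampled_ids <- sample(actor_ids, size = round(0.3 * length(actor_ids)))

# Keep every row of each sampled actor, so each actor's time series stays whole.
events_sample <- events[events$actor_id %in% sampled_ids, ]
```

Sampling on actors rather than on rows means each selected actor keeps its full history, at the cost of the sampled row count varying with how active the chosen actors are.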

I’m concerned that the sample won’t be representative of my population, in part because the data becomes extremely sparse in the less popular factor levels and in the long tails of the numeric variables.
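One way I could imagine checking this, sketched on toy actor-level data: compare the sample's factor-level counts against the known population shares with a chi-squared goodness-of-fit test, and compare upper quantiles of a long-tailed numeric. The columns `segment` and `spend` are hypothetical, and a large p-value here would only mean no mismatch was detected, not that the sample is proven fair.

```r
set.seed(1)

# Toy actor-level table; `segment` is a hypothetical skewed factor and
# `spend` a hypothetical long-tailed numeric variable.
n <- 5000
actors <- data.frame(
  actor_id = seq_len(n),
  segment  = factor(sample(c("a", "b", "c", "d", "e"), n, replace = TRUE,
                           prob = c(0.5, 0.3, 0.1, 0.07, 0.03))),
  spend    = rlnorm(n, meanlog = 0.5, sdlog = 1)
)

# Simple random sample of 30% of the actors.
smp <- actors[sample(n, size = round(0.3 * n)), ]

# Factor check: goodness-of-fit of sample counts against population shares.
pop_share  <- prop.table(table(actors$segment))
smp_counts <- table(smp$segment)
gof <- chisq.test(x = as.numeric(smp_counts), p = as.numeric(pop_share))

# Numeric check: compare the median and the upper tail side by side.
probs <- c(0.5, 0.9, 0.99)
quantile_check <- rbind(
  population = quantile(actors$spend, probs),
  sample     = quantile(smp$spend, probs)
)

print(gof$p.value)
print(quantile_check)
```

For very sparse levels the expected counts can drop below the usual rule of thumb for the chi-squared approximation; `chisq.test` accepts `simulate.p.value = TRUE` for a Monte Carlo p-value in that case.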

