I have a time series dataset with 500 million rows, twenty-six columns and 400 thousand unique actors.
It’s too much data for me to process all at once, so I want to take a fair sample of my data.
Here are the relevant details about the columns:
- Four of the columns are IDs, and the 400K unique actors correspond to the most granular of those IDs
- Two of the columns are dates: one for the day the row was recorded and another for the actor’s creation date
- Eight of the columns are numeric variables that cluster between 0 and 5 and have a long right tail reaching into the hundreds
- Eight of the columns are factors with five or fewer levels. Most of these factors put about 80% of unique actors into one or two buckets and split the remaining 20% across the other one to three buckets
- Four of the columns are factors with many levels. The largest has levels numbering in the hundreds, and the other three have about seventy each. Roughly 80% of the rows fall into the top fifty levels of the largest factor and the top twenty levels of each of the other three.
My plan is to take a simple random sample of 120 thousand unique actors and filter the dataset down to the rows belonging to those actors.
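
Roughly, I'd do it in two passes so I never hold all 500M rows in memory. This is just a sketch of the idea; the file name `events.csv` and the `actor_id` column are placeholders for my actual schema:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the draw is reproducible
CHUNK = 5_000_000

# Pass 1: collect the unique actor IDs without loading all 500M rows at once.
unique_actors = set()
for chunk in pd.read_csv("events.csv", usecols=["actor_id"], chunksize=CHUNK):
    unique_actors.update(chunk["actor_id"].unique())

# Simple random sample of 120K of the ~400K actors, drawn without replacement.
sampled = set(rng.choice(sorted(unique_actors), size=120_000, replace=False))

# Pass 2: keep only the rows belonging to the sampled actors.
parts = []
for chunk in pd.read_csv("events.csv", chunksize=CHUNK):
    parts.append(chunk[chunk["actor_id"].isin(sampled)])
sample_df = pd.concat(parts, ignore_index=True)
```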
I’m concerned that I’ll get a sample that’s not representative of my population, in part because the data becomes extremely sparse in the less popular factor levels and in the long tails of the numeric variables.
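
To make that concern concrete, this is the kind of check I'd run after sampling: compare each level's share of unique actors in the sample against the full data, and look at the rarest levels. The function and column names here are my own placeholders, not anything from an existing library:

```python
import pandas as pd

def level_share_gap(full_df: pd.DataFrame, sample_df: pd.DataFrame,
                    factor: str, id_col: str = "actor_id") -> pd.DataFrame:
    """Compare each level's share of unique actors in the sample vs. the full data.

    Large relative gaps on the low-frequency levels are exactly the sparsity
    problem I'm worried about.
    """
    full = full_df.groupby(factor)[id_col].nunique()
    samp = sample_df.groupby(factor)[id_col].nunique()
    out = pd.concat(
        {"full_share": full / full.sum(), "sample_share": samp / samp.sum()},
        axis=1,
    ).fillna(0)  # levels missing from the sample get a sample_share of 0
    out["relative_gap"] = (out["sample_share"] - out["full_share"]) / out["full_share"]
    return out.sort_values("full_share")  # rarest levels first
```

I'd run that per factor (e.g. `level_share_gap(full_df, sample_df, "big_factor")`) and expect the worst gaps, including levels that vanish entirely, to show up at the top of the output.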