I have a dataset of the following form:
client_id | date | client_attr_1 | client_attr_2 | client_attr3 | money_spend
--- | --- | --- | --- | --- | ---
1 | 2020-01-01 | 123 | 321 | 188 | 150.24
1 | 2020-01-02 | 123 | 321 | 188 | 18.25
1 | 2020-01-03 | 123 | 321 | 188 | 12.34
2 | 2020-01-02 | 233 | 421 | 181 | 10.10
2 | 2020-01-03 | 233 | 421 | 181 | 20.00
2 | 2020-01-04 | 233 | 421 | 181 | 11.12
2 | 2020-01-01 | 233 | 421 | 181 | 18.36
3 | 2020-02-01 | 723 | 301 | 255 | 1.14
3 | 2020-02-01 | 723 | 301 | 255 | 1.19
My goal is to predict spending (money_spend) for new clients, day by day.
The validation procedure should give a performance estimate that is not biased by group or time leakage.
I imagine that an ideal validation scheme, one that reflects the situation at prediction time, would take the following into account:
- Groups – clients: ensure that a client’s observations never appear in the train and validation sets at the same time.
- Time – make sure that the model is not trained on future clients and evaluated on clients from the past, to avoid look-ahead bias.
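To make the two requirements concrete, here is a minimal sketch of how a single split respecting both could look. The function name `group_time_split`, the single `cutoff` date, and the rule "a client belongs to validation if first seen on or after the cutoff" are my own assumptions for illustration, not an existing library API; the rows are the ones from the table above.

```python
from datetime import date

# Toy rows: (client_id, date, money_spend), copied from the example table.
rows = [
    (1, date(2020, 1, 1), 150.24),
    (1, date(2020, 1, 2), 18.25),
    (1, date(2020, 1, 3), 12.34),
    (2, date(2020, 1, 2), 10.10),
    (2, date(2020, 1, 3), 20.00),
    (2, date(2020, 1, 4), 11.12),
    (2, date(2020, 1, 1), 18.36),
    (3, date(2020, 2, 1), 1.14),
    (3, date(2020, 2, 1), 1.19),
]


def group_time_split(rows, cutoff):
    """Return (train_idx, valid_idx) for one group- and time-aware split.

    Validation: every row of clients first seen on or after `cutoff`.
    Training: rows of earlier clients, restricted to dates before `cutoff`.
    (Hypothetical helper; the assignment rule is an assumption.)
    """
    # First date on which each client appears.
    first_seen = {}
    for client, day, _ in rows:
        if client not in first_seen or day < first_seen[client]:
            first_seen[client] = day

    train_idx, valid_idx = [], []
    for i, (client, day, _) in enumerate(rows):
        if first_seen[client] >= cutoff:
            valid_idx.append(i)   # "new" client: held out entirely
        elif day < cutoff:
            train_idx.append(i)   # "old" client, and only past dates
        # Rows of old clients on or after the cutoff are dropped.
    return train_idx, valid_idx


train_idx, valid_idx = group_time_split(rows, cutoff=date(2020, 2, 1))
train_clients = {rows[i][0] for i in train_idx}
valid_clients = {rows[i][0] for i in valid_idx}
```

Repeating this with several cutoffs gives an expanding-window variant, which is where the unequal train/validation sizes come from.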
I find this a bit inconvenient, as it requires implementing a custom validation procedure, which could cause additional problems (e.g. highly different train/test sizes with repeated validation). Therefore, I’d like to drop the second requirement. For that to be reasonable, I believe I need to check whether the spend time series of different clients are dependent (correlated) on the same dates (I assume they are not).
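The check I have in mind could be sketched roughly as below: aggregate spend per client and date, then compute the Pearson correlation for each pair of clients over the dates they share. The helper names and the `min_shared` threshold are my own assumptions, and pairwise Pearson correlation is only one possible measure of same-date dependence.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Toy rows: (client_id, ISO date, money_spend), copied from the example table.
rows = [
    (1, "2020-01-01", 150.24), (1, "2020-01-02", 18.25), (1, "2020-01-03", 12.34),
    (2, "2020-01-02", 10.10), (2, "2020-01-03", 20.00), (2, "2020-01-04", 11.12),
    (2, "2020-01-01", 18.36), (3, "2020-02-01", 1.14), (3, "2020-02-01", 1.19),
]


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists, or None if undefined."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None  # a constant series has no defined correlation
    return cov / sqrt(vx * vy)


def same_date_correlations(rows, min_shared=3):
    """Correlate daily spend between each client pair over shared dates."""
    # client -> {date: total spend that day}; duplicate dates are summed.
    daily = defaultdict(dict)
    for client, day, spend in rows:
        daily[client][day] = daily[client].get(day, 0.0) + spend

    result = {}
    for a, b in combinations(sorted(daily), 2):
        shared = sorted(set(daily[a]) & set(daily[b]))
        if len(shared) < min_shared:
            continue  # too few common dates to say anything
        r = pearson([daily[a][d] for d in shared],
                    [daily[b][d] for d in shared])
        if r is not None:
            result[(a, b)] = r
    return result


correlations = same_date_correlations(rows)
```

On the toy table only clients 1 and 2 share enough dates to be compared; with three points the value is of course meaningless, so in practice I would look at the distribution of these coefficients over many client pairs.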
Now the questions are:
- Is it the right thing to check?
- Is comparing time series of different clients on the same dates enough?
- Is there a better/proper way to assess such dependency?
- Or perhaps I don’t need to validate this, or anything else, for reasons I’m not seeing?