- I’m training a Neural Network for a classification task.
- I have a dataset of 1M samples (100 features per sample) collected over a period of 5 days. The data for each feature always comes from the same sensor; one example of a feature is a temperature reading.
- My training and validation sets are sampled (without shuffling) via 8-fold cross-validation from the data of the first 4 days. Phrased differently: I always train on 3.5 days of data and validate on the consecutive half day of samples. To be clear, this half day's worth of validation data can come from any of the first four days.
- My test set is the data from the 5th day.
- My model takes one data sample at a time as input and outputs a prediction for the correct class. No historical measurements are included and no predictions of future states are made; a data sample only ever contains the current sensor readings.
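To make the splitting scheme concrete, here is a minimal sketch of how I build the folds. The sample counts are illustrative back-of-the-envelope numbers (1M samples over 5 days, assumed evenly spread, so roughly 200k per day), not exact properties of my dataset:

```python
import numpy as np

# Illustrative numbers: ~200k samples/day, days 1-4 used for CV,
# day 5 held out as the test set.
n_per_day = 200_000
cv_idx = np.arange(4 * n_per_day)                      # days 1-4
test_idx = np.arange(4 * n_per_day, 5 * n_per_day)     # day 5

# 8 contiguous (unshuffled) folds of half a day each. Each fold in
# turn serves as validation; the remaining 3.5 days are training.
folds = np.array_split(cv_idx, 8)
for i, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # train on train_idx (3.5 days), validate on val_idx (0.5 day)
```

Note that for 7 of the 8 folds, the validation block sits in the middle of (or before) the training data, which is exactly what makes me wonder about look-ahead bias.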
From what I understand, cross-validation is usually applied to time series data slightly differently than what I presented above (see for example this post), e.g. with some type of forward-chaining / rolling method, to avoid look-ahead bias and because the samples cannot be assumed to be fully i.i.d. My data could be called a sort of time series, even though I do not necessarily model or treat it that way: I only ever feed one data sample at a time to my network, without including any historical measurements. Because of this, my gut tells me that plain k-fold cross-validation should be fine in this particular case, and that I would only need to change the approach if the problem were properly modelled as a time series task (for example by feeding several historical samples at a time to the model to estimate the current state). Is my gut right or wrong about this? If it is wrong, why?
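For reference, the forward-chaining alternative I mean can be sketched with scikit-learn's `TimeSeriesSplit` (the sample count is again illustrative; I have not actually switched my pipeline to this):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Expanding-window forward chaining: every validation fold lies
# strictly after its training data, so no look-ahead is possible.
X = np.arange(800_000).reshape(-1, 1)  # days 1-4, kept in time order

tscv = TimeSeriesSplit(n_splits=7)
for train_idx, val_idx in tscv.split(X):
    # the training window grows; validation is always the next chunk
    assert train_idx.max() < val_idx.min()
```

The question is whether this stricter scheme is actually necessary when the model itself never looks at more than one timestamp.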