I am familiar with “regular” cross-validation, but now I want to make time-series predictions using cross-validation with a simple linear regression function.
I write down a simple example to help clarify my two questions: one about the train/test split, and one about how to train/test models when the aim is to predict n steps in advance, for different n.
(1) The data
Assume I have data for timepoints 1,…,10 as follows:
timeseries = [0.5,0.3,10,4,5,6,1,0.4,0.1,0.9]
(2) Transforming the data into a format useful for supervised learning
As far as I understand, we can use “lags”, i.e. shifts in the data to create a dataset suited for supervised learning:
input = [NaN,0.5,0.3,10,4,5,6,1,0.4,0.1]
output/response = [0.5,0.3,10,4,5,6,1,0.4,0.1,0.9]
Here I have simply shifted the time series by one to create the output vector.
As far as I understand, I could now use input as the input for a linear regression model and output as the response (the NaN could be approximated or replaced with a random value).
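For concreteness, here is a minimal numpy sketch of the lag-1 construction I mean (the variable names are my own, not from any library):

```python
import numpy as np

# The series from step (1)
timeseries = np.array([0.5, 0.3, 10, 4, 5, 6, 1, 0.4, 0.1, 0.9])

# Shift by one step: the input at time t is the value at t-1,
# and the response at time t is the value at t itself.
inputs = np.concatenate(([np.nan], timeseries[:-1]))
outputs = timeseries.copy()

# inputs  is [NaN, 0.5, 0.3, 10, 4, 5, 6, 1, 0.4, 0.1]
# outputs is [0.5, 0.3, 10, 4, 5, 6, 1, 0.4, 0.1, 0.9]
```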
(3) Question 1: Cross-validation (“backtesting”)
Say I now want to do 2 splits: do I have to shift the train as well as the test sets?
I.e. something like:
Training set: [NaN,0.5,0.3,10,4,5]
Test set: [1,0.4,0.1]
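To make the question concrete, here is the kind of single backtesting fold I have in mind, using numpy's `polyfit` as a stand-in for the linear regression (the 6/3 split sizes are only illustrative; the NaN pair is dropped by pairing lagged and original values directly):

```python
import numpy as np

timeseries = np.array([0.5, 0.3, 10, 4, 5, 6, 1, 0.4, 0.1, 0.9])

# Lag-1 pairs without the NaN: X[i] is the value at time i,
# y[i] is the value one step later.
X = timeseries[:-1]
y = timeseries[1:]

# One backtesting fold: fit on the first 6 pairs, test on the last 3.
# The key point is that X and y are split TOGETHER, so every test pair
# (X[i], y[i]) keeps its one-step alignment.
X_train, y_train = X[:6], y[:6]
X_test, y_test = X[6:], y[6:]

# Simple linear regression y = slope * x + intercept
slope, intercept = np.polyfit(X_train, y_train, 1)
preds = slope * X_test + intercept
```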
(4) Question 2: Predicting different lags in advance
Obviously, I have shifted the dependent variable relative to the independent variable by 1. Suppose now I would like to train a model that can predict 5 time steps in advance: can I keep this lag of one and nevertheless use the model to predict n+1,…,n+5, or do I change the shift from independent to dependent variable to 5? What exactly is the difference?
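The two options I am asking about, sketched in code (the names `direct_pred` and `recursive_pred` are mine): either shift by 5 and train one model that jumps 5 steps at once, or keep the lag-1 model and apply it recursively 5 times, feeding each prediction back in as the next input.

```python
import numpy as np

timeseries = np.array([0.5, 0.3, 10, 4, 5, 6, 1, 0.4, 0.1, 0.9])
h = 5  # prediction horizon

# Option A ("direct"): shift by 5, so the model maps the value at
# time t to the value at time t+5, and one prediction covers the jump.
X_direct = timeseries[:-h]
y_direct = timeseries[h:]
slope_d, icept_d = np.polyfit(X_direct, y_direct, 1)
direct_pred = slope_d * timeseries[-1] + icept_d

# Option B ("recursive"): keep the lag-1 model and iterate it 5 times,
# each time using the previous prediction as the new input.
X1, y1 = timeseries[:-1], timeseries[1:]
slope_r, icept_r = np.polyfit(X1, y1, 1)
x = timeseries[-1]
for _ in range(h):
    x = slope_r * x + icept_r
recursive_pred = x
```

With option B the errors of each one-step prediction feed into the next input, which is the practical difference I would like to understand better.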