#StackBounty: #neural-networks #dataset #sample Is it better to split sequences into overlapping or non-overlapping training samples?

Bounty: 50

I have $N$ (time) sequences of data with length $2048$. Each of these sequences correseponds to a different target output. However, I know that only a small part of the sequence is needed to actually predict this target output, say a sub-sequence of length $128$.

I could split up each of the sequences into $16$ partitions of $128$, so that I end up with $16N$ training smaples. However, I could drastically increase the number of training samples if I use a sliding window instead: there are $2048-128 = 1920$ unique sub-sequences of length $128$ that preserve the time series. That means I could in fact generate $1920N$ unique training samples, even though most of the input is overlapping.

I could also use a larger increment between individual "windows", which would reduce the number of sub-sequences but it could remove any autocorrelation between them.

Is it better to split my data into $16N$ non-overlapping sub-sequences or $1920N$ partially overlapping sub-sequences?

Get this bounty!!!

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.