When using K-fold cross-validation for a neural network, do we:
1. Pick the initial weights of the network randomly and save them (let’s call them $W_0$).
2. Split the data into $N$ equal chunks.
3. Train the model on $N-1$ chunks, validating against the left-out chunk (the $K$’th chunk).
4. Record the validation error and revert the weights back to $W_0$.
5. Shift $K$ by 1 and repeat from step 3.
6. Average the validation errors to get a better estimate of how the network will generalize on this data.
7. Revert to $W_0$ one last time and train the network on the ENTIRE dataset.
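To make the procedure I have in mind concrete, here is a minimal NumPy sketch of it. Everything in it is an assumption made for illustration: the tiny one-hidden-layer regression network, the helper names (`init_weights`, `train`, `k_fold_cv`), the synthetic data, and the fixed epoch count. It is not meant as the correct way to do it, only as a runnable version of the list above.

```python
import numpy as np

def init_weights(n_in, n_hidden, rng):
    """Draw W_0 once: weights of a tiny one-hidden-layer regression network."""
    return {
        "W1": rng.normal(0.0, 0.5, size=(n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.5, size=(n_hidden, 1)),
        "b2": np.zeros(1),
    }

def forward(w, X):
    h = np.tanh(X @ w["W1"] + w["b1"])
    return h, (h @ w["W2"] + w["b2"]).ravel()

def mse(w, X, y):
    return float(np.mean((forward(w, X)[1] - y) ** 2))

def train(w0, X, y, epochs, lr=0.05):
    """Full-batch gradient descent starting from a COPY of w0 (w0 is never modified)."""
    w = {name: value.copy() for name, value in w0.items()}
    n = len(y)
    for _ in range(epochs):
        h, pred = forward(w, X)
        d_out = (2.0 / n) * (pred - y)[:, None]      # dLoss/dPred, shape (n, 1)
        d_h = (d_out @ w["W2"].T) * (1.0 - h ** 2)    # backprop through tanh
        w["W2"] -= lr * (h.T @ d_out)
        w["b2"] -= lr * d_out.sum(axis=0)
        w["W1"] -= lr * (X.T @ d_h)
        w["b1"] -= lr * d_h.sum(axis=0)
    return w

def k_fold_cv(X, y, n_folds, epochs, seed=0):
    rng = np.random.default_rng(seed)
    w0 = init_weights(X.shape[1], 8, rng)                      # pick and save W_0
    folds = np.array_split(rng.permutation(len(y)), n_folds)   # split into chunks
    errors = []
    for k in range(n_folds):                                   # rotate the held-out chunk
        val_idx = folds[k]
        train_idx = np.concatenate([f for i, f in enumerate(folds) if i != k])
        w = train(w0, X[train_idx], y[train_idx], epochs)      # every fold restarts from W_0
        errors.append(mse(w, X[val_idx], y[val_idx]))
    cv_error = float(np.mean(errors))                          # average the validation errors
    final_w = train(w0, X, y, epochs)                          # final fit on the entire dataset
    return cv_error, errors, final_w, w0

# Synthetic data, purely for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=100)

cv_error, fold_errors, final_w, w0 = k_fold_cv(X, y, n_folds=5, epochs=200)
print("per-fold MSE:", np.round(fold_errors, 3), "mean:", round(cv_error, 3))
```

Here "reverting to $W_0$" is implemented by copying the saved weights at the start of each training run, so no fold ever sees weights that were already fitted to its validation chunk.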
Question 1:
I assume step 7 is justified because step 6 has already given us a good estimate of how the network will generalize. Is this assumption correct?
Question 2:
Is reverting to the initial $W_0$ (as we do in steps 4 and 7) a necessity, because we would otherwise overfit?
Question 3, most important:
Assume we’ve made it to step 7 and will train the model on the ENTIRE dataset, so no data is left to validate against once training finishes. In that case, how do we know when to stop training the model during step 7?
Sure, we can train for the same number of epochs as we did during cross-validation. But then how can we be sure that the cross-validation runs themselves used an appropriate number of epochs in the first place?
Please note that during steps 3, 4, and 5 we only had the $K$’th chunk for comparing training loss against validation loss. That chunk is very small, so during the actual cross-validation it was unclear when to early-stop. To make things worse, it becomes even more difficult with leave-one-out (also known as all-but-one), where the $K$’th chunk is a single training example.
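For reference, this is roughly what I mean by reusing the number of epochs from cross-validation. It is only a sketch under my own assumptions: a plain linear model trained by gradient descent stands in for the network, the helper names (`best_epoch_on_fold`, `epochs_from_cv`) are made up, and taking the median of the per-fold best epochs is just one possible choice. I am not claiming this is the right way to pick the stopping point.

```python
import numpy as np

def best_epoch_on_fold(w0, X_tr, y_tr, X_val, y_val, max_epochs, lr=0.05):
    """Train from a copy of w0 and return (epoch with lowest validation MSE, that MSE)."""
    w = w0.copy()
    best_epoch = 0
    best_loss = float(np.mean((X_val @ w - y_val) ** 2))
    for epoch in range(1, max_epochs + 1):
        grad = (2.0 / len(y_tr)) * X_tr.T @ (X_tr @ w - y_tr)
        w -= lr * grad
        val_loss = float(np.mean((X_val @ w - y_val) ** 2))
        if val_loss < best_loss:
            best_epoch, best_loss = epoch, val_loss
    return best_epoch, best_loss

def epochs_from_cv(X, y, n_folds, max_epochs, seed=0):
    rng = np.random.default_rng(seed)
    w0 = rng.normal(0.0, 0.1, size=X.shape[1])                 # saved initial weights
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    best_epochs = []
    for k in range(n_folds):
        val_idx = folds[k]
        tr_idx = np.concatenate([f for i, f in enumerate(folds) if i != k])
        epoch, _ = best_epoch_on_fold(
            w0, X[tr_idx], y[tr_idx], X[val_idx], y[val_idx], max_epochs
        )
        best_epochs.append(epoch)
    # Assumption: use the median best epoch as the budget for the final full-data fit.
    return best_epochs, int(np.median(best_epochs)), w0

# Synthetic data, purely for illustration: few samples, many features, noisy target.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 20))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=60)

best_epochs, final_epochs, w0 = epochs_from_cv(X, y, n_folds=5, max_epochs=300)
print("best epoch per fold:", best_epochs, "-> epochs for the final fit:", final_epochs)
```

The spread of `best_epochs` across folds is exactly what worries me: with small validation chunks the per-fold stopping points can differ a lot, and with leave-one-out each one would be decided by a single example.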