To my knowledge, the phenomenon of deep double descent is still not well understood, but several authors have reported what they call:
- Model-wise double descent ("double descents" observed as models get bigger). This is framed in the abstract as:

  > The bias-variance trade-off implies that a model should balance under-fitting and over-fitting: rich enough to express underlying structure in data, simple enough to avoid fitting spurious patterns. However, in modern practice, very rich models such as neural networks are trained to exactly fit (i.e., interpolate) the data. Classically, such models would be considered over-fit, and yet they often obtain high accuracy on test data. This apparent contradiction has raised questions about the mathematical foundations of machine learning and their relevance to practitioners.
- Sample-wise non-monotonicity ("double descents" observed as more training data is added).
- Epoch-wise double descent ("double descents" observed over longer training times).
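The model-wise case in particular is easy to reproduce in a small simulation. The sketch below (a noisy linear target fit with random ReLU features by minimum-norm least squares; all sizes, seeds, and names are my own choices, not taken from any particular paper) typically shows test error spiking near the interpolation threshold, where the number of features matches the number of training points, and then descending again:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: noisy linear target in d dimensions.
n_train, n_test, d = 40, 500, 10
w_true = rng.normal(size=d)
X_tr = rng.normal(size=(n_train, d))
X_te = rng.normal(size=(n_test, d))
y_tr = X_tr @ w_true + 0.5 * rng.normal(size=n_train)
y_te = X_te @ w_true

def features(X, W):
    # Random ReLU features: a fixed random projection followed by max(., 0).
    return np.maximum(X @ W, 0.0)

widths = [5, 10, 20, 40, 80, 160, 320]  # interpolation threshold at 40 = n_train
test_mse = []
for p in widths:
    W = rng.normal(size=(d, p)) / np.sqrt(d)
    Phi_tr, Phi_te = features(X_tr, W), features(X_te, W)
    # Minimum-norm least squares via the pseudoinverse; once p >= n_train
    # this interpolates the training data exactly.
    beta = np.linalg.pinv(Phi_tr) @ y_tr
    test_mse.append(float(np.mean((Phi_te @ beta - y_te) ** 2)))
```

Plotting `test_mse` against `widths` usually gives the characteristic two-descent shape, with the peak at or just below `p = n_train`.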
There are also studies suggesting that these double descents of the empirical risk may be explained (at least for the MSE and cross-entropy losses) by the fact that the variance term specifically is unimodal.
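That decomposition can itself be estimated numerically: refit the same model on many resampled noisy training sets and split the test MSE into (squared) bias and variance of the predictions. The sketch below (same hypothetical random-feature setup as above, my own choices throughout) estimates both terms at several widths; the unimodality claim is that `variance` peaks near the interpolation threshold while `bias2` keeps decreasing:

```python
import numpy as np

rng = np.random.default_rng(1)

n_train, n_test, d, n_repeats = 30, 200, 8, 50
w_true = rng.normal(size=d)
X_te = rng.normal(size=(n_test, d))
y_te = X_te @ w_true  # noiseless test targets, so MSE = bias^2 + variance

widths = [5, 15, 30, 60, 120]  # interpolation threshold at 30 = n_train
bias2, variance = [], []
for p in widths:
    W = rng.normal(size=(d, p)) / np.sqrt(d)
    Phi_te = np.maximum(X_te @ W, 0.0)
    preds = np.empty((n_repeats, n_test))
    for r in range(n_repeats):
        # Fresh training set (inputs and label noise) on every repeat.
        X_tr = rng.normal(size=(n_train, d))
        y_tr = X_tr @ w_true + 0.5 * rng.normal(size=n_train)
        Phi_tr = np.maximum(X_tr @ W, 0.0)
        preds[r] = Phi_te @ (np.linalg.pinv(Phi_tr) @ y_tr)
    mean_pred = preds.mean(axis=0)
    bias2.append(float(np.mean((mean_pred - y_te) ** 2)))
    variance.append(float(np.mean(preds.var(axis=0))))
```

Here bias and variance are taken over the draw of the training set (inputs plus label noise), with the random features held fixed per width.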
Has this type of non-monotonic phenomenon been reported or formally studied for more than two descents?