I would like to train my LSTM with a “synthetic gradients” Decoupled Neural Interface (DNI).
How do I decide on the number of layers and neurons for my DNI?
Searching for them by trial and error, or worse, by a genetic algorithm, would seem to defeat the purpose of synthetic gradients.
And if my DNI is an LSTM itself, it seems it would take even longer to determine its optimal structure.
Synthetic gradients speed up training by allowing multiple forward passes with immediate weight adjustments, since the DNI already predicts the future gradient.
However, we would lose time "experiencing" a few hundred training sessions just to find the optimal DNI structure that predicts gradients best.
By that time we could have already finished training with old-school Backpropagation Through Time.
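To make the question concrete, here is a minimal sketch of what I understand a DNI to do. All shapes, the quadratic toy loss, and the choice of a single linear layer for the DNI are my own assumptions for illustration, not anything prescribed by the synthetic gradients paper (which also conditions the predictor on labels):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical): a hidden state h of size H feeds a downstream
# loss L = 0.5 * ||h - target||^2, so the true gradient dL/dh = h - target
# is known exactly and lets us check the DNI's prediction.
H = 8
target = rng.normal(size=H)

# The DNI here is a single linear layer: g_hat = h @ M + b.
M = np.zeros((H, H))
b = np.zeros(H)
lr = 0.05

for step in range(2000):
    h = rng.normal(size=H)     # stand-in for an LSTM activation
    g_hat = h @ M + b          # synthetic gradient: available immediately
    g_true = h - target        # true gradient, which "arrives later"
    # Train the DNI itself on the L2 error between predicted and true gradient.
    err = g_hat - g_true
    M -= lr * np.outer(h, err)
    b -= lr * err

# After training, the synthetic gradient should closely match the true one.
h = rng.normal(size=H)
print(np.abs((h @ M + b) - (h - target)).max())
```

In this toy case even a single linear layer recovers the gradient exactly, which is part of why I am unsure how much DNI capacity a real LSTM would need.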
Also, how should we prevent our DNI from overfitting, and how do we monitor it to make sure that is not happening?
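One check I imagined (a hypothetical helper, not something from the paper) is to periodically compare the synthetic gradient against a true gradient computed on a held-out batch, e.g. via cosine similarity, and watch for drift:

```python
import numpy as np

def grad_agreement(g_hat, g_true):
    """Cosine similarity between a synthetic gradient g_hat and the true
    gradient g_true; values near 1.0 mean the DNI still points the right way."""
    num = float(np.dot(g_hat, g_true))
    den = float(np.linalg.norm(g_hat) * np.linalg.norm(g_true)) + 1e-12
    return num / den

# Parallel gradients agree fully (similarity ~1.0); would logging this
# on held-out activations be a reasonable overfitting monitor?
sim = grad_agreement(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
print(round(sim, 6))
```

Is tracking something like this on data the DNI was not trained on a sensible way to detect overfitting, or is there a standard practice?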