I am working on some complicated regression problems that I am fitting with deep neural networks. In order to make these deep networks trainable, there are normalisation steps all over the place in my networks. The output in natural units is, of course, not normalised.
What I have been doing so far is adding a simple zero-bias multiplicative neuron at the end of each regression output, so that the network can learn an appropriate ‘denormalisation’ itself. However, I just realised this has quite an adverse effect on the speed of convergence of my network; likely due to the strong coupling of this neuron with all other unknowns, which creates a very poorly conditioned ‘valley’ for gradient descent to contend with.
Initialising the weight of this denormalising neuron with, for example, the mean of the target vector helps a lot in further training; it cuts the iterations required to reach the same loss by an order of magnitude. Yet I am worried I am still adversely influencing training: even with good initialisation, the condition number will likely suffer. That's what happens when you add depth to a network in general, but do I have to waste my depth on this?
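For concreteness, here is a minimal numpy sketch of what I mean (the class name and methods are made up for illustration, not from my actual framework; the real thing is just one extra node in the graph):

```python
import numpy as np

class DenormLayer:
    """A single trainable scale applied to the network's normalised output:
    y = w * x, with zero bias. Every upstream parameter's gradient is
    multiplied by w, which is what couples it to the rest of the network."""

    def __init__(self, init_scale=1.0):
        self.w = float(init_scale)

    def forward(self, x):
        return self.w * x

    def backward(self, x, grad_out, lr=1e-3):
        # Gradient of the loss w.r.t. the single weight: sum over the batch.
        self.w -= lr * float(np.sum(grad_out * x))
        # Gradient passed to the rest of the network is scaled by w,
        # which is where the conditioning problem comes from.
        return self.w * grad_out

# Initialising the scale at the mean of the training targets, as described:
targets = np.array([95.0, 110.0, 102.5])
layer = DenormLayer(init_scale=targets.mean())
```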
The obvious alternative is to do this kind of denormalisation outside of my network. But then the ratio is non-trainable and might still ‘clash’ with the normalisation steps inside the network. And I don't like having another little bit of state to keep track of alongside my model; it's a single-weight linear transform; my fancy computational graph framework should be able to handle it, right?
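The external variant would look something like this (again a made-up sketch; `net` stands in for the trained model, and the scale is a fixed statistic computed once from the training targets rather than a trainable weight):

```python
import numpy as np

def denormalise(y_normalised, scale):
    """Fixed, non-trainable denormalisation applied outside the network."""
    return y_normalised * scale

# Scale computed once from the training targets and stored as extra
# model state -- the bit of bookkeeping I would rather avoid.
targets = np.array([95.0, 110.0, 102.5])
target_scale = targets.mean()

# usage: prediction_in_natural_units = denormalise(net(x), target_scale)
```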
I feel like I am reinventing the wheel here. Searching the internet for best practices for this problem does not yield much. Is there anything you can recommend?