I’ve read that GPT-2 and other transformers use layer normalization before the self-attention and feedforward blocks, but I am still unsure exactly how the normalization works.
Let’s say that our context size is 1024 tokens, the embedding size is 768 (so that each token and its subsequent hidden states are represented by vectors of size 768), and we are using multi-head attention with 12 heads. So in the diagram above, there are 1024 r’s and each r has dimensionality 768.
For a given layer in the transformer, how many normalization statistics (sample means and standard deviations) are computed? Do we normalize within each token for each head, giving 12×1024 normalizations, so that the feature values within each token have mean 0 and std 1? Or do we normalize each feature across tokens, giving 12×768 normalizations? Or do we normalize all the feature values for all the tokens together, giving 12 normalizations? And are separate statistics computed for each context in the minibatch?
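To make the first candidate scheme concrete, here is how I currently picture the per-token option in numpy. This is just a sketch of my assumption (random activations, one context, epsilon value made up), not GPT-2’s actual code:

```python
import numpy as np

# Hypothetical activations for one context: 1024 tokens x 768 features.
x = np.random.randn(1024, 768)

# Per-token scheme: one mean/stdev pair per token,
# each computed over that token's 768 feature values.
mean = x.mean(axis=-1, keepdims=True)  # shape (1024, 1) -> 1024 means
std = x.std(axis=-1, keepdims=True)    # shape (1024, 1) -> 1024 stdevs
x_norm = (x - mean) / (std + 1e-5)     # each row now has mean ~0, std ~1

print(mean.shape)  # (1024, 1)
```

Under this reading there would be 1024 (mean, stdev) pairs per normalization, before any multiplication by the number of heads or the number of LayerNorms per block.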
I’m also keen to understand intuitively why this normalization is desirable. Suppose the scheme is to normalize the feature values within each token, and suppose one of our tokens is a bland word like "ok" while another is the word "hatred". I would expect the representation of "hatred" to be spikier, with higher variance across its feature values. Why is it useful to throw away this information and force the representation of "ok" to be just as spiky? Conversely, if the scheme is to normalize each feature across tokens, so that feature 1 taken from all of the tokens in our context has mean 0 and stdev 1, doesn’t that throw away information when all of the words in our context are very negative, for example in the context "war violence hate fear"?
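Here is a toy numpy example of what I mean by the two schemes discarding different information. The specific vectors are invented just to contrast a "bland" token with a "spiky" one:

```python
import numpy as np

# Two made-up token vectors with 4 features: a "bland" one and a "spiky" one.
bland = np.array([0.1, -0.1, 0.05, -0.05])
spiky = np.array([3.0, -2.5, 4.0, -3.5])
x = np.stack([bland, spiky])  # shape: (2 tokens, 4 features)

# Scheme A: normalize within each token (over the feature axis).
a = (x - x.mean(axis=-1, keepdims=True)) / x.std(axis=-1, keepdims=True)

# Scheme B: normalize each feature across tokens (over the token axis).
b = (x - x.mean(axis=0, keepdims=True)) / x.std(axis=0, keepdims=True)

# Under scheme A, both tokens end up with identical per-token variance,
# so the "spikiness" difference between them is gone:
print(a.std(axis=-1))  # [1. 1.]
```

Scheme B instead removes any shift shared by all tokens in the context, which is the second case I’m asking about.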
Separately, with layer normalization it seems to be optional to re-scale the normalized values through learned gain and bias parameters. Does GPT-2 do this, or does it keep the values normalized to mean 0 and std 1?
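For clarity, this is the optional affine step I have in mind, again as a numpy sketch. The gain and bias values here are arbitrary stand-ins; in a real model they would be learned per feature:

```python
import numpy as np

x = np.random.randn(768)
x_norm = (x - x.mean()) / np.sqrt(x.var() + 1e-5)  # mean ~0, std ~1

# Hypothetical learned parameters: one gain and one bias per feature.
g = np.full(768, 2.0)  # stand-in value, not a real learned weight
b = np.full(768, 0.5)
y = g * x_norm + b

# With non-trivial g and b, the output is no longer mean 0 / std 1:
print(round(y.mean(), 2), round(y.std(), 2))  # 0.5 2.0
```

So my question is whether GPT-2 includes this `g * x_norm + b` step or stops at `x_norm`.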