I was watching a lecture by Andrew Ng on batch normalization. When discussing the inference (prediction) on a test it is said that an exponentially weighted average (EWA) of batch normalization parameters is used. My question is: why use exponentially weighted average instead of a "simple" average without any weights (or, to be precise, with equal weights)?
I intuit that:
- the latest batch is computed on weights being the closest to the final ones therefore we want it to influence the data at test time the most,
- at the same time we do not want to get rid significant part of data used for previous training, so we let them influence predictions but in smaller degree (smaller weights).