Triplet-based distance learning for face recognition (http://arxiv.org/abs/1503.03832) seems very effective. I’m curious about one particular aspect of the paper. When computing the embedding for a face, the authors apply L2 normalization to the output units, which constrains the representation to lie on a unit hypersphere. Why is that helpful or needed?
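
To make the step I'm asking about concrete, here is a minimal sketch of the normalization as I understand it. It is plain Python, and the vectors and function names are made up for illustration rather than taken from the paper.

```python
import math

def l2_normalize(v, eps=1e-12):
    """Scale v to unit Euclidean length; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical raw network outputs for two faces (illustrative values only).
raw_a = [3.0, 4.0, 0.0]
raw_b = [1.0, 2.0, 2.0]

a = l2_normalize(raw_a)
b = l2_normalize(raw_b)

# After normalization every embedding has unit length, so it lies on the
# unit hypersphere regardless of the scale of the raw outputs.
norm_a = math.sqrt(dot(a, a))
norm_b = math.sqrt(dot(b, b))

# For unit vectors, ||a - b||^2 = 2 - 2 * (a . b), so squared distances
# are confined to the range [0, 4].
d2 = squared_distance(a, b)
```

So after this step the squared distance between any two embeddings is bounded and depends only on the angle between them, not on the magnitude of the raw outputs.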

