The general intuition I have seen for KL divergence is that it measures the extra expected code length you pay when samples drawn from distribution $P$ are encoded with the optimal code for $Q$ instead of the optimal code for $P$.
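To make that intuition concrete, here is a small sketch (with made-up distributions `p` and `q` over three outcomes) showing that this "extra expected code length" is exactly the cross-entropy $H(P, Q)$ minus the entropy $H(P)$, which matches the usual formula $\sum_x p(x) \log_2 \frac{p(x)}{q(x)}$:

```python
import numpy as np

# Hypothetical discrete distributions over the same three outcomes.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

# Optimal average code length when encoding samples from P with P's code.
entropy_p = -np.sum(p * np.log2(p))

# Average code length when encoding samples from P with Q's optimal code.
cross_entropy_pq = -np.sum(p * np.log2(q))

# KL divergence = the extra bits paid for using the "wrong" code.
kl = cross_entropy_pq - entropy_p  # ≈ 0.036 bits

# Same value via the standard definition sum_x p(x) log2(p(x)/q(x)).
kl_direct = np.sum(p * np.log2(p / q))
```

Note the asymmetry: swapping `p` and `q` gives a different value, which is one immediate way KL divergence differs from a true distance.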
This makes sense as a general intuition for why it measures dissimilarity between two distributions, but there are many such measures. There must be some underlying assumptions in how KL divergence chooses to assign distance compared with the alternatives.
This seems fundamental to knowing when to use KL divergence. Is there a good intuition for how KL divergence differs from other similarity measures?