#StackBounty: #kullback-leibler #exponential-family KL Divergence, Bregman, and uniqueness

Bounty: 50

Banerjee, Arindam, et al. “Clustering with Bregman divergences.” Journal of Machine Learning Research 6 (2005): 1705-1749.

In Section 4 (pg 1720) the authors mention:

It has been observed in the literature that exponential families and
Bregman divergences have a close relationship that can be exploited
for several learning problems. In particular, Forster and Warmuth
(2000)[Section 5.1] remarked that the log-likelihood of the density of
an exponential family distribution $p_{(\psi,\theta)}$ can be written as the sum of
the negative of a uniquely determined Bregman divergence $d_\phi(x,\mu)$ and a
function that does not depend on the distribution parameters.
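
To see the decomposition concretely (my own illustration, not from the paper), take a 1-D Gaussian with known variance $\sigma^2$:

$$
\log p(x \mid \mu) = -\underbrace{\frac{1}{2\sigma^2}(x-\mu)^2}_{d_\phi(x,\mu)} - \underbrace{\tfrac{1}{2}\log\!\left(2\pi\sigma^2\right)}_{\text{independent of } \mu}
$$

so maximizing the log-likelihood over $\mu$ is exactly minimizing the squared-loss Bregman divergence listed in the table below.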

They later prove in Theorem 4 (pg 1721) that this correspondence is unique and one-to-one.

“From Theorem 4 we note that every regular exponential family
corresponds to a unique and distinct Bregman divergence (one-to-one
mapping)” (pg 1722)

Table 2 lists some distributions within the exponential family along with their unique Bregman divergences. A selected portion of that table is reproduced below (divergence names taken from Table 1):

$$
\begin{array}{l|l|l}
\text{Distribution} & d_\phi(x,\mu) & \text{Name} \\
\hline
\text{1-D Gaussian} & \frac{1}{2\sigma^2}(x-\mu)^2 & \text{Squared Loss} \\
\text{1-D Poisson} & x \log\left(\frac{x}{\mu}\right) - (x-\mu) & \\
\text{1-D Bernoulli} & x \log\left(\frac{x}{\mu}\right) + (1-x)\log\left(\frac{1-x}{1-\mu}\right) & \text{Logistic Loss} \\
\text{1-D Binomial} & x \log\left(\frac{x}{\mu}\right) + (N-x)\log\left(\frac{N-x}{N-\mu}\right) & \\
\text{1-D Exponential} & \frac{x}{\mu} - \log\left(\frac{x}{\mu}\right) - 1 & \text{Itakura-Saito distance} \\
\text{d-D Sph. Gaussian} & \frac{1}{2\sigma^2}\left\|x-\mu\right\|^2 & \\
\text{d-D Multinomial} & \sum_{j=1}^d x_j \log\left(\frac{x_j}{\mu_j}\right) & \text{KL-divergence} \\
\end{array}
$$
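
To convince myself, here is a small numerical sketch (my own, not from the paper) that computes $d_\phi$ directly from its generator $\phi$ and checks both the Bernoulli row above and the quoted log-likelihood decomposition; the particular values of `x` and `mu` are arbitrary:

```python
import numpy as np

def bregman(phi, grad_phi, x, mu):
    # Generic Bregman divergence: d_phi(x, mu) = phi(x) - phi(mu) - phi'(mu) * (x - mu)
    return phi(x) - phi(mu) - grad_phi(mu) * (x - mu)

# Bernoulli generator (negative entropy on the mean-parameter domain (0, 1))
phi_bern = lambda p: p * np.log(p) + (1 - p) * np.log(1 - p)
grad_bern = lambda p: np.log(p / (1 - p))

# x is kept strictly inside (0, 1) so every log is finite; the algebra is the same for x in {0, 1}
x, mu = 0.3, 0.7

d_generic = bregman(phi_bern, grad_bern, x, mu)
d_table = x * np.log(x / mu) + (1 - x) * np.log((1 - x) / (1 - mu))  # logistic-loss row of the table
print(d_generic, d_table)  # the two agree

# Log-likelihood = -d_phi(x, mu) + a term that does not depend on mu
loglik = x * np.log(mu) + (1 - x) * np.log(1 - mu)
const = x * np.log(x) + (1 - x) * np.log(1 - x)
print(loglik, -d_generic + const)  # the two agree
```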

Questions:

1. If every exponential family distribution has a unique Bregman divergence, is that divergence the optimal distance (divergence) measure to use for that specific distribution? (e.g., use Logistic Loss for the Bernoulli)

2. If yes to #1 above, why is KL-divergence used so often for comparing two distributions when, per the table, it is uniquely paired only with the multinomial? (This includes comparisons between members of the exponential family.)

For example, Wikipedia lists closed-form KL divergences between two members of the same distribution family, even for families within the exponential family.
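
As a concrete instance of what I mean (my own computation, using the standard closed form), the KL divergence between two 1-D Gaussians with a shared, known variance is

$$
D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu_1,\sigma^2)\,\big\|\,\mathcal{N}(\mu_2,\sigma^2)\right) = \frac{(\mu_1-\mu_2)^2}{2\sigma^2},
$$

which coincides with the squared-loss entry in the table above, even though the table pairs KL-divergence only with the multinomial.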