I have been reading the following paper on Bregman divergences:
Banerjee, Arindam, et al. “Clustering with Bregman divergences.” Journal of Machine Learning Research 6.Oct (2005): 1705-1749.
In Section 4 (pg 1720), the authors mention:
“It has been observed in the literature that exponential families and
Bregman divergences have a close relationship that can be exploited
for several learning problems. In particular, Forster and Warmuth
(2000)[Section 5.1] remarked that the log-likelihood of the density of
an exponential family distribution $p_{(\psi,\theta)}$ can be written as the sum of
the negative of a uniquely determined Bregman divergence $d_\phi(x,\mu)$ and a
function that does not depend on the distribution parameters.”
They later prove in Theorem 4 (pg 1721) that this correspondence is unique and one-to-one:
“From Theorem 4 we note that every regular exponential family
corresponds to a unique and distinct Bregman divergence (one-to-one
mapping)” (pg 1722)
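To make the quoted decomposition concrete, here is how I read it for two of the families. This is my own working rather than something copied from the paper, and in the Gaussian case it assumes $\sigma$ is a fixed, known constant:

$$
\begin{aligned}
\text{Gaussian:}\quad \log p(x) &= -\frac{(x-\mu)^2}{2\sigma^2} \;-\; \frac{1}{2}\log\left(2\pi\sigma^2\right) \\
\text{Poisson:}\quad \log p(x) &= x\log\mu - \mu - \log x! \\
&= -\left[\, x\log\left(\frac{x}{\mu}\right) - (x-\mu) \right] \;+\; \left( x\log x - x - \log x! \right)
\end{aligned}
$$

In both lines the first term is the negative of a divergence in $x$ and $\mu$, and the remaining term depends on $x$ (and the fixed $\sigma$) but not on $\mu$.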
Table 2 lists some distributions within the exponential family along with their unique Bregman divergences. A selected portion of that table is copied below (names of divergences taken from Table 1):
$$
\begin{array}{lll}
\text{Distribution} & d_\phi(x,\mu) & \text{Name} \\
\hline
\text{1-D Gaussian} & \frac{1}{2\sigma^2} (x-\mu)^2 & \text{Squared loss} \\
\text{1-D Poisson} & x \log\left( \frac{x}{\mu} \right) - (x-\mu) & \\
\text{1-D Bernoulli} & x \log\left( \frac{x}{\mu} \right) + (1-x) \log\left( \frac{1-x}{1-\mu} \right) & \text{Logistic loss} \\
\text{1-D Binomial} & x \log\left( \frac{x}{\mu} \right) + (N-x) \log\left( \frac{N-x}{N-\mu} \right) & \\
\text{1-D Exponential} & \frac{x}{\mu} - \log\left( \frac{x}{\mu} \right) - 1 & \text{Itakura-Saito distance} \\
\text{$d$-D Sph. Gaussian} & \frac{1}{2\sigma^2} \lVert x-\mu \rVert^2 & \\
\text{$d$-D Multinomial} & \sum_{j=1}^d x_j \log\left( \frac{x_j}{\mu_j} \right) & \text{KL-divergence}
\end{array}
$$
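As a sanity check on the scalar rows, here is a minimal Python sketch that evaluates the generic definition $d_\phi(x,\mu) = \phi(x) - \phi(\mu) - (x-\mu)\,\phi'(\mu)$ and compares it with the closed forms. The generators $\phi$ are the ones I believe correspond to each row, `SIGMA` is an arbitrary value chosen for illustration, and the names `bregman`, `ROWS` and `compare` are my own. The Bernoulli closed form is taken as $x\log(x/\mu) + (1-x)\log\left((1-x)/(1-\mu)\right)$, which is what the definition yields for that generator.

```python
import math

def bregman(phi, grad_phi, x, mu):
    """Generic scalar Bregman divergence: phi(x) - phi(mu) - (x - mu) * phi'(mu)."""
    return phi(x) - phi(mu) - (x - mu) * grad_phi(mu)

SIGMA = 1.5  # hypothetical fixed standard deviation for the Gaussian row

# name -> (generator phi, derivative of phi, closed form d_phi(x, mu))
ROWS = {
    "gaussian": (
        lambda t: t * t / (2 * SIGMA**2),
        lambda t: t / SIGMA**2,
        lambda x, mu: (x - mu) ** 2 / (2 * SIGMA**2),
    ),
    "poisson": (
        lambda t: t * math.log(t) - t,
        lambda t: math.log(t),
        lambda x, mu: x * math.log(x / mu) - (x - mu),
    ),
    "bernoulli": (
        lambda t: t * math.log(t) + (1 - t) * math.log(1 - t),
        lambda t: math.log(t / (1 - t)),
        lambda x, mu: x * math.log(x / mu)
        + (1 - x) * math.log((1 - x) / (1 - mu)),
    ),
    "exponential": (
        lambda t: -math.log(t) - 1,
        lambda t: -1 / t,
        lambda x, mu: x / mu - math.log(x / mu) - 1,
    ),
}

def compare(name, x, mu):
    """Return (value from the generic definition, value from the closed form)."""
    phi, grad_phi, closed_form = ROWS[name]
    return bregman(phi, grad_phi, x, mu), closed_form(x, mu)

if __name__ == "__main__":
    # x and mu are kept inside (0, 1) so that every generator is defined.
    for name in ROWS:
        generic, table = compare(name, 0.3, 0.6)
        print(f"{name:12s} generic={generic:.6f} closed_form={table:.6f}")
```

The two columns should agree up to floating-point error for each row. The check covers only the four scalar rows listed in `ROWS`, not the binomial or the $d$-dimensional entries.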
Questions:
1. If every exponential family distribution has a unique Bregman divergence, then is that the optimal distance (divergence) metric to use for that distribution (e.g. use logistic loss for Bernoulli)?
2. If yes to #1 above, why is KL-divergence used so often to compare two distributions when it is unique only to multinomials (including comparisons within the exponential family)?

   For example, Wikipedia lists the KL divergence between two members of the same distribution family, even within the exponential family:

   - $d$-D Gaussian
   - 1-D Poisson
   - 1-D Exponential
   - (ironically, KL is not included in the multinomial article)
3. Is there a theoretical justification for using KL-divergence between those distributions, although they may have a different Bregman divergence and KL is unique to just multinomials? It seems that if #1 is true, then the optimal divergence for the Exponential would be the Itakura-Saito distance, and so on.

If #1 is false, then when is it proper to use the Bregman divergence of that distribution rather than KL-divergence (or others), which may appear when comparing two members of the same distribution family? Does KL have a stronger theoretical justification for use across distributions, even though it is a special case of Bregman divergence unique to multinomials?
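A small numerical experiment relevant to questions 2 and 3 is sketched below. It is my own check, not something taken from the paper, and all function names are hypothetical. For the Poisson and Exponential families it computes the KL divergence between two members with means $\mu_1$ and $\mu_2$ directly from the densities (by summation and by a simple midpoint-rule integral, both truncated), and compares it with that family's Bregman divergence from Table 2 evaluated at $(\mu_1, \mu_2)$.

```python
import math

def bregman_poisson(x, mu):
    """Poisson row of the table: x log(x/mu) - (x - mu)."""
    return x * math.log(x / mu) - (x - mu)

def itakura_saito(x, mu):
    """Exponential row of the table: x/mu - log(x/mu) - 1."""
    return x / mu - math.log(x / mu) - 1

def kl_poisson(mu1, mu2, k_max=200):
    """KL(Poisson(mu1) || Poisson(mu2)) by direct summation over k = 0..k_max."""
    total = 0.0
    for k in range(k_max + 1):
        log_p1 = k * math.log(mu1) - mu1 - math.lgamma(k + 1)
        log_p2 = k * math.log(mu2) - mu2 - math.lgamma(k + 1)
        total += math.exp(log_p1) * (log_p1 - log_p2)
    return total

def kl_exponential(mu1, mu2, steps=200_000):
    """KL between exponentials with means mu1, mu2 (midpoint rule on [0, 50*mu1])."""
    upper = 50.0 * mu1  # truncation point; the tail mass beyond it is negligible
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        log_p1 = -x / mu1 - math.log(mu1)
        log_p2 = -x / mu2 - math.log(mu2)
        total += math.exp(log_p1) * (log_p1 - log_p2) * h
    return total

if __name__ == "__main__":
    print("Poisson:    ", kl_poisson(3.0, 5.0), bregman_poisson(3.0, 5.0))
    print("Exponential:", kl_exponential(2.0, 5.0), itakura_saito(2.0, 5.0))
```

In these two cases the KL divergence between the distributions coincides numerically with the family's own Bregman divergence between the means. The sketch does not show whether this extends to the other families, or what it implies about optimality.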