#StackBounty: #kullback-leibler #exponential-family KL Divergence, Bregman, and uniqueness

Bounty: 50

I have been reading the following paper on Bregman divergences:

Banerjee, Arindam, et al. “Clustering with Bregman divergences.” Journal of machine learning research 6.Oct (2005): 1705-1749.

In Section 4 (pg 1720), the authors write:

“It has been observed in the literature that exponential families and
Bregman divergences have a close relationship that can be exploited
for several learning problems. In particular, Forster and Warmuth
(2000)[Section 5.1] remarked that the log-likelihood of the density of
an exponential family distribution $p_{(\psi,\theta)}$ can be written as the sum of
the negative of a uniquely determined Bregman divergence $d_\phi(x,\mu)$ and a
function that does not depend on the distribution parameters.”

They later prove in Theorem 4 (pg 1721) that this Bregman divergence is unique and that the correspondence is one-to-one:

“From Theorem 4 we note that every regular exponential family
corresponds to a unique and distinct Bregman divergence (one-to-one
mapping)” (pg 1722)
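
For reference, here is how I read the statement of Theorem 4, written out as an equation. The notation follows the paper: $\psi$ is the log-partition function, $\phi$ is its Legendre conjugate, $\mu = \nabla\psi(\theta)$ is the expectation parameter, and $b_\phi$ is a function that does not depend on $\theta$. Treat it as my paraphrase rather than a verbatim copy.

$$
d_\phi(x,\mu) = \phi(x) - \phi(\mu) - \langle x - \mu, \nabla\phi(\mu) \rangle
$$

$$
p_{(\psi,\theta)}(x) = \exp\big(-d_\phi(x,\mu)\big)\, b_\phi(x)
\quad\Longrightarrow\quad
\log p_{(\psi,\theta)}(x) = -d_\phi(x,\mu) + \log b_\phi(x)
$$

As a one-line check for the 1-D Poisson case, taking $\phi(x) = x\log x - x$ gives $\nabla\phi(\mu) = \log\mu$, so

$$
d_\phi(x,\mu) = (x\log x - x) - (\mu\log\mu - \mu) - (x-\mu)\log\mu = x\log\left(\frac{x}{\mu}\right) - (x-\mu).
$$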

Table 2 lists some distributions within the exponential family along with their unique Bregman divergences. A select portion of that table is copied below (names of divergences taken from Table 1):

$$
\begin{array}{lll}
\text{Distribution} & d_\phi(x,\mu) & \text{Name} \\
\hline
\text{1-D Gaussian} & \frac{1}{2\sigma^2}(x-\mu)^2 & \text{Squared Loss} \\
\text{1-D Poisson} & x \log\left(\frac{x}{\mu}\right) - (x-\mu) & \\
\text{1-D Bernoulli} & x \log\left(\frac{x}{\mu}\right) + (1-x) \log\left(\frac{1-x}{1-\mu}\right) & \text{Logistic Loss} \\
\text{1-D Binomial} & x \log\left(\frac{x}{\mu}\right) + (N-x) \log\left(\frac{N-x}{N-\mu}\right) & \\
\text{1-D Exponential} & \frac{x}{\mu} - \log\left(\frac{x}{\mu}\right) - 1 & \text{Itakura-Saito distance} \\
\text{d-D Sph Gaussian} & \frac{1}{2\sigma^2}\|x-\mu\|^2 & \\
\text{d-D Multinomial} & \sum_{j=1}^d x_j \log\left(\frac{x_j}{\mu_j}\right) & \text{KL-divergence} \\
\end{array}
$$
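
To convince myself that these closed forms really come from the definition $d_\phi(x,\mu) = \phi(x) - \phi(\mu) - (x-\mu)\,\phi'(\mu)$, I put together a minimal sketch that compares the two for the 1-D rows. The generators $\phi$ are the ones I understand Table 2 of the paper to list. The names `bregman`, `CASES`, `max_gap`, `SIGMA`, and `POINTS` are my own, and the value of `SIGMA` and the test points are arbitrary assumptions.

```python
import math


def bregman(phi, grad_phi, x, mu):
    """Scalar Bregman divergence: phi(x) - phi(mu) - (x - mu) * phi'(mu)."""
    return phi(x) - phi(mu) - (x - mu) * grad_phi(mu)


SIGMA = 1.5  # assumed standard deviation for the Gaussian row

# name -> (generator phi, derivative of phi, closed form from the table)
CASES = {
    "gaussian": (
        lambda t: t * t / (2 * SIGMA**2),
        lambda t: t / SIGMA**2,
        lambda x, mu: (x - mu) ** 2 / (2 * SIGMA**2),
    ),
    "poisson": (
        lambda t: t * math.log(t) - t,
        lambda t: math.log(t),
        lambda x, mu: x * math.log(x / mu) - (x - mu),
    ),
    "bernoulli": (
        lambda t: t * math.log(t) + (1 - t) * math.log(1 - t),
        lambda t: math.log(t / (1 - t)),
        # note the plus sign between the two terms
        lambda x, mu: x * math.log(x / mu)
        + (1 - x) * math.log((1 - x) / (1 - mu)),
    ),
    "exponential": (
        lambda t: -math.log(t) - 1,
        lambda t: -1 / t,
        lambda x, mu: x / mu - math.log(x / mu) - 1,
    ),
}


def max_gap(name, points):
    """Largest absolute gap between the definition and the closed form."""
    phi, grad_phi, closed_form = CASES[name]
    return max(
        abs(bregman(phi, grad_phi, x, mu) - closed_form(x, mu))
        for x, mu in points
    )


# Points kept inside (0, 1) so that every generator above is defined.
POINTS = [(0.2, 0.7), (0.5, 0.5), (0.9, 0.3)]

for name in CASES:
    print(name, max_gap(name, POINTS))
```

The gaps should all be at the level of floating-point rounding. This only checks the algebra; it says nothing about which divergence is "optimal".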


  1. If every exponential family distribution has a unique Bregman divergence, is that divergence the optimal distance measure to use for that distribution (e.g. Logistic Loss for Bernoulli)?

  2. If the answer to #1 is yes, why is KL-divergence used so often to compare two distributions, even within the exponential family, when it is the unique divergence only for multinomials?

For example, Wikipedia lists the KL divergence between two members of the same family of distributions, even for families within the exponential family:

  • d-D Gaussian
  • 1-D Poisson
  • 1-D Exponential
  • (ironically, KL is not included in the multinomial article)

  3. Is there a theoretical justification for using KL-divergence between those distributions, even though they may have a different Bregman divergence and KL is unique to just multinomials? It seems that if #1 is true, then the optimal divergence for the Exponential would be the Itakura-Saito distance, etc.

  4. If #1 is false, when is it proper to use the Bregman divergence of a given distribution rather than KL-divergence (or others) when comparing two members of the same family? Does KL have a stronger theoretical justification for use across distributions, although it is a special case of Bregman divergence unique to multinomials?
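
To make the comparison concrete, here is a minimal numerical sketch (my own, not from the paper or Wikipedia). It computes the KL divergence between two Poisson distributions by direct summation, and between two exponential densities by a simple midpoint rule, then compares each with the corresponding table divergence evaluated at the two means. The function names, the parameter values, and the truncation choices (`k_max`, `steps`, and the factor `60.0`) are all assumptions made for illustration.

```python
import math


def poisson_table_divergence(x, mu):
    """Poisson row of the table: x log(x/mu) - (x - mu)."""
    return x * math.log(x / mu) - (x - mu)


def itakura_saito(x, mu):
    """Exponential row of the table: x/mu - log(x/mu) - 1."""
    return x / mu - math.log(x / mu) - 1


def kl_poisson(lam1, lam2, k_max=200):
    """KL(Poisson(lam1) || Poisson(lam2)) by summing over k = 0..k_max."""
    total = 0.0
    for k in range(k_max + 1):
        log_p1 = k * math.log(lam1) - lam1 - math.lgamma(k + 1)
        log_p2 = k * math.log(lam2) - lam2 - math.lgamma(k + 1)
        total += math.exp(log_p1) * (log_p1 - log_p2)
    return total


def kl_exponential(mean1, mean2, steps=200_000):
    """KL between exponential densities with the given means (midpoint rule)."""
    upper = 60.0 * mean1  # truncation point; tail mass beyond it is about exp(-60)
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        log_p1 = -math.log(mean1) - x / mean1
        log_p2 = -math.log(mean2) - x / mean2
        total += math.exp(log_p1) * (log_p1 - log_p2) * h
    return total


# Each line prints the numerical KL next to the table divergence of the means.
print(kl_poisson(3.0, 5.0), poisson_table_divergence(3.0, 5.0))
print(kl_exponential(2.0, 0.5), itakura_saito(2.0, 0.5))
```

In both cases the two printed numbers should agree up to numerical error, which suggests that for these families the KL divergence between two members has the same functional form as the family's own Bregman divergence applied to the mean parameters. This is only a check on two families at two parameter settings, not a general argument.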
