*Bounty: 50*

In a Variational Autoencoder (VAE), given some data $x$ and latent variables $t$ with prior distribution $p(t) = \mathcal{N}(t \mid 0, I)$, the encoder aims to learn a distribution $q_{\phi}(t)$ that approximates the true posterior $p(t|x)$, and the decoder aims to learn a distribution $p_{\theta}(x|t)$ that approximates the true underlying distribution $p^*(x|t)$.

These models are then trained jointly to maximize an objective $L(\phi, \theta)$, which is a lower bound on the log-likelihood of the training set:

$L(\phi, \theta) = \sum_i \mathbb{E}_{q_{\phi}} \log \frac{p_{\theta}(x_i|t)\, p(t)}{q_{\phi}(t)} \leq \sum_i \log \int p_{\theta}(x_i|t)\, p(t)\, dt$
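For concreteness, a single-sample Monte Carlo estimate of the term inside the sum can be sketched as follows (a minimal NumPy sketch under diagonal-Gaussian assumptions for both encoder and decoder; the function names and the `decode` callback are illustrative, not from the paper):

```python
import numpy as np

def log_normal(x, mu, var):
    """Log-density of a diagonal Gaussian N(x | mu, diag(var))."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def elbo_single_sample(x, enc_mu, enc_var, decode, rng):
    """One-sample Monte Carlo estimate of E_q[log p(x|t) p(t) / q(t)].

    enc_mu, enc_var: parameters of q_phi(t) (diagonal Gaussian)
    decode: maps a latent sample t to (dec_mu, dec_var) of p_theta(x|t)
    """
    # Reparameterized sample t ~ q_phi(t)
    t = enc_mu + np.sqrt(enc_var) * rng.standard_normal(enc_mu.shape)
    dec_mu, dec_var = decode(t)
    log_p_x_given_t = log_normal(x, dec_mu, dec_var)
    log_prior = log_normal(t, np.zeros_like(t), np.ones_like(t))  # p(t) = N(0, I)
    log_q = log_normal(t, enc_mu, enc_var)
    return log_p_x_given_t + log_prior - log_q
```

Averaging this estimate over samples and summing over the $x_i$ gives the training objective.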

According to section C.2 of the original paper by Kingma and Welling (https://arxiv.org/pdf/1312.6114.pdf), when we model $p_{\theta}(x|t)$ as a family of Gaussians, the decoder should output both the mean $\mu(t)$ and the (diagonal) covariance $\sigma^2(t) I$ of the Gaussian distribution.

**My question is: isn't this optimization problem ill-posed** (just like maximum-likelihood training in GMMs)? Since the decoder has an output for the variance (or log-variance, as is most common), if it can produce a perfect reconstruction for a single image in the training set (i.e. $\mu(t_i) = x_i$), then it can set the corresponding variance $\sigma^2(t_i)$ arbitrarily close to zero, and the likelihood goes to infinity regardless of what happens with the remaining training examples.
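The divergence is easy to verify numerically: with a perfect reconstruction the squared-error term vanishes, so the per-dimension Gaussian log-likelihood reduces to $-\frac{1}{2}\log(2\pi\sigma^2)$, which grows without bound as $\sigma^2 \to 0$ (a minimal sketch of my own, not code from the paper):

```python
import math

def gaussian_log_likelihood(x, mu, var):
    """Log-density of a 1-D Gaussian N(x | mu, var)."""
    sq_err = (x - mu) ** 2  # 0 when the reconstruction is perfect
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * sq_err / var

# Perfect reconstruction: mu(t_i) = x_i, so the squared error is zero
# and the log-likelihood diverges as the predicted variance shrinks.
for var in (1.0, 1e-2, 1e-6, 1e-12):
    print(var, gaussian_log_likelihood(x=0.5, mu=0.5, var=var))
```

Each shrinking of `var` strictly increases the log-likelihood of that one example, which is exactly the GMM-style degeneracy described above.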

I know that most Gaussian VAE implementations use a simplified decoder that outputs the mean only, replacing the term $\mathbb{E}_{q_{\phi}} \log p_{\theta}(x_i|t)$ with the squared error between the original image and the reconstruction (which is equivalent to fixing the covariance to the identity matrix). Is this because of the ill-posedness of the original formulation?
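For reference, the equivalence mentioned above can be checked directly: with the covariance fixed to $I$, the negative log-likelihood is the squared error plus a constant that does not depend on the mean, so minimizing one minimizes the other (a small NumPy sketch; the function name is mine):

```python
import numpy as np

def neg_log_likelihood_identity_cov(x, mu):
    """-log N(x | mu, I): 0.5 * ||x - mu||^2 plus a mu-independent constant."""
    d = x.size
    return 0.5 * np.sum((x - mu) ** 2) + 0.5 * d * np.log(2 * np.pi)

x = np.array([0.2, -0.7, 1.5])
for mu in (x, x + 0.1, np.zeros_like(x)):
    nll = neg_log_likelihood_identity_cov(x, mu)
    sq_err_term = 0.5 * np.sum((x - mu) ** 2)
    # The gap is the same constant 0.5 * d * log(2*pi) for every mu.
    print(nll - sq_err_term)
```

Because the constant never changes with `mu`, the variance-collapse pathology cannot occur in this simplified objective.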