#StackBounty: #expectation-maximization #marginal #fisher-information Does marginalization of some of the latent variables improve convergence?

Bounty: 50

Given a log-likelihood to maximize,
$$
\log p(x \mid \theta)
$$

Imagine that, in order to apply EM, we can augment the model with either one latent variable $z_1$ or two latent variables $(z_1, z_2)$. In that case, we can derive two lower bounds:

$$
\log p(x \mid \theta) = \log \int_{z_1} p(x, z_1 \mid \theta)
\geq
\int_{z_1} \log\left\lbrace
\frac{p(x, z_1 \mid \theta)}{p(z_1 \mid x, \theta)}
\right\rbrace p(z_1 \mid x, \theta) = \mathcal{L}_1
$$

or

$$
\log p(x \mid \theta) = \log \int_{z_1, z_2} p(x, z_1, z_2 \mid \theta)
\geq
\int_{z_1, z_2} \log\left\lbrace
\frac{p(x, z_1, z_2 \mid \theta)}{p(z_1, z_2 \mid x, \theta)}
\right\rbrace p(z_1, z_2 \mid x, \theta) = \mathcal{L}_2
$$

Is there any reason why the lower bound of the first approach should be better in terms of speed of convergence, or any other property?
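(As a point of reference, not part of the question as originally posed: each bound is an instance of the standard evidence lower bound. Replacing the exact posterior by a generic distribution $q$ gives
$$
\mathcal{L}_1 = \log p(x \mid \theta) - \mathrm{KL}\big(q(z_1) \,\|\, p(z_1 \mid x, \theta)\big),
\qquad
\mathcal{L}_2 = \log p(x \mid \theta) - \mathrm{KL}\big(q(z_1, z_2) \,\|\, p(z_1, z_2 \mid x, \theta)\big),
$$
so in an EM iteration, where $q$ is the posterior at the previous iterate $\theta'$, the gap between each bound and the log-likelihood at a candidate $\theta$ is a KL divergence between posteriors at $\theta'$ and at $\theta$.)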


I think I have a proof that $\mathcal{L}_1 \geq \mathcal{L}_2$. If this is true, I would only need to show that this makes $\mathcal{L}_1$ faster to converge:

The first lower bound is
\begin{align}
\mathcal{L}_1 = \mathbb{E}_{z_1 \mid x, \theta}[\log p(x, z_1 \mid \theta)]
- \mathbb{E}_{z_1 \mid x, \theta}[\log p(z_1 \mid x, \theta)]
\end{align}

The second lower bound is
\begin{align}
\mathcal{L}_2 = \mathbb{E}_{z_1, z_2 \mid x, \theta}[\log p(x, z_1, z_2 \mid \theta)]
- \mathbb{E}_{z_1, z_2 \mid x, \theta}[\log p(z_1, z_2 \mid x, \theta)]
\end{align}

Now we will show that $\mathcal{L}_1(\theta) \geq \mathcal{L}_2(\theta)$:

\begin{align}
\mathcal{L}_1
&= \mathbb{E}_{z_1 \mid x, \theta}\left[\log \mathbb{E}_{z_2 \mid z_1, x, \theta}\frac{p(x, z_1, z_2 \mid \theta)}{p(z_2 \mid z_1, x, \theta)}\right]
- \mathbb{E}_{z_1 \mid x, \theta}[\log p(z_1 \mid x, \theta)]\\
&\geq
\mathbb{E}_{z_1 \mid x, \theta}[\mathbb{E}_{z_2 \mid z_1, x, \theta}[\log p(x, z_1, z_2 \mid \theta)]]
- \mathbb{E}_{z_1 \mid x, \theta}[\mathbb{E}_{z_2 \mid z_1, x, \theta}[\log p(z_2 \mid z_1, x, \theta)]]
- \mathbb{E}_{z_1 \mid x, \theta}[\log p(z_1 \mid x, \theta)]\\
&=
\mathbb{E}_{z_1, z_2 \mid x, \theta}[\log p(x, z_1, z_2 \mid \theta)]
- \mathbb{E}_{z_1, z_2 \mid x, \theta}[\log p(z_2 \mid z_1, x, \theta)]
- \mathbb{E}_{z_1 \mid x, \theta}[\log p(z_1 \mid x, \theta)]\\
&=
\mathbb{E}_{z_1, z_2 \mid x, \theta}[\log p(x, z_1, z_2 \mid \theta)]
- \mathbb{E}_{z_1, z_2 \mid x, \theta}[\log p(z_1, z_2 \mid x, \theta)]\\
&= \mathcal{L}_2,
\end{align}

where the first line uses $p(x, z_1 \mid \theta) = \mathbb{E}_{z_2 \mid z_1, x, \theta}[p(x, z_1, z_2 \mid \theta)/p(z_2 \mid z_1, x, \theta)]$, the inequality is Jensen's inequality applied to the concave $\log$ inside the outer expectation, and the last step combines $\log p(z_2 \mid z_1, x, \theta) + \log p(z_1 \mid x, \theta) = \log p(z_1, z_2 \mid x, \theta)$.
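To sanity-check the inequality numerically, here is a minimal Python sketch on a made-up fully discrete model (binary $x$, $z_1$, $z_2$; the parameterization $p(z_1 = 1) = \theta$ and the tables `P_Z2_GIVEN_Z1` and `P_X_GIVEN_Z` are invented for illustration, not taken from the question). It adopts the EM reading in which the posteriors are taken at the current iterate $\theta'$ and both bounds are evaluated at a candidate $\theta$:

```python
import math

# Toy model (all numbers invented for illustration):
#   p(z1 = 1) = theta,  p(z2 | z1) fixed,  p(x = 1 | z1, z2) fixed.
P_Z2_GIVEN_Z1 = {0: 0.3, 1: 0.7}          # p(z2 = 1 | z1)
P_X_GIVEN_Z = {(0, 0): 0.1, (0, 1): 0.4,  # p(x = 1 | z1, z2)
               (1, 0): 0.6, (1, 1): 0.9}

def joint(theta):
    """p(x = 1, z1, z2 | theta) as a dict keyed by (z1, z2)."""
    out = {}
    for z1 in (0, 1):
        p_z1 = theta if z1 == 1 else 1 - theta
        for z2 in (0, 1):
            p_z2 = P_Z2_GIVEN_Z1[z1] if z2 == 1 else 1 - P_Z2_GIVEN_Z1[z1]
            out[(z1, z2)] = p_z1 * p_z2 * P_X_GIVEN_Z[(z1, z2)]
    return out

def bounds(theta, theta_old):
    """Return (log p(x | theta), L1, L2), with posteriors taken at theta_old."""
    j_new, j_old = joint(theta), joint(theta_old)
    px_new, px_old = sum(j_new.values()), sum(j_old.values())
    # Exact posteriors at theta_old.
    post2 = {k: v / px_old for k, v in j_old.items()}               # p(z1, z2 | x, theta_old)
    post1 = {z1: post2[(z1, 0)] + post2[(z1, 1)] for z1 in (0, 1)}  # p(z1 | x, theta_old)
    # L1 uses the marginal completed likelihood p(x, z1 | theta).
    marg_new = {z1: j_new[(z1, 0)] + j_new[(z1, 1)] for z1 in (0, 1)}
    L1 = sum(post1[z1] * math.log(marg_new[z1] / post1[z1]) for z1 in (0, 1))
    L2 = sum(post2[k] * math.log(j_new[k] / post2[k]) for k in j_new)
    return math.log(px_new), L1, L2

for theta in (0.2, 0.4, 0.6, 0.8):
    ll, L1, L2 = bounds(theta, theta_old=0.4)
    print(f"theta={theta:.1f}  log p(x)={ll:+.4f}  L1={L1:+.4f}  L2={L2:+.4f}")
    assert ll >= L1 - 1e-12 and L1 >= L2 - 1e-12
```

On this toy model the printed values satisfy $\log p(x \mid \theta) \geq \mathcal{L}_1 \geq \mathcal{L}_2$, with equality at $\theta = \theta' = 0.4$. Note that if the posteriors were instead taken at the same $\theta$ at which the bounds are evaluated, both bounds would coincide with $\log p(x \mid \theta)$; this is consistent with the derivation above, since the Jensen step is then an equality (the ratio inside the inner expectation equals $p(x, z_1 \mid \theta)$, which is constant in $z_2$).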

I don't see why this should make $\mathcal{L}_1$ faster to converge than $\mathcal{L}_2$, but maybe it has been proven for some cases, such as the exponential family?

