*Bounty: 50*

The Black Box VI paper introduces Rao-Blackwellization (in Section 3.1) as a method to reduce the variance of the score-function gradient estimator.

However, I don’t quite get the basic idea behind those formulas. Any hints or help would be appreciated!

**UPDATE**

To make this question more self-contained, I’ll try to put in more details (also some thoughts of my own).

Suppose I have a 2D Gaussian dataset $X \sim N(\mu, P^{-1})$ whose mean is known to be $\mu = (0,0)$ but whose precision matrix $P$ is unknown. I want to estimate $P$ using variational inference, which means finding a variational distribution $q(P)$ that approximates the true (unknown) posterior $p(P|X)$ by minimizing the KL divergence $KL(q\|p)$. Minimizing this KL divergence is equivalent to maximizing a proxy objective, the ELBO, which is

$$L_{ELBO} = E_{q(P)}[\log p(X,P) - \log q(P)]$$

and in my problem we have

$$
\begin{align*}
p(X|P) \sim N(0,P^{-1}); & \qquad \text{likelihood as Gaussian} \\
p(P) \sim W(d_0,S_0); & \qquad \text{prior for P as Wishart} \\
q(P) \sim W(d,S); & \qquad \text{variational distribution for P as Wishart}
\end{align*}
$$

Now the problem comes down to

*optimizing $L_{ELBO}$ to find the best variational parameters of $q(P)$, i.e. $d,S$*.
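To make the setup concrete, here is a minimal Python sketch of a Monte Carlo estimate of the ELBO for this model. It assumes NumPy and SciPy are available and that SciPy's `wishart(df, scale)` parameterization matches $W(d,S)$ above; the function names `log_joint` and `elbo_terms`, the synthetic data, and the starting values are made up for illustration.

```python
import numpy as np
from scipy.stats import wishart


def log_joint(X, P, d0, S0):
    """log p(X, P) = sum_n log N(x_n | 0, P^{-1}) + log W(P | d0, S0)."""
    n, p = X.shape
    _, logdet_P = np.linalg.slogdet(P)
    quad = np.einsum("ni,ij,nj->", X, P, X)  # sum_n x_n^T P x_n
    loglik = 0.5 * n * logdet_P - 0.5 * quad - 0.5 * n * p * np.log(2.0 * np.pi)
    return loglik + wishart.logpdf(P, df=d0, scale=S0)


def elbo_terms(X, d, S, d0, S0, n_samples=100, seed=0):
    """Per-sample values of log p(X, P_i) - log q(P_i) with P_i ~ W(d, S)."""
    p = S.shape[0]
    draws = wishart.rvs(df=d, scale=S, size=n_samples, random_state=seed)
    draws = np.reshape(draws, (-1, p, p))
    return np.array([
        log_joint(X, P, d0, S0) - wishart.logpdf(P, df=d, scale=S)
        for P in draws
    ])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 2))       # synthetic zero-mean 2D data (illustrative)
    d0, S0 = 3.0, np.eye(2)            # prior parameters (illustrative)
    d, S = 5.0, np.eye(2) / 5.0        # initial variational parameters (illustrative)
    terms = elbo_terms(X, d, S, d0, S0)
    print("ELBO estimate:", terms.mean())
```

Averaging the returned terms gives the ELBO estimate; keeping them per sample also makes it easy to look at their spread.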

We compute the gradient of the ELBO with respect to $d$ and $S$ so that we can run gradient ascent to maximize $L$. Here is the general formula for the gradient of the ELBO with respect to the variational parameters (see detail of derivation):

$$\nabla_{\lambda}L = E_{q}[\nabla_{\lambda}\log q(P|\lambda)\cdot(\log p(X,P)-\log q(P|\lambda))]$$

Here, $\lambda$ is shorthand for the variational parameters.
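As far as I understand, the formula follows from the log-derivative trick. A sketch, assuming the gradient and the integral can be exchanged, and writing $f_\lambda(P) = \log p(X,P) - \log q(P|\lambda)$ (a shorthand introduced only for this derivation):

$$
\begin{align*}
\nabla_{\lambda} L &= \nabla_{\lambda} \int q(P|\lambda)\, f_\lambda(P)\, dP \\
&= \int \nabla_{\lambda} q(P|\lambda)\, f_\lambda(P)\, dP + \int q(P|\lambda)\, \nabla_{\lambda} f_\lambda(P)\, dP \\
&= E_{q}[\nabla_{\lambda}\log q(P|\lambda)\cdot f_\lambda(P)] - E_{q}[\nabla_{\lambda}\log q(P|\lambda)]
\end{align*}
$$

The first term uses $\nabla_{\lambda} q = q\,\nabla_{\lambda}\log q$. The second term vanishes because $E_{q}[\nabla_{\lambda}\log q(P|\lambda)] = \int \nabla_{\lambda} q(P|\lambda)\, dP = \nabla_{\lambda} 1 = 0$.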

Given this gradient formula, we repeat the following until convergence: draw samples of $P$ from $q(P|\lambda)$, evaluate the term inside the expectation for each sample, average the results to get a noisy estimate of the true gradient, and take a gradient ascent step on the variational parameters. That is,

$$\nabla_{\lambda}L \approx \frac{1}{n_{\text{sample}}} \sum_{i=1}^{n_{\text{sample}}} \left[\nabla_{\lambda}\log q(P_i|\lambda)\cdot(\log p(X,P_i)-\log q(P_i|\lambda))\right]$$

This noisy estimate can have high variance, so here finally is my question: how do we use Rao-Blackwellization to reduce the variance?
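For concreteness, the sample-and-average estimator described above can be sketched in Python as follows. This is my own illustration, not code from the paper: it assumes NumPy and SciPy, uses the closed-form Wishart score, and treats $S$ as an unconstrained matrix (the symmetry and positive-definiteness constraints are ignored). The names `score_wishart` and `score_gradient` and all numeric values are hypothetical.

```python
import numpy as np
from scipy.special import digamma
from scipy.stats import wishart


def log_joint(X, P, d0, S0):
    """log p(X, P) = sum_n log N(x_n | 0, P^{-1}) + log W(P | d0, S0)."""
    n, p = X.shape
    _, logdet_P = np.linalg.slogdet(P)
    quad = np.einsum("ni,ij,nj->", X, P, X)  # sum_n x_n^T P x_n
    loglik = 0.5 * n * logdet_P - 0.5 * quad - 0.5 * n * p * np.log(2.0 * np.pi)
    return loglik + wishart.logpdf(P, df=d0, scale=S0)


def score_wishart(P, d, S):
    """Gradient of log W(P | d, S) with respect to d and S.

    Assumption: S is differentiated entry by entry as an unconstrained
    matrix, ignoring that it must stay symmetric positive definite.
    """
    p = P.shape[0]
    S_inv = np.linalg.inv(S)
    _, logdet_P = np.linalg.slogdet(P)
    _, logdet_S = np.linalg.slogdet(S)
    # d/dd log Gamma_p(d/2) = 0.5 * sum_{j=0}^{p-1} digamma((d - j) / 2)
    psi_p = digamma(0.5 * (d - np.arange(p))).sum()
    g_d = 0.5 * (logdet_P - logdet_S - p * np.log(2.0) - psi_p)
    g_S = 0.5 * (S_inv @ P @ S_inv - d * S_inv)
    return g_d, g_S


def score_gradient(X, d, S, d0, S0, n_samples=1000, seed=0):
    """Average of score * (log p(X, P_i) - log q(P_i)) over P_i ~ W(d, S)."""
    p = S.shape[0]
    draws = wishart.rvs(df=d, scale=S, size=n_samples, random_state=seed)
    draws = np.reshape(draws, (-1, p, p))
    grad_d, grad_S = 0.0, np.zeros((p, p))
    for P in draws:
        weight = log_joint(X, P, d0, S0) - wishart.logpdf(P, df=d, scale=S)
        g_d, g_S = score_wishart(P, d, S)
        grad_d += weight * g_d
        grad_S += weight * g_S
    return grad_d / n_samples, grad_S / n_samples


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 2))  # synthetic zero-mean 2D data (illustrative)
    # Two different seeds give two noisy estimates of the same gradient.
    for seed in (0, 1):
        g_d, g_S = score_gradient(X, 5.0, np.eye(2) / 5.0, 3.0, np.eye(2), seed=seed)
        print(seed, g_d, g_S)
```

Running the estimator with several seeds and comparing the outputs gives a rough feel for how noisy it is at a given sample size.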

Please help, and correct me if anything is wrong!