## #StackBounty: #self-study #covariance-matrix #variational-bayes #approximate-inference #rao-blackwell Rao-Blackwellization in variation…

### Bounty: 50

In section 3.1, the Black Box VI paper introduces Rao-Blackwellization as a method to reduce the variance of the score-function gradient estimator.

However, I don't quite get the basic idea behind those formulas. Any hints or help would be appreciated!

UPDATE

To make this question more self-contained, I’ll try to put in more details (also some thoughts of my own).

Suppose I have a 2-D Gaussian dataset $$X \sim N(\mu, P^{-1})$$ whose mean is known to be $$\mu = (0,0)$$ but whose precision matrix $$P$$ is unknown. I want to estimate $$P$$ using variational inference, which means finding a variational distribution $$q(P)$$ that approximates the true (unknown) posterior distribution $$p(P|X)$$ by minimizing the KL divergence $$KL(q \| p)$$. Minimizing this KL divergence is equivalent to maximizing a proxy objective, the ELBO, which is
$$L_{ELBO} = E_{q(P)}[\log p(X,P) - \log q(P)]$$
In my problem we have

\begin{align} p(X|P) \sim N(0,P^{-1}); & \qquad \text{likelihood as Gaussian} \\ p(P) \sim W(d_0,S_0); & \qquad \text{prior for P as Wishart} \\ q(P) \sim W(d,S); & \qquad \text{variational distribution for P as Wishart} \end{align}
Now the problem comes down to maximizing $$L_{ELBO}$$ to find the best variational parameters of $$q(P)$$, i.e. $$d, S$$.
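To make the objective concrete, here is a minimal sketch of a Monte Carlo estimate of the ELBO for this model. Everything in it that is not in the setup above is an assumption for illustration: the function names (`wishart_logpdf`, `log_joint`, `elbo_terms`), the toy data, and the particular values chosen for $$d_0, S_0, d, S$$. It also assumes $$W(d,S)$$ is parameterized by degrees of freedom $$d$$ and scale matrix $$S$$, which is the convention of `scipy.stats.wishart`.

```python
import numpy as np
from scipy.special import multigammaln
from scipy.stats import wishart


def wishart_logpdf(P, df, scale):
    """Log density of W(df, scale) for a stack of matrices P with shape (n, p, p)."""
    p = scale.shape[0]
    _, logdet_P = np.linalg.slogdet(P)
    _, logdet_S = np.linalg.slogdet(scale)
    # tr(S^{-1} P) for every matrix in the stack
    trace_term = np.einsum("ij,nji->n", np.linalg.inv(scale), P)
    return (0.5 * (df - p - 1) * logdet_P - 0.5 * trace_term
            - 0.5 * df * p * np.log(2.0) - 0.5 * df * logdet_S
            - multigammaln(0.5 * df, p))


def log_joint(P, X, df0, scale0):
    """log p(X, P) = sum_n log N(x_n | 0, P^{-1}) + log W(P | df0, scale0)."""
    n_obs, p = X.shape
    _, logdet_P = np.linalg.slogdet(P)
    # sum_n x_n^T P x_n = tr(P X^T X)
    quad = np.einsum("nij,ji->n", P, X.T @ X)
    log_lik = 0.5 * n_obs * (logdet_P - p * np.log(2.0 * np.pi)) - 0.5 * quad
    return log_lik + wishart_logpdf(P, df0, scale0)


def elbo_terms(X, df, scale, df0, scale0, n_sample, rng):
    """Per-sample values of log p(X, P_i) - log q(P_i), with P_i drawn from q = W(df, scale)."""
    p = scale.shape[0]
    P = wishart.rvs(df=df, scale=scale, size=n_sample, random_state=rng).reshape(-1, p, p)
    return log_joint(P, X, df0, scale0) - wishart_logpdf(P, df, scale)


# Toy data and parameter values below are made up for illustration.
rng = np.random.default_rng(0)
P_true = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(np.zeros(2), np.linalg.inv(P_true), size=200)
elbo_hat = elbo_terms(X, df=5.0, scale=np.eye(2), df0=3.0, scale0=np.eye(2),
                      n_sample=1000, rng=rng).mean()
print("Monte Carlo ELBO estimate:", elbo_hat)
```

Because the Wishart prior is conjugate here, the exact posterior is $$W(d_0 + N, (S_0^{-1} + X^T X)^{-1})$$ for $$N$$ observations. Plugging that in as $$q$$ should make every per-sample term equal to the same constant, $$\log p(X)$$, which is a handy sanity check for a sketch like this.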

We compute the gradient of the ELBO with respect to $$d$$ and $$S$$ so that we can run gradient ascent to maximize $$L$$. The general formula for the gradient of the ELBO with respect to the variational parameters is (see detail of derivation)
$$\nabla_{\lambda}L = E_{q}[\nabla_{\lambda}\log q(P|\lambda)\cdot(\log p(X,P)-\log q(P|\lambda))]$$
where $$\lambda$$ is shorthand for the variational parameters.
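
For completeness, here is my understanding of where that formula comes from. It is only a sketch, and it assumes the gradient and the integral can be interchanged. It uses the identity $$\nabla_{\lambda} q = q \nabla_{\lambda} \log q$$:

\begin{align} \nabla_{\lambda} L &= \nabla_{\lambda} \int q(P|\lambda)\,(\log p(X,P) - \log q(P|\lambda))\, dP \\ &= \int \nabla_{\lambda} q(P|\lambda)\,(\log p(X,P) - \log q(P|\lambda))\, dP - \int q(P|\lambda)\, \nabla_{\lambda} \log q(P|\lambda)\, dP \\ &= E_{q}[\nabla_{\lambda} \log q(P|\lambda) \cdot (\log p(X,P) - \log q(P|\lambda))] - E_{q}[\nabla_{\lambda} \log q(P|\lambda)] \end{align}

The last expectation vanishes because $$E_{q}[\nabla_{\lambda} \log q(P|\lambda)] = \int \nabla_{\lambda} q(P|\lambda)\, dP = \nabla_{\lambda} 1 = 0$$, which leaves the formula above.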

Given this gradient formula, we iteratively draw samples of $$P$$ from $$q(P|\lambda)$$, evaluate the term inside the expectation for each sample, and average the results to get a noisy estimate of the true gradient. We then apply a gradient ascent step to the variational parameters and repeat this process until convergence. The estimate is
$$\nabla_{\lambda}L \approx \frac{1}{n_{sample}} \sum_{i=1}^{n_{sample}} [\nabla_{\lambda}\log q(P_i|\lambda)\cdot(\log p(X,P_i)-\log q(P_i|\lambda))]$$
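
The estimator above can be sketched in code as follows. This is a rough illustration under my own assumptions, not the paper's implementation: the function names (`wishart_logpdf`, `log_joint`, `score`, `bbvi_gradient`), the toy data and the parameter values are all made up, $$W(d,S)$$ is taken to mean degrees of freedom $$d$$ and scale matrix $$S$$, and the gradient with respect to $$S$$ treats its entries as unconstrained (it ignores the symmetry and positive-definiteness constraints).

```python
import numpy as np
from scipy.special import digamma, multigammaln
from scipy.stats import wishart


def wishart_logpdf(P, df, scale):
    """Log density of W(df, scale) for a stack of matrices P with shape (n, p, p)."""
    p = scale.shape[0]
    _, logdet_P = np.linalg.slogdet(P)
    _, logdet_S = np.linalg.slogdet(scale)
    trace_term = np.einsum("ij,nji->n", np.linalg.inv(scale), P)  # tr(S^{-1} P)
    return (0.5 * (df - p - 1) * logdet_P - 0.5 * trace_term
            - 0.5 * df * p * np.log(2.0) - 0.5 * df * logdet_S
            - multigammaln(0.5 * df, p))


def log_joint(P, X, df0, scale0):
    """log p(X, P) = sum_n log N(x_n | 0, P^{-1}) + log W(P | df0, scale0)."""
    n_obs, p = X.shape
    _, logdet_P = np.linalg.slogdet(P)
    quad = np.einsum("nij,ji->n", P, X.T @ X)  # sum_n x_n^T P x_n
    log_lik = 0.5 * n_obs * (logdet_P - p * np.log(2.0 * np.pi)) - 0.5 * quad
    return log_lik + wishart_logpdf(P, df0, scale0)


def score(P, df, scale):
    """Gradient of log q(P | d, S) with respect to d and S, for a stack P of shape (n, p, p)."""
    p = scale.shape[0]
    S_inv = np.linalg.inv(scale)
    _, logdet_P = np.linalg.slogdet(P)
    _, logdet_S = np.linalg.slogdet(scale)
    # d/dd log q = 0.5 * (log|P| - p log 2 - log|S| - sum_j digamma((d - j) / 2)), j = 0..p-1
    psi_p = digamma(0.5 * (df - np.arange(p))).sum()
    score_d = 0.5 * (logdet_P - p * np.log(2.0) - logdet_S - psi_p)
    # d/dS log q = 0.5 * (S^{-1} P S^{-1} - d S^{-1}), entries of S treated as unconstrained
    score_S = 0.5 * (S_inv @ P @ S_inv - df * S_inv)
    return score_d, score_S


def bbvi_gradient(X, df, scale, df0, scale0, n_sample, rng):
    """Average of score(P_i) * (log p(X, P_i) - log q(P_i)) over samples P_i from q."""
    p = scale.shape[0]
    P = wishart.rvs(df=df, scale=scale, size=n_sample, random_state=rng).reshape(-1, p, p)
    weight = log_joint(P, X, df0, scale0) - wishart_logpdf(P, df, scale)
    score_d, score_S = score(P, df, scale)
    grad_d = np.mean(score_d * weight)
    grad_S = np.mean(score_S * weight[:, None, None], axis=0)
    return grad_d, grad_S


# Toy data and parameter values below are made up for illustration.
rng = np.random.default_rng(0)
P_true = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(np.zeros(2), np.linalg.inv(P_true), size=200)
grad_d, grad_S = bbvi_gradient(X, df=5.0, scale=np.eye(2), df0=3.0, scale0=np.eye(2),
                               n_sample=1000, rng=rng)
print("gradient w.r.t. d:", grad_d)
print("gradient w.r.t. S:", grad_S)
```

Calling `bbvi_gradient` several times with different seeds gives a rough feel for how noisy the estimate is, since each per-sample score is multiplied by the full value of $$\log p(X,P_i)-\log q(P_i|\lambda)$$.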

This particular noisy estimate can have high variance, which finally brings me to my question: how do we use Rao-Blackwellization to reduce the variance?