
I am currently reading the paper entitled "*Variational Auto-Encoded Deep Gaussian Processes*" by Dai et al., a copy of which may be found here.

The paper proposes stacking Gaussian Process Latent Variable Models in an ANN-like fashion and introduces a variational distribution, parameterised by a multilayer perceptron (MLP), to aid tractability.

**Before asking my question, I shall introduce some preliminaries for your convenience (paraphrased from the paper). If this is too long, I am happy to shorten it and rely more on the paper as a reference.**

The marginal likelihood of the DGP model is given in Equation 4 in the paper as

$\displaystyle p(\mathbf{Y}) = \int p(\mathbf{Y} \mid \mathbf{X}_{1}) \prod_{l=2}^{L} p(\mathbf{X}_{l-1} \mid \mathbf{X}_{l})\, p(\mathbf{X}_{L})\, d\mathbf{X}_{1} \ldots d\mathbf{X}_{L}$

where $\mathbf{Y} \in \mathbb{R}^{N \times D}$ is a matrix of observed data, $L$ is the number of layers of latent variables and, for layer $l$, $\mathbf{X}_{l} \in \mathbb{R}^{N \times Q_{l}}$ is a latent space representation of feature dimensionality $Q_{l} < Q_{l-1}$.

For layer 1 the output is $\mathbf{Y}$, i.e. with the convention $\mathbf{X}_{0} = \mathbf{Y}$, the term $p(\mathbf{X}_{l-1} \mid \mathbf{X}_{l})$ with $l = 1$ is $p(\mathbf{Y} \mid \mathbf{X}_{1})$.
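To make the stacked generative process concrete, here is a minimal numpy sketch (my own illustration, not code from the paper) of ancestral sampling from the model above, assuming an exponentiated-quadratic kernel at every layer:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Exponentiated-quadratic covariance between the rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq / lengthscale ** 2)

def sample_dgp(N, dims, jitter=1e-6, seed=0):
    """Ancestral sample: X_L ~ N(0, I), then each X_{l-1} | X_l is drawn
    column-wise from a GP prior evaluated at the rows of X_l (X_0 = Y).
    dims = [D, Q_1, ..., Q_L] with Q_l < Q_{l-1}."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((N, dims[-1]))          # top layer X_L ~ N(0, I)
    for q_out in reversed(dims[:-1]):               # walk down towards the data layer
        K = rbf_kernel(X, X) + jitter * np.eye(N)   # GP prior covariance at inputs X_l
        chol = np.linalg.cholesky(K)
        X = chol @ rng.standard_normal((N, q_out))  # one GP draw per output dimension
    return X                                        # X_0, i.e. a sample of Y
```

Sampling is easy in this direction; it is the reverse (inference) direction that requires the variational machinery below.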

A variational lower bound on the log of the above marginal likelihood is given as per Equation 5 in the paper

$\displaystyle \mathcal{L} = \sum_{l=1}^{L} \mathbb{E}_{q(\mathbf{X}_{l-1}) q(\mathbf{X}_{l})}\big[\log p(\mathbf{X}_{l-1} \mid \mathbf{X}_{l})\big] + \sum_{l=1}^{L-1} H(q(\mathbf{X}_{l})) - KL(q(\mathbf{X}_{L}) \mid\mid p(\mathbf{X}_{L}))$

where $H(\cdot)$ is the Shannon entropy and $KL(\cdot \mid\mid \cdot)$ is the Kullback–Leibler divergence. To make inference tractable, the variational distribution $q(\cdot)$ introduced above is defined as follows

$\displaystyle q(\mathbf{X}_{l}) = \prod_{n=1}^{N} \mathcal{N}(\mathbf{x}^{(n)}_{l} \mid \boldsymbol{\mu}^{(n)}_{l}, \boldsymbol{\Sigma}^{(n)}_{l})$

where $\boldsymbol{\mu}_{l}(\cdot)$ is the output of a multilayer perceptron taking $\mathbf{X}_{l-1}$ as input and $\boldsymbol{\Sigma}_{l}$ is the posterior variance, assumed to be diagonal and the same over all datapoints.
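As a concrete (toy) reading of this definition, the sketch below draws one sample from $q(\mathbf{X}_{l})$ with MLP-parameterised means and a shared diagonal covariance. The one-hidden-layer architecture, tanh nonlinearity and shapes are my own illustrative choices, not the paper's recognition model:

```python
import numpy as np

def mlp_mean(X_in, W1, b1, W2, b2):
    """Toy one-hidden-layer MLP producing the variational means mu_l^{(n)}."""
    H = np.tanh(X_in @ W1 + b1)
    return H @ W2 + b2

def sample_q(X_in, mlp_params, log_var, seed=0):
    """One sample from q(X_l) = prod_n N(x_l^{(n)} | mu_l^{(n)}, Sigma_l):
    row-wise means from the MLP, diagonal covariance shared across datapoints."""
    rng = np.random.default_rng(seed)
    mu = mlp_mean(X_in, *mlp_params)      # (N, Q_l) means, one row per datapoint
    std = np.exp(0.5 * log_var)           # shared per-dimension std, shape (Q_l,)
    return mu + std * rng.standard_normal(mu.shape)
```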

However, at this point the variational lower bound is still intractable, due to the expectation in the first term of $\mathcal{L}$. As such, auxiliary variables $\mathbf{U}_{l} \in \mathbb{R}^{M \times Q_{l}}$ are introduced, as are noise-free observations $\mathbf{F}_{l} \in \mathbb{R}^{N \times Q_{l-1}}$, and the first term of $\mathcal{L}$ is reformulated as follows (though at this point it is not clear to me how $\mathbf{F}_{l}$ differs from $\mathbf{X}_{l-1}$); see Equation 10 in the paper

$\displaystyle \mathbb{E}_{q(\mathbf{X}_{l-1}) q(\mathbf{X}_{l})}\big[\log p(\mathbf{X}_{l-1} \mid \mathbf{X}_{l})\big] \geq \mathbb{E}_{p(\mathbf{F}_{l} \mid \mathbf{U}_{l}, \mathbf{X}_{l})\, q(\mathbf{U}_{l} \mid \mathbf{X}_{l-1})\, q(\mathbf{X}_{l-1})\, q(\mathbf{X}_{l})}\big[\log p(\mathbf{X}_{l-1} \mid \mathbf{F}_{l}) - KL(q(\mathbf{U}_{l} \mid \mathbf{X}_{l-1}) \mid\mid p(\mathbf{U}_{l}))\big]$

Finally, the authors give a distributed (in the parallel-computation sense) form of the variational lower bound $\mathcal{L}$, with the first term taking the following form

$\displaystyle Tr\big(\mathbb{E}_{q(\mathbf{X}_{l-1})}\big[\mathbf{X}^{T}_{l-1} \mathbf{X}_{l-1}\big]\big) = \sum_{n=1}^{N} \Big( (\boldsymbol{\mu}^{(n)}_{l-1})^{T} \boldsymbol{\mu}^{(n)}_{l-1} + Tr\big(\boldsymbol{\Sigma}^{(n)}_{l-1}\big) \Big)$
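This identity is straightforward to check numerically: under $q$ the trace is just a sum of second moments. A quick Monte Carlo sanity check (my own, with arbitrary dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, Q = 6, 3
mu = rng.standard_normal((N, Q))          # variational means mu^{(n)}_{l-1}
var = rng.uniform(0.1, 1.0, size=(N, Q))  # diagonals of Sigma^{(n)}_{l-1}

# Closed form: sum_n mu^{(n)T} mu^{(n)} + Tr(Sigma^{(n)})
closed = np.sum(mu ** 2) + np.sum(var)

# Monte Carlo estimate of Tr(E[X^T X]) under q(X) = prod_n N(mu^{(n)}, Sigma^{(n)})
S = 200_000
X = mu[None] + np.sqrt(var)[None] * rng.standard_normal((S, N, Q))
mc = (X ** 2).sum(axis=(1, 2)).mean()     # Tr(X^T X) = sum of squared entries

print(closed, mc)  # the two agree up to Monte Carlo error
```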

and the second term taking the following form

$\displaystyle Tr\big( \boldsymbol{\Lambda}_{l}^{-1} \boldsymbol{\Psi}_{l}^{T}\, \mathbb{E}_{q(\mathbf{X}_{l-1})}\big[\mathbf{X}_{l-1} \mathbf{X}_{l-1}^{T}\big]\, \boldsymbol{\Psi}_{l} \big) = Tr\big( \boldsymbol{\Lambda}_{l}^{-1} (\boldsymbol{\Psi}_{l}^{T} \mathbf{R}_{l-1}^{T})(\mathbf{R}_{l-1} \boldsymbol{\Psi}_{l}) \big) + Tr\bigg( \boldsymbol{\Lambda}_{l}^{-1} \bigg( \sum_{n=1}^{N} \boldsymbol{\Psi}_{l}^{(n)} \alpha_{l-1}^{(n)} \bigg) \bigg( \sum_{n=1}^{N} \boldsymbol{\Psi}_{l}^{(n)} \alpha_{l-1}^{(n)} \bigg)^{T} \bigg)$

where $\boldsymbol{\Lambda}_{l} = \mathbf{K}_{\mathbf{U}_{l}\mathbf{U}_{l}} + \mathbb{E}_{q(\mathbf{X}_{l})}\big[\mathbf{K}^{T}_{\mathbf{F}_{l}\mathbf{U}_{l}} \mathbf{K}_{\mathbf{F}_{l}\mathbf{U}_{l}}\big]$ for covariance matrices $\mathbf{K}_{<\ldots>}$ generated by a covariance kernel such as the exponentiated quadratic. Similarly, $\boldsymbol{\Psi}_{l} = \mathbb{E}_{q(\mathbf{X}_{l})}\big[\mathbf{K}_{\mathbf{F}_{l}\mathbf{U}_{l}}\big]$.
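Concretely, $\boldsymbol{\Psi}_{l}$ is an element-wise expectation: entry $(n, m)$ is the expected kernel value between latent point $n$ and inducing input $m$ under the Gaussian $q$. The sketch below (my own, with an assumed RBF kernel and made-up inducing inputs $\mathbf{Z}$) estimates it by brute-force Monte Carlo, purely to make explicit what is being averaged:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, Q = 5, 4, 2
mu = rng.standard_normal((N, Q))       # variational means of q(X_l)
var = rng.uniform(0.1, 0.5, size=Q)    # shared diagonal variances
Z = rng.standard_normal((M, Q))        # assumed inducing inputs underlying U_l

# Draw S samples X ~ q(X_l), form the N x M RBF kernel matrix K_{F U} for each,
# and average: Psi[n, m] estimates E_q[k(x^{(n)}, z^{(m)})]
S = 50_000
X = mu[None] + np.sqrt(var) * rng.standard_normal((S, N, Q))
sq = ((X[:, :, None, :] - Z[None, None, :, :]) ** 2).sum(-1)   # (S, N, M) sq. dists
Psi = np.exp(-0.5 * sq).mean(axis=0)                           # (N, M) estimate
```

Whether such expectations admit a closed form (rather than sampling) is exactly what I ask about below.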

Additionally, $\mathbf{R}_{l-1} = \big[ (\boldsymbol{\mu}^{(1)}_{l-1})^{T} \dots (\boldsymbol{\mu}^{(N)}_{l-1})^{T} \big]$, $\alpha_{l-1}^{(n)} = \sqrt{Tr\big(\boldsymbol{\Sigma}_{l-1}^{(n)}\big)}$ and $\mathbf{A}_{l-1} = diag\big(\alpha_{l-1}^{(1)} \dots \alpha_{l-1}^{(N)}\big)$.

Finally, the second term of the above distributed form is obtained by making the following observation

$\mathbb{E}_{q(\mathbf{X}_{l-1})}\big[ \mathbf{X}_{l-1} \mathbf{X}_{l-1}^{T} \big] = \mathbf{R}_{l-1}^{T} \mathbf{R}_{l-1} + \mathbf{A}_{l-1} \mathbf{A}_{l-1}$

**Points of Confusion – Question(s)**

Given the above formulation, there are a few computational issues that I am at present not entirely clear on.

Firstly, how in general does one take the expectation of a covariance matrix generated by a kernel such as the exponentiated quadratic?

For example, in the above formulation the computation of the quantity $\boldsymbol{\Psi}_{l}$ is not clear to me. By the definitions given in the paper, to evaluate this quantity one takes the expectation w.r.t. the variational distribution $q(\mathbf{X}_{l})$, which is itself defined to be the product of Normal PDF evaluations of the GP latent variables given their MLP encodings – a scalar.

It is not clear to me how to use this to take the expectation of an arbitrary covariance/cross-covariance matrix $\mathbf{K} \in \mathbb{R}^{N \times M}$. As such, the evaluation of the quantity defined by the expectation in the second term of $\boldsymbol{\Lambda}_{l}$ is also unclear to me, as we are again attempting to take the expectation of a cross-covariance matrix w.r.t. the variational distribution $q(\mathbf{X}_{l})$.

My central question is, given a formulation like the one set out above, how does one handle these expectation terms?

In addition, if anybody is familiar with this work and/or area of research, it would also be of great assistance to clarify the form of $\mathbf{F}_{l}$ and $\mathbf{U}_{l}$ for an arbitrary layer. To my understanding, for $l=1$, the data layer, $\mathbf{F}$ is a subset of the data points (removed from the dataset) and $\mathbf{U}$ the corresponding latent variables. However, what about arbitrary $l$?

Any assistance would be greatly appreciated.
