*Bounty: 50*

Given a Gaussian process with kernel function $K_{\theta}$ depending on some hyperparameters $\theta$ and a set of observations $\{(x_i,y_i)\}_{i=1}^n$, I want to choose $\theta$ to maximize the likelihood of the observations. This amounts to solving the optimization
$$\max_{\theta}\left[-y^t C(\theta)^{-1}y-\log\det(C(\theta))\right]=:\max_{\theta}L(\theta,x,y)$$

where $C(\theta)$ is the covariance matrix with entries $C(\theta)_{ij}=K_{\theta}(x_i,x_j)$.

Now, the gradient $d_{\theta}L$ can be derived straightforwardly using matrix calculus, so in principle one could just use gradient descent. The issue is that calculating $d_{\theta}L$ requires computing $C(\theta)^{-1}$, an $O(n^3)$ operation, so even for modest numbers of points this can be prohibitive.
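For reference, matrix calculus gives, for each hyperparameter $\theta_j$,
$$\partial_{\theta_j} L = y^t C^{-1}\frac{\partial C}{\partial\theta_j}C^{-1}y - \mathrm{tr}\!\left(C^{-1}\frac{\partial C}{\partial\theta_j}\right),$$
which involves $C(\theta)^{-1}$ explicitly. In practice $L$ itself is usually evaluated through a Cholesky factorization rather than an explicit inverse, though the $O(n^3)$ factorization remains the bottleneck. A minimal NumPy sketch (the RBF kernel and its single length-scale parameter are my own illustrative choices, not part of the question):

```python
import numpy as np

def rbf_kernel(X, ell=1.0):
    """Squared-exponential kernel; the length-scale ell stands in for theta."""
    sq = (X[:, None] - X[None, :]) ** 2
    return np.exp(-sq / (2.0 * ell ** 2))

def log_likelihood(ell, X, y, jitter=1e-6):
    """L(theta,x,y) = -y^t C^{-1} y - log det C, via a Cholesky factorization.
    The O(n^3) factorization is the expensive step for large n."""
    C = rbf_kernel(X, ell) + jitter * np.eye(len(y))
    Lc = np.linalg.cholesky(C)
    alpha = np.linalg.solve(Lc.T, np.linalg.solve(Lc, y))  # alpha = C^{-1} y
    logdet = 2.0 * np.sum(np.log(np.diag(Lc)))
    return -y @ alpha - logdet
```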

What happens if we try to use Stochastic Gradient Descent? That is, instead of using the full gradient at each step, we use the gradient of the estimator

$$\hat{L}_B(\theta,x,y):= -y_B^tC(\theta)_B^{-1}y_B-\log\det(C(\theta)_B)$$

where $B\subset\{1,\dots,n\}$, $|B|=k$ is sampled at random at each step, and $y_B$ denotes the vector formed by keeping only those indices contained in $B$ (similarly, $C(\theta)_B$ denotes the submatrix obtained by keeping only the rows and columns whose indices are in $B$). Note that this is equal (up to linear rescaling) to the log-likelihood of the Gaussian process restricted to the points $x_B$.
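Concretely, each step of this scheme only ever factorizes a $k\times k$ submatrix. A sketch of one step (the RBF kernel, its length-scale parameterization, and the central finite-difference gradient are illustrative assumptions; in practice one would use the analytic gradient):

```python
import numpy as np

def rbf_kernel(X, ell):
    """Squared-exponential kernel; the length-scale ell stands in for theta."""
    sq = (X[:, None] - X[None, :]) ** 2
    return np.exp(-sq / (2.0 * ell ** 2))

def minibatch_objective(ell, X, y, batch, jitter=1e-6):
    """hat{L}_B: the objective evaluated only on the indices in `batch`.
    Cost is O(k^3) for |batch| = k, instead of O(n^3) for the full objective."""
    Xb, yb = X[batch], y[batch]
    Cb = rbf_kernel(Xb, ell) + jitter * np.eye(len(batch))
    Lc = np.linalg.cholesky(Cb)
    alpha = np.linalg.solve(Lc.T, np.linalg.solve(Lc, yb))
    return -yb @ alpha - 2.0 * np.sum(np.log(np.diag(Lc)))

# One SGD-style ascent step: sample B, estimate the gradient in ell,
# and move ell in the ascent direction of the stochastic objective.
rng = np.random.default_rng(1)
n, k = 200, 20
X = rng.uniform(-2.0, 2.0, n)
y = np.sin(3.0 * X) + 0.1 * rng.standard_normal(n)
ell, lr, eps = 1.0, 1e-3, 1e-5
B = rng.choice(n, size=k, replace=False)
g = (minibatch_objective(ell + eps, X, y, B)
     - minibatch_objective(ell - eps, X, y, B)) / (2.0 * eps)
ell += lr * g
```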

**My question is**: is there any theoretical work exploring the properties of this scheme, and are there any major disadvantages to it? Most of the theoretical work on SGD that I have seen assumes the observations are independent conditional on $\theta$, allowing a decomposition of the loss of the form $\sum_i l(\theta,x_i,y_i)$; that is not the case here.

I can think of one advantage and one disadvantage. The advantage: we only have to invert a $k\times k$ matrix at each step, so the computation can be much faster. The disadvantage: this does not give an unbiased estimator of the gradient, since in general $E_B\,\hat{L}_B(\theta,x,y)\neq L(\theta,x,y)$. However, empirically this seems to work reasonably well on some datasets I have tried.
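The bias can be checked numerically: draw synthetic data from a GP, average the rescaled minibatch objective $(n/k)\hat{L}_B$ over many random batches, and compare with the full objective. A sketch (the kernel, sizes, and jitter below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(X, ell=1.0):
    """Squared-exponential kernel (illustrative choice)."""
    sq = (X[:, None] - X[None, :]) ** 2
    return np.exp(-sq / (2.0 * ell ** 2))

def objective(C, y):
    """-y^t C^{-1} y - log det C, via Cholesky."""
    Lc = np.linalg.cholesky(C)
    alpha = np.linalg.solve(Lc.T, np.linalg.solve(Lc, y))
    return -y @ alpha - 2.0 * np.sum(np.log(np.diag(Lc)))

n, k = 40, 8
X = np.sort(rng.uniform(-3.0, 3.0, n))
C = rbf_kernel(X) + 1e-6 * np.eye(n)
y = np.linalg.cholesky(C) @ rng.standard_normal(n)  # exact GP sample

full = objective(C, y)

# Average the rescaled minibatch objective over many random batches.
vals = []
for _ in range(2000):
    B = rng.choice(n, size=k, replace=False)
    vals.append(objective(C[np.ix_(B, B)], y[B]))
mini = (n / k) * np.mean(vals)

print(full, mini)  # the two disagree: (n/k) E_B hat{L}_B != L in general
```

The quadratic terms actually match in expectation (since $y_B\sim N(0,C(\theta)_B)$ exactly, $E\,y_B^tC_B^{-1}y_B=k$), so the bias comes from the $\log\det$ term, which does not decompose additively over subsets.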