## #StackBounty: #maximum-likelihood #gradient-descent #gaussian-process #sgd SGD for Gaussian Process estimation

### Bounty: 50

Given a Gaussian process with kernel function $$K_{\theta}$$ depending on some hyperparameters $$\theta$$ and a set of observations $$\{(x_i,y_i)\}_{i=1}^n$$, I want to choose $$\theta$$ to maximize the likelihood of the observations. This amounts to solving the optimization
$$\max_{\theta} \left[ -y^T C(\theta)^{-1} y - \log\det C(\theta) \right] =: \max_{\theta} L(\theta,x,y)$$
where $$C(\theta)$$ is the covariance matrix $$C(\theta)_{ij} = K_{\theta}(x_i,x_j)$$.

Now, the gradient $$\nabla_{\theta} L$$ can be derived straightforwardly using matrix calculus, so in principle one could just use gradient descent. The issue is that computing $$\nabla_{\theta} L$$ requires computing $$C(\theta)^{-1}$$, an $$O(n^3)$$ operation, so even for a modest number of points this can be prohibitive.
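For concreteness, here is a minimal NumPy sketch of the objective and its gradient. It assumes a squared-exponential kernel with a single lengthscale hyperparameter $$\theta$$; the kernel choice and the `jitter` term are my own illustrative assumptions, not part of the question. The Cholesky factorization makes the $$O(n^3)$$ cost explicit:

```python
import numpy as np

def loglik_and_grad(theta, x, y, jitter=1e-6):
    """Return L(theta) = -y^T C^{-1} y - log det C and dL/dtheta
    for an RBF kernel C_ij = exp(-(x_i - x_j)^2 / (2 theta^2))."""
    n = len(x)
    d2 = (x[:, None] - x[None, :]) ** 2
    C = np.exp(-0.5 * d2 / theta**2) + jitter * np.eye(n)
    dC = C * d2 / theta**3                  # elementwise dC/dtheta (diag of d2 is 0)
    chol = np.linalg.cholesky(C)            # O(n^3): the bottleneck discussed above
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y))  # C^{-1} y
    logdet = 2.0 * np.sum(np.log(np.diag(chol)))
    L = -y @ alpha - logdet
    Cinv = np.linalg.solve(chol.T, np.linalg.solve(chol, np.eye(n)))
    grad = alpha @ dC @ alpha - np.trace(Cinv @ dC)
    return L, grad
```

The gradient uses the standard identities $$\partial_\theta (y^T C^{-1} y) = -y^T C^{-1} (\partial_\theta C) C^{-1} y$$ and $$\partial_\theta \log\det C = \mathrm{tr}(C^{-1} \partial_\theta C)$$.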

What happens if we try to use stochastic gradient descent? That is, instead of using the full gradient at each step, we use the gradient of the estimator
$$\hat{L}_B(\theta,x,y) := -y_B^T C(\theta)_B^{-1} y_B - \log\det C(\theta)_B$$
where $$B \subset \{1,\dots,n\}$$ with $$|B| = k$$ is sampled at random at each step, and $$y_B$$ denotes the vector formed by keeping only the entries whose indices are in $$B$$ (similarly, $$C(\theta)_B$$ denotes the submatrix obtained by keeping only the rows and columns whose indices are in $$B$$). Note that this equals, up to a linear rescaling, the log-likelihood of the Gaussian process restricted to the points $$x_B$$.
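Under the same illustrative RBF-kernel assumption as above, the scheme itself is only a few lines. The function name `sgd_gp`, the gradient clip, the positivity floor, and all step sizes are hypothetical choices of mine for the sketch, not part of the question:

```python
import numpy as np

def sgd_gp(x, y, theta0, k=32, steps=300, lr=1e-3, seed=0):
    """Gradient ascent on the subset estimator hat{L}_B with |B| = k."""
    rng = np.random.default_rng(seed)
    theta = theta0
    n = len(x)
    for _ in range(steps):
        B = rng.choice(n, size=k, replace=False)   # random minibatch of indices
        xB, yB = x[B], y[B]
        d2 = (xB[:, None] - xB[None, :]) ** 2
        C = np.exp(-0.5 * d2 / theta**2) + 1e-6 * np.eye(k)
        dC = C * d2 / theta**3
        Cinv = np.linalg.inv(C)                    # only k x k: cheap for small k
        alpha = Cinv @ yB
        grad = alpha @ dC @ alpha - np.trace(Cinv @ dC)
        grad = np.clip(grad, -100.0, 100.0)        # crude stabiliser for the sketch
        theta += lr * grad                         # ascent: we maximise hat{L}_B
        theta = max(theta, 1e-2)                   # keep the lengthscale positive
    return theta
```

Each step costs $$O(k^3)$$ instead of $$O(n^3)$$, which is the whole point of the scheme.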

My question is: is there any theoretical work exploring the properties of this scheme, and are there any major disadvantages to it? Most of the theoretical work on SGD that I have seen assumes the observations are independent conditional on $$\theta$$, allowing a decomposition of the loss of the form $$\sum_i l(\theta,x_i,y_i)$$; however, that is not the case here.

I can think of one advantage and one disadvantage. The advantage: we only have to invert a $$k \times k$$ matrix at each step, so the computation can be much faster. The disadvantage: this does not give an unbiased estimator of the gradient, since in general $$E_B \hat{L}_B(\theta,x,y) \neq L(\theta,x,y)$$. However, empirically this seems to work reasonably well on some datasets I have tried.
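The bias is easy to check numerically. Below is a small Monte Carlo sketch of my own (all sizes are illustrative): it draws $$y$$ from a GP with an RBF kernel, then compares the full log-likelihood with the average subset log-likelihood rescaled by $$n/k$$. The two do not agree, chiefly because $$\log\det$$ of a principal submatrix does not scale linearly with the subset size:

```python
import numpy as np

# Draw y from a GP with an RBF kernel, then compare the full log-likelihood
# with the rescaled average of subset log-likelihoods over random B, |B| = k.
rng = np.random.default_rng(1)
n, k, theta = 30, 10, 1.0
x = rng.normal(size=n)
d2 = (x[:, None] - x[None, :]) ** 2
C = np.exp(-0.5 * d2 / theta**2) + 1e-6 * np.eye(n)
y = np.linalg.cholesky(C) @ rng.normal(size=n)

def loglik(idx):
    Ci = C[np.ix_(idx, idx)]
    yi = y[idx]
    _, logdet = np.linalg.slogdet(Ci)
    return -yi @ np.linalg.solve(Ci, yi) - logdet

full = loglik(np.arange(n))
mean_sub = np.mean([loglik(rng.choice(n, size=k, replace=False))
                    for _ in range(2000)])
print(full, mean_sub * n / k)   # the two values do not agree
```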

Get this bounty!!!

## #StackBounty: #neural-networks #ensemble #sgd Neural network doesn't converge but has good performance

### Bounty: 100

I have a sequence (> 100 million) of symbols and several models predict the next symbol. To combine these predictions I’m using stacked generalization with a multilayer perceptron trained with online gradient descent.

For the inputs, the network uses the predictions of the models, so the total number of inputs is nmodels * nsymbols. As outputs, there is one node per symbol. The network is trained on each new symbol.
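As a concrete (hypothetical) rendering of this setup, here is a minimal online-trained softmax MLP that takes the concatenated base-model predictions as input and performs one gradient step per observed symbol; all sizes, the `step` helper, and the learning rate are made-up illustrations, not details from the question:

```python
import numpy as np

rng = np.random.default_rng(0)
nmodels, nsymbols, nhidden = 3, 4, 16          # made-up sizes for illustration
W1 = rng.normal(scale=0.1, size=(nmodels * nsymbols, nhidden))
W2 = rng.normal(scale=0.1, size=(nhidden, nsymbols))

def step(preds, target, lr=0.05):
    """One online SGD step. preds: (nmodels, nsymbols) base-model probabilities;
    target: index of the symbol that actually occurred. Returns the net's p."""
    global W1, W2
    x = preds.reshape(-1)                      # concatenated nmodels * nsymbols inputs
    h = np.tanh(x @ W1)
    z = h @ W2                                 # one output node per symbol
    p = np.exp(z - z.max())
    p /= p.sum()                               # softmax over symbols
    dz = p.copy()
    dz[target] -= 1.0                          # cross-entropy gradient at the output
    dW2 = np.outer(h, dz)
    dh = W2 @ dz
    dW1 = np.outer(x, dh * (1.0 - h**2))       # backprop through tanh
    W1 -= lr * dW1
    W2 -= lr * dW2
    return p
```

Calling `step` once per incoming symbol reproduces the online setting; whether the weights ever settle then depends on the stationarity of the stream.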

The network doesn’t stabilize: if training stops, the predictions get worse, and the weights keep changing. But the predictions are better than those of any single model, and better than any simple combination such as weighted majority, exponential moving average, etc.

I’ve also tried using as inputs the last N predictions (N * nmodels * nsymbols); the last N predictions plus the last N symbols of the sequence; and the last N predictions plus the last N symbols plus the last N probabilities the network assigned to the actual symbol. In all cases the predictions improve, but still no stabilization.

The sequence sometimes has long sub-sequences where the models' predictions are very accurate, but most of the time this doesn’t happen.

My main concern is with understanding why this happens. Any ideas?

EDIT: The sequence appears to be non-stationary.

Get this bounty!!!
