*Bounty: 50*

I have fitted a Gaussian Process (GP) to perform a binary classification task. The **dataset is balanced**, so the training set contains an equal number of samples with label 0 and label 1. The **covariance function is an RBF kernel**, whose "length scale" hyperparameter needs to be tuned.
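The setup can be sketched as follows. This is a minimal illustration, assuming scikit-learn's `GaussianProcessClassifier` (the question does not name a library), with a toy dataset and an arbitrary length scale:

```python
# Minimal sketch of the setup: GP binary classification with an RBF kernel.
# Library and data are assumptions; the question does not specify them.
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Roughly balanced toy dataset with binary labels.
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# RBF kernel with the length-scale hyperparameter to be tuned.
kernel = RBF(length_scale=0.5)
gpc = GaussianProcessClassifier(kernel=kernel).fit(X, y)

print(gpc.score(X, y))                     # overall accuracy (OA)
print(gpc.log_marginal_likelihood_value_)  # LML of the fitted model
```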

To make sure I am not overfitting the data and that I am selecting proper kernel hyperparameters, I performed a grid search over the percentage of training data and the length scale, recording as statistical metrics the overall accuracy (OA) and the log-marginal likelihood (LML) on the test set.
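A grid search of this kind might look like the sketch below. Grid values, dataset, and names are illustrative, not taken from the question; note that in scikit-learn the LML is evaluated on the training data of the fitted model:

```python
# Hedged sketch of the described grid search: vary the training fraction
# and the RBF length scale, recording test OA and the model's LML.
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)

results = {}
for train_frac in [0.1, 0.3, 0.5]:          # illustrative grid values
    for length_scale in [0.1, 0.5, 1.0]:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=train_frac, stratify=y, random_state=0)
        # optimizer=None keeps the length scale fixed at the grid value
        gpc = GaussianProcessClassifier(
            kernel=RBF(length_scale=length_scale),
            optimizer=None).fit(X_tr, y_tr)
        oa = gpc.score(X_te, y_te)                # overall accuracy
        lml = gpc.log_marginal_likelihood_value_  # LML (on training data)
        results[(train_frac, length_scale)] = (oa, lml)
```

The `(OA, LML)` pairs in `results` are what would be plotted as the two heat maps.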

You can see the results in the following image (left for OA, right for LML):

EDIT: I re-uploaded the image with the normalized log-marginal likelihood. Common sense indicates that the optimal model should strike a trade-off between model complexity and accuracy metrics. Thus, these models lie somewhere between 30%-40% of training data and a length scale of 0.7-0.9 for the RBF kernel within the GP. This is great for model selection, but unfortunately, I think I still cannot answer the questions below… Any new insights on the interpretation of the LML?

After exploring the effect of training size and the hyperparameter on these metrics, I think it would be safe to select a model using at least 30% of the data for training and an RBF length scale of 0.1. However, I do not understand the role of the LML in selecting the model (or even whether it needs to be considered at all); common sense suggests its magnitude should be as small as possible (i.e. around -400, shown in yellow). By that reasoning, my best model is located at training size = 10-20% and length_scale = 0.1.
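One way to get a sense of scale for numbers like -400: the LML is a sum of per-point log-probabilities, so its raw magnitude grows with the training size, and dividing by the number of training points gives a comparable per-point value. The training size below is hypothetical, purely for illustration:

```python
# Back-of-envelope scale check for the LML values quoted above.
# n_train is a hypothetical number of training points, not from the question.
import numpy as np

n_train = 1000
for lml in (-400.0, -700.0):
    per_point = lml / n_train
    # exp(per_point) is the geometric-mean probability assigned per label
    print(f"LML={lml}: {per_point:.3f} nats/point, "
          f"geometric-mean p = {np.exp(per_point):.3f}")

# Baseline: for binary labels, a coin flip assigns log(0.5) ≈ -0.693 nats
# per point, i.e. an LML of roughly -693 for 1000 points.
print(np.log(0.5) * n_train)
```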

I have seen that other people (here and here) have asked (somewhat) similar questions about the LML, but I can't find ideas that help me understand the link between good OA metrics and the LML. In other words, I am having trouble interpreting the LML.

In concrete terms, I would like more insight on:

- What is the impact of a high/low LML on the predictive power of the GP?
- How much better is a model with LML = -400 compared to one with LML = -700?
- What does it mean to have an LML of -400? Isn't -400 a lot for a statistical metric?
- Did I **really** find a solution to my problem with these LML metrics?

Thanks for your help!