I have fitted a Gaussian Process (GP) to perform a binary classification task. The dataset is balanced, so I have an equal number of training samples with labels 0 and 1. The covariance function is an RBF kernel, whose "length scale" hyperparameter needs to be tuned.
To make sure that I am not overfitting the data and that I am selecting proper kernel hyperparameters, I performed a grid search over the percentage of data used for training and the length scale, recording as statistical metrics the overall accuracy (OA) and the log-marginal likelihood (LML) on the test set.
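For reference, the grid search could look roughly like the sketch below. It is not my exact code: it assumes scikit-learn's `GaussianProcessClassifier`, uses synthetic data from `make_classification` as a stand-in for the real dataset, and the grid values are only illustrative. Note that in scikit-learn the LML reported by `log_marginal_likelihood_value_` is computed from the training data of each fit.

```python
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import train_test_split

# Synthetic, roughly balanced stand-in for the real dataset (assumption).
X, y = make_classification(n_samples=200, n_features=5,
                           weights=[0.5, 0.5], random_state=0)

train_fractions = [0.1, 0.3, 0.5]  # illustrative grid, not the original one
length_scales = [0.1, 0.5, 1.0]    # illustrative grid, not the original one

results = []
for frac in train_fractions:
    # Stratified split keeps the 0/1 balance in the training subset.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=frac, stratify=y, random_state=0)
    for ls in length_scales:
        # optimizer=None keeps the length scale fixed at the grid value
        # instead of letting the fit maximise the LML over it.
        gpc = GaussianProcessClassifier(kernel=RBF(length_scale=ls),
                                        optimizer=None)
        gpc.fit(X_tr, y_tr)
        oa = gpc.score(X_te, y_te)               # overall accuracy on held-out data
        lml = gpc.log_marginal_likelihood_value_  # LML of the fitted (training) data
        results.append((frac, ls, oa, lml, lml / len(y_tr)))

for frac, ls, oa, lml, lml_per_sample in results:
    print(f"train={frac:.0%} ls={ls}: OA={oa:.3f} "
          f"LML={lml:.1f} LML/n={lml_per_sample:.3f}")
```

The last column divides the LML by the number of training samples, which is one possible way to compare values across different training sizes.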
You can see the results in the following image (left for OA, right for LML):
EDIT: I re-uploaded the image with the normalized log-marginal likelihood. Common sense indicates that an optimal model should find a trade-off between model complexity and accuracy metrics. Thus, these models lie somewhere between 30% and 40% of training data and a length scale of 0.7-0.9 for the RBF kernel within the GP. This is great for model selection, but unfortunately, I think I still cannot answer the questions below… Any new insights on the interpretation of the LML?
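To make the normalization concrete, here is a small sketch of one possible reading: dividing the LML by the number of training samples so that values from different training sizes are on the same scale. The sample counts below are hypothetical, and the comparison with log(0.5) (the per-sample log-likelihood of always predicting probability 0.5) is only a rough reference point, not an established rule.

```python
import math

def per_sample_lml(lml, n_train):
    """Average log-marginal likelihood per training sample."""
    return lml / n_train

# Per-sample log-likelihood of a classifier that always outputs p = 0.5.
chance_level = math.log(0.5)

# Hypothetical training-set sizes; the real ones depend on the dataset.
for n_train in (500, 1000):
    avg = per_sample_lml(-400.0, n_train)
    print(f"n={n_train}: LML/n={avg:.3f}, "
          f"above chance reference: {avg > chance_level}")
```

The same raw value of -400 reads very differently depending on how many samples it is spread over, which is why the raw LML is hard to compare across training sizes.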
After exploring the effect of the training size and the hyperparameter on the statistical metrics, I think it would be safe to select a model that uses at least 30% of the data for training and an RBF length scale of 0.1. However, I do not understand the role of the LML in selecting the model (or even whether it needs to be considered), but common sense suggests that it should be as small as possible (i.e. around -400, represented in yellow). This means my best model is located at training size = 10-20% and length_scale=0.1.
I have seen that other people (here and here) have (somewhat) similar questions regarding the LML, but I can't find ideas that help me understand the link between good OA values and the LML. In other words, I am having trouble interpreting the LML.
Concretely, I would like to get more insight into:
- What is the impact of a high/low LML on the predictive power of the
- How much better is a model with LML=-400 compared to one with
- What does it mean to have an LML of -400? Isn't -400 a lot for a statistical metric?
- Did I really find a solution to my problem with these LML metrics?
Thanks for your help!