I have a large-ish set of unevenly-spaced time-series data from instruments around the world, for which I’m using Gaussian process regression to do interpolation and short-term future prediction. My kernel is a combination of squared-exponential and periodic components to model the trends at different time-scales, and it’s working reasonably well. At this point I’m working on optimizing the model and, in particular, making sure that the variances that come out of it are reasonable.
The measurements I’m using for input come with a “confidence score” in arbitrary units from 0 (worst) to 100 (best). I would like to pass these as per-sample noise variances into the GPR (the alpha parameter of sklearn’s GaussianProcessRegressor), but I have no guidance on how much error a given confidence score indicates, so I need to discover on my own a function that maps CS to variance. What’s the best way to go about this?
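For concreteness, here is a minimal sketch of what I mean by passing per-sample variances. The cs_to_variance function (and its a, b parameters) is a made-up placeholder standing in for exactly the mapping I’m trying to learn; the data is synthetic:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 10, 50)).reshape(-1, 1)  # unevenly spaced times
y = np.sin(t).ravel() + rng.normal(0, 0.1, 50)      # synthetic measurements
cs = rng.uniform(0, 100, 50)                        # confidence scores, 0-100

# Placeholder mapping from confidence score to noise variance --
# this is the unknown function the question is about.
def cs_to_variance(cs, a=1.0, b=0.05):
    return a * np.exp(-b * cs)

# Squared-exponential + periodic kernel, per-sample noise via alpha
# (alpha accepts an array of shape (n_samples,) added to the kernel diagonal).
kernel = RBF(length_scale=1.0) + ExpSineSquared(length_scale=1.0, periodicity=3.0)
gpr = GaussianProcessRegressor(kernel=kernel, alpha=cs_to_variance(cs))
gpr.fit(t, y)
mean, std = gpr.predict(t, return_std=True)
```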
So far I’ve had the fairly obvious idea of taking batches of the data, fitting the GPR model to each batch, recording each point’s CS along with the difference between the measurement and the predicted mean at that point, and then fitting a curve that maps CS to mean squared error. However, the devil is in the details:

- Am I better off including all points, or measuring errors only on a held-out set?
- Should I iterate, doing successive runs with variances learned from the previous epoch, or use fixed variances throughout? (I worry the former could diverge: larger input variances lead to larger output variances, which lead to learning larger input variances.)
- Should I be doing something more Bayesian with log-likelihood, or otherwise take into account where the point falls in the predicted distribution, rather than just its distance from the predicted mean?

I’m afraid I’ve gone a bit beyond my mathematical knowledge here.
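To make the batch idea concrete, here is roughly the procedure I have in mind, in the held-out variant: collect (CS, squared residual) pairs across K-fold splits, then fit an assumed parametric form with scipy’s curve_fit. Everything here is synthetic, and the exponential form of the CS-to-variance map is a guess on my part:

```python
import numpy as np
from scipy.optimize import curve_fit
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
n = 200
t = np.sort(rng.uniform(0, 10, n)).reshape(-1, 1)
cs = rng.uniform(0, 100, n)
true_sigma = 0.5 * np.exp(-0.03 * cs)          # synthetic "ground truth" noise
y = np.sin(t).ravel() + rng.normal(0, true_sigma)

# Collect (CS, squared residual) pairs on held-out folds, so each point's
# error is measured against a model that never saw that point.
cs_all, sq_err = [], []
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(t):
    gpr = GaussianProcessRegressor(kernel=RBF(1.0), alpha=0.05)  # fixed alpha this round
    gpr.fit(t[train], y[train])
    resid = y[test] - gpr.predict(t[test])
    cs_all.append(cs[test])
    sq_err.append(resid ** 2)
cs_all = np.concatenate(cs_all)
sq_err = np.concatenate(sq_err)

# Fit the assumed form variance(CS) = a * exp(-b * CS) to the squared
# residuals (each squared residual is a noisy one-sample variance estimate).
def model(cs, a, b):
    return a * np.exp(-b * cs)

(a_hat, b_hat), _ = curve_fit(model, cs_all, sq_err, p0=(0.25, 0.03))
```

The learned (a_hat, b_hat) would then define the variances fed back into alpha, which is where the iterate-or-not question above comes in.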
In case it’s relevant, I’m using scikit-learn’s GaussianProcessRegressor for the actual prediction, but I’m open to pretty much anything.