*Bounty: 50*

I have created a predictive model that outputs a predictive density. I used 1000 rolling windows to estimate the model and predict one step ahead in each window. I collected the 1000 predictions and compared them to the actual realizations. I used several diagnostic tests, among them the Kolmogorov-Smirnov test, and saved the $p$-value of that test.
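
For concreteness, here is a minimal sketch of the kind of procedure I mean. It assumes the Kolmogorov-Smirnov test is applied to the probability integral transforms (PITs) of the realizations under the predictive densities. The simulated series, the window length of 250 and the Gaussian placeholder model are all hypothetical and stand in for my actual data and model.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical setup: i.i.d. standard normal data stands in for a real series.
window = 250        # length of each estimation window (assumed, not from my study)
n_windows = 1000    # number of rolling windows, as described above
y = rng.standard_normal(window + n_windows)

pit = np.empty(n_windows)
for t in range(n_windows):
    train = y[t : t + window]
    # Placeholder model: Gaussian predictive density fitted on the window.
    mu, sigma = train.mean(), train.std(ddof=1)
    # PIT: predictive CDF evaluated at the realization one step ahead.
    pit[t] = stats.norm.cdf(y[t + window], loc=mu, scale=sigma)

# If the predictive densities are correct, the PITs are Uniform[0, 1].
ks_stat, p_value = stats.kstest(pit, "uniform")
print(f"KS statistic = {ks_stat:.4f}, p-value = {p_value:.3f}")
```

Repeating this for each series gives one $p$-value per series.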

I did the same for multiple time series. Then I looked at all of the $p$-values from the different series. I found that they are `0.440, 0.579, 0.848, 0.476, 0.753, 0.955, 0.919, 0.498, 0.997`. At first I was quite happy that they are much larger than `0.010`, `0.050` or `0.100` (to use the standard cut-off values). But then a colleague of mine pointed out that the $p$-values should be distributed as $\text{Uniform}[0,1]$ under the null of a correct predictive distribution, and so I should perhaps not be so happy.
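
The colleague's point can be checked directly on the $p$-values listed above. The sketch below is one possible check, not a recommended procedure: it runs a second-level Kolmogorov-Smirnov test of the $p$-values against $\text{Uniform}[0,1]$ and computes the probability that the smallest of that many independent uniform draws is at least as large as the smallest observed value. Independence across series is an assumption that may well fail if the series are related.

```python
import numpy as np
from scipy import stats

# The p-values reported above, one per time series.
p_values = np.array([0.440, 0.579, 0.848, 0.476, 0.753,
                     0.955, 0.919, 0.498, 0.997])

# Second-level test: are the p-values themselves Uniform[0, 1]?
# Assumes the series (and hence the p-values) are independent.
res = stats.kstest(p_values, "uniform")
print(f"KS statistic = {res.statistic:.3f}, p-value = {res.pvalue:.3f}")

# Under uniformity, P(min of n draws >= m) = (1 - m)^n.
n = p_values.size
prob_min = (1.0 - p_values.min()) ** n
print(f"P(all {n} p-values >= {p_values.min():.3f}) = {prob_min:.4f}")
```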

On the one hand, the colleague must be right; the $p$-values should ideally be uniformly distributed. On the other hand, I have found that my model predicts "better" than the true model normally would; the discrepancy between the predicted density and the realized density is less than one would normally expect between the true density and the realized density. This could be an indication of overfitting if I were evaluating my model in-sample, but the model has been evaluated out of sample. What does this tell me? Should I be concerned with a diagnostic test’s $p$-values being too high?
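
To illustrate the in-sample case I have in mind, here is a small simulation sketch. It compares Kolmogorov-Smirnov $p$-values when the PITs are computed under the true density with those obtained when the density's parameters are estimated from the same sample that is then tested. The Gaussian data, the sample size of 100 and the 500 replications are arbitrary choices for illustration, and this is not my actual out-of-sample setup.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_obs, n_rep = 100, 500   # arbitrary illustrative sizes

p_true = np.empty(n_rep)
p_fitted = np.empty(n_rep)
for r in range(n_rep):
    x = rng.standard_normal(n_obs)

    # (a) PITs under the true N(0, 1) density.
    p_true[r] = stats.kstest(stats.norm.cdf(x), "uniform").pvalue

    # (b) PITs under a density fitted to the very same sample (in-sample).
    mu, sigma = x.mean(), x.std(ddof=1)
    pit_fitted = stats.norm.cdf(x, loc=mu, scale=sigma)
    p_fitted[r] = stats.kstest(pit_fitted, "uniform").pvalue

print(f"mean p-value, true density:      {p_true.mean():.3f}")
print(f"mean p-value, in-sample fitted:  {p_fitted.mean():.3f}")
```

In case (b) the fitted density adapts to the sample it is tested on, so the $p$-values pile up near 1 instead of being uniform, which is the pattern I would have attributed to overfitting.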

You could say this is just a small set of $p$-values (just 9 of them) so anything could happen, and you might be right. However, suppose I have a larger set of $p$-values that are closer to 1 than a uniform distribution would produce; is that a problem? What does that tell me?