*Bounty: 150*

Solomonoff’s universal prior over models is based on the algorithmic complexity of a computer program $p$ which executes that model. Where $l$ is the length of the program in bits, the prior is proportional to $2^{-l}$.

This works fine for deterministic models. We have an observation and we want to understand which model best explains it. If the program $p$ returns an output equal to the observation then the program is valid, otherwise it is invalid. We can apply Bayes' rule to get a posterior probability by considering all programs which returned the desired output.
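To make the deterministic case concrete, here is a toy sketch of the computation I have in mind. The "programs" are just bit strings, `run` is a stand-in interpreter (any deterministic mapping would do; this one is invented for illustration), each program of length $l$ gets prior weight $2^{-l}$, and the posterior renormalises over the programs whose output matches the observation:

```python
import itertools

def run(p):
    # Stand-in "interpreter": a program's output is its number of 1-bits.
    # Any deterministic program -> output mapping would do for this sketch.
    return p.count("1")

def posterior(observation, max_len=10):
    # Weight every matching program by the Solomonoff-style prior 2**(-l),
    # then renormalise to get a posterior over programs.
    weights = {}
    for l in range(1, max_len + 1):
        for bits in itertools.product("01", repeat=l):
            p = "".join(bits)
            if run(p) == observation:      # program is "valid"
                weights[p] = 2.0 ** (-l)
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}

post = posterior(observation=2, max_len=4)
# The shortest matching program ("11") dominates the posterior.
```

(This ignores the prefix-free requirement on programs and simply renormalises, which is enough for the point of the question.)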

I’m confused about how this is extended to the case where there is randomness in the observations we collect.

Take a simple example: Suppose we are considering a linear regression model $y=ax+b$. The computer program which executes this model has a length $l$ which is a function of $a$ and $b$ (bigger parameters require more bits to encode). The prior is $\pi(a,b)$.

But this program doesn’t include a random element. The regression model is $y=ax+b+\epsilon$, where $\epsilon$ is random noise with distribution $N(0,\sigma^2)$.

How should we consider a prior for the noise? A computer program can’t produce truly random noise, so this can’t be part of the program length. Should the prior for $\sigma^2$ be considered separately from the prior for the model? Or should the number of bits required to encode the numeric value $\sigma^2$ be included in the length of the program?
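To pin down what I'm asking, here is a sketch of one possible resolution (I'm not sure it's the right one): the program specifies a *distribution* over observations rather than a single output, so the posterior weight becomes prior $\times$ likelihood, and the open question is whether the bits for $\sigma$ are charged to the program length alongside $a$ and $b$. The `bits_for` encoding is a crude stand-in I made up, not a claim about the correct coding scheme:

```python
import math

def bits_for(x, resolution=0.1):
    # Crude stand-in encoding cost: bits to write x as a signed integer
    # multiple of `resolution`. Invented for illustration only.
    k = abs(round(x / resolution))
    return 1 + k.bit_length()   # sign bit + magnitude bits

def log_prior(a, b, sigma=None):
    # log of 2**(-l); the open question is whether sigma's bits count.
    l = bits_for(a) + bits_for(b)
    if sigma is not None:
        l += bits_for(sigma)
    return -l * math.log(2)

def log_likelihood(xs, ys, a, b, sigma):
    # Gaussian noise model: y = a*x + b + eps, eps ~ N(0, sigma^2)
    ll = 0.0
    for x, y in zip(xs, ys):
        r = y - (a * x + b)
        ll += -0.5 * math.log(2 * math.pi * sigma**2) - r**2 / (2 * sigma**2)
    return ll

def log_posterior(xs, ys, a, b, sigma, charge_sigma=True):
    lp = log_prior(a, b, sigma if charge_sigma else None)
    return lp + log_likelihood(xs, ys, a, b, sigma)

xs = [0.0, 1.0, 2.0]
ys = [0.1, 1.1, 1.9]
# A well-fitting (a, b) scores higher than a badly-fitting one.
good = log_posterior(xs, ys, a=1.0, b=0.0, sigma=0.2)
bad = log_posterior(xs, ys, a=5.0, b=0.0, sigma=0.2)
```

The `charge_sigma` flag marks exactly the choice I'm unsure about: whether $\sigma^2$ belongs in the program length or needs a separate prior.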

In addition to the variance of the noise, the underlying normal distribution would seem to have some degree of complexity associated with it. Should the bits required to describe a normal distribution be included in the model prior?