## #StackBounty: #self-study #bayesian #normal-distribution #conjugate-prior Correlated belief update: Is this understanding of Bayesian p…

### Bounty: 50

I am reading the paper Knowledge-Gradient Policy for Correlated Normal Beliefs for the ranking and selection problem. The idea is as follows: We have $$M$$ distinct alternatives, and samples from alternative $$i$$ are iid $$\mathcal{N}(\theta_i, \lambda_i)$$, where $$\theta_i$$ is unknown and $$\lambda_i$$ is the known variance. At every time step we can sample from only one of the alternatives, and at the end of $$N$$ time steps we pick the alternative that we believe has the highest mean reward.

More concretely using the notation and text in the paper,

Let $$\boldsymbol\theta = (\theta_1, \dots, \theta_M)'$$ be the column vector of unknown means. We initially express our belief about $$\boldsymbol\theta$$ as:

\begin{align} \boldsymbol\theta \sim \mathcal{N}(\mu^0, \Sigma^0) \label{1} \tag{1} \end{align}

Consider a sequence of $$N$$ sampling decisions, $$x^0, x^1, \ldots, x^{N-1}$$. The measurement decision
$$x^n$$ selects an alternative to sample at time $$n$$ from the set $$\{1, \ldots, M\}$$. The measurement error $$\varepsilon^{n+1} \sim \mathcal{N}(0, \lambda_{x^n})$$ is independent conditionally on $$x^n$$, and the resulting sample observation is $$\hat{y}^{n+1} = \theta_{x^n} + \varepsilon^{n+1}$$. Conditioned on $$\theta$$ and $$x^n$$, the sample has conditional distribution $$\hat{y}^{n+1} \sim \mathcal{N}(\theta_{x^n}, \lambda_{x^n})$$. Note that our assumption that the errors $$\varepsilon^1, \ldots, \varepsilon^N$$ are independent
differentiates our model from one that would be used for common random numbers. Instead,
we introduce correlation by allowing a non-diagonal covariance matrix $$\Sigma^0$$.

We may think of $$\theta$$ as having been chosen randomly at the initial time 0, unknown to
the experimenter but distributed according to the prior (1), and then fixed for the duration
of the sampling sequence. Through sampling, the experimenter is given the opportunity to
better learn what value $$\theta$$ has taken.

We define a filtration $$(\mathcal{F}^n)$$ wherein $$\mathcal{F}^n$$ is the sigma-algebra generated by the samples
observed by time $$n$$ and the identities of their originating alternatives. That is, $$\mathcal{F}^n$$ is the sigma-algebra generated by $$x^0, \hat{y}^1, x^1, \hat{y}^2, \ldots, x^{n-1}, \hat{y}^n$$. We write $$\mathbb{E}_n$$ to indicate $$\mathbb{E}\left[\cdot \mid \mathcal{F}^n\right]$$,
the conditional expectation taken with respect to $$\mathcal{F}^n$$, and then define $$\mu^n := \mathbb{E}_n[\theta]$$ and $$\Sigma^n := \operatorname{Cov}\left[\theta \mid \mathcal{F}^n\right]$$. Conditionally on $$\mathcal{F}^n$$, our posterior belief about $$\theta$$ is multivariate normal with mean vector $$\mu^n$$ and covariance matrix $$\Sigma^n$$.

We can obtain $$\mu^{n+1}$$ and $$\Sigma^{n+1}$$ as functions of $$\mu^n$$, $$\Sigma^n$$, $$\hat{y}^{n+1}$$, and $$x^n$$ as follows:

\begin{align} \mu^{n+1} &= \mu^n + \frac{\hat{y}^{n+1} - \mu^n_x}{\lambda_x + \Sigma^n_{xx}}\, \Sigma^n e_x, \tag{2} \label{2}\\ \Sigma^{n+1} &= \Sigma^n - \frac{\Sigma^n e_x e_x^{\prime} \Sigma^n}{\lambda_x + \Sigma^n_{xx}}, \tag{3} \label{3} \end{align}
where $$x = x^n$$ and $$e_x$$ is a column $$M$$-vector of 0s with a single 1 at index $$x$$.
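Equations (2)–(3) are just standard conditioning of the joint Gaussian $$(\theta, \hat{y}^{n+1})$$, since $$\operatorname{Cov}[\theta, \hat{y}^{n+1} \mid \mathcal{F}^n] = \Sigma^n e_x$$ and $$\operatorname{Var}[\hat{y}^{n+1} \mid \mathcal{F}^n] = \lambda_x + \Sigma^n_{xx}$$. A quick pure-Python Monte Carlo sketch of the $$n=0$$ step (all toy numbers my own choosing: $$M = 2$$, $$\mu^0 = (0,0)$$, unit prior variances, prior correlation 0.5, $$\lambda_x = 0.5$$):

```python
import math
import random

random.seed(0)

# Toy prior (my own numbers): M = 2, mu0 = (0, 0),
# Sigma0 = [[1, 0.5], [0.5, 1]], sample alternative x = 1,
# measurement-noise variance lambda_x = 0.5.
rho, lam, n_sims = 0.5, 0.5, 200_000

ys, th2s = [], []
for _ in range(n_sims):
    g1, g2 = random.gauss(0, 1), random.gauss(0, 1)
    theta1 = g1                                    # theta ~ N(mu0, Sigma0)
    theta2 = rho * g1 + math.sqrt(1 - rho ** 2) * g2
    y = theta1 + random.gauss(0, math.sqrt(lam))   # y^1 = theta_x + eps^1
    ys.append(y)
    th2s.append(theta2)

mean_y = sum(ys) / n_sims
var_y = sum((y - mean_y) ** 2 for y in ys) / n_sims
mean_th2 = sum(th2s) / n_sims
cov_y_th2 = sum(y * t for y, t in zip(ys, th2s)) / n_sims - mean_y * mean_th2

# E[y^1 | F^0] = E[theta_x | F^0] = mu^0_x = 0 here, because theta is
# averaged over the prior belief rather than held at one realisation.
print(mean_y)                # near 0
# The regression coefficient of theta_2 on y^1 reproduces the factor in (2):
# Sigma^0_{2x} / (lambda_x + Sigma^0_{xx}) = 0.5 / 1.5
print(cov_y_th2 / var_y)     # near 1/3
```

The first check is what the zero-mean statement amounts to once $$\theta$$ is averaged over the prior belief.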

My question:

The authors claim that in equation \ref{2}, $$\hat{y}^{n+1} - \mu^n_x$$ has zero mean when conditioned on $$\mathcal{F}^n$$; this claim seems wrong to me. My understanding is that $$\hat{y}^{n+1}$$ still follows $$\mathcal{N}(\theta^*_x, \lambda_x)$$, where $$\theta^*_x$$ is some fixed realisation sampled from $$\mathcal{N}(\mu^0_x, \Sigma^0_{xx})$$ at time 0, and this true $$\theta^*_x$$ need not equal $$\mu^n_x$$.

On the basis of this claim, the authors design an algorithm and prove some theoretical results. This is a widely cited paper, so I think I am missing something here with respect to the Bayesian setting and posterior distributions.

Get this bounty!!!

## #StackBounty: #hypothesis-testing #bayesian #estimation #inference Posterior calculation on binomial distribution using quadratic loss …

### Bounty: 50

Question:

Let $$x$$ be a binomial variate with parameters $$n$$ and $$p$$ $$(0 < p < 1)$$. Using a quadratic error loss function and a prior distribution of $$p$$ as $$\pi(p) = 1$$, obtain the Bayes estimate for $$p$$.

Lately I have been teaching myself the Bayes estimator (in relation to statistical inference). The likelihood is
$$f(x \mid p) = C^{n}_{x}\, p^x (1-p)^{n-x}$$

Since the prior distribution is 1, the joint distribution of $$x$$ and $$p$$ is

$$f(x, p) = C^{n}_{x}\, p^x (1-p)^{n-x}$$

Now the posterior distribution is directly proportional to the joint distribution of $$x$$ and $$p$$:

$$f(p \mid x) \propto C^{n}_{x}\, p^x (1-p)^{n-x}$$

$$f(p \mid x) \propto p^x (1-p)^{n-x}$$

$$f(p \mid x) \propto p^{x+1-1} (1-p)^{n-x+1-1}$$

so $$f(p \mid x) \sim \mathrm{Beta}(x+1,\, n-x+1)$$

As we know, the expected value of the posterior distribution is

$$E(p \mid x) = \frac{x+1}{n+2}$$
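Under quadratic (squared-error) loss the Bayes estimate is exactly this posterior mean, and the value can be double-checked numerically. A small pure-Python sketch with toy values $$n = 10$$, $$x = 3$$ (numbers of my own choosing), integrating the unnormalised posterior on a grid:

```python
# Numerical sanity check of the posterior mean (toy values of my own choosing).
n, x = 10, 3

def post_unnorm(p):
    # unnormalised posterior under the flat prior: p^x (1-p)^(n-x)
    return p ** x * (1 - p) ** (n - x)

K = 20_000  # trapezoidal rule on a grid over [0, 1]
pts = [i / K for i in range(K + 1)]
wts = [0.5 if i in (0, K) else 1.0 for i in range(K + 1)]
Z = sum(w * post_unnorm(p) for w, p in zip(wts, pts)) / K
post_mean = sum(w * p * post_unnorm(p) for w, p in zip(wts, pts)) / K / Z

print(post_mean)   # should match (x+1)/(n+2) = 4/12 to several decimals
```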

Now can someone help with calculating the Bayes risk under the quadratic loss function? I have no idea how to proceed.


## #StackBounty: #r #neural-networks #bayesian How do i calculate the gradients for the bayes-by-backprop algorithm in R?

### Bounty: 100

I’m trying to create a neural network in R based on the Bayes-by-backprop model as described in this paper. As inspiration for the code I’m using the python description on this website. However, in this paper they use an autograd function, and I’m trying to calculate the gradient manually. The important algorithm from the paper is this one:

I think I’ve got most of the code ready, but I’m getting stuck on calculating the partial derivatives needed to compute the gradients for mu (the mean) and rho (the parameter of the standard deviation).

This is the backpropagation function that I’m trying to write. I’m adapting it from a regular neural network that I created before. The parameters are:

• y = the ground truth
• Wn = a list; every entry represents a layer of the network, containing a sample of the probability distributions in that layer
• Un = a list similar to Wn: the mus (means) of the probability distributions in the posterior
• Rn = a list similar to Wn: the rhos (the parameter used to define the standard deviation)
• sigma = a list similar to Wn: the standard deviations
• eps = a list similar to Wn: the random variables used for the reparametrization trick
• ff = a list of the outputs of each feedforward layer; the last output is the model prediction.
``````
backpropagate <- function(y, Wn, Un, Rn, sigma, ff, eps, learn_rate = 0.01) {

y_hat <- ff[[3]]

loss <- list()
dUn <- list()   # the original initialised an unused dWn; dUn and dRn are what is assigned below
dRn <- list()

loss[[2]] <- loss_function(y_hat, y, Wn, Un, sigma)

dUn[[2]] <- ???
dRn[[2]] <- ???

loss[[1]] <- ???

dUn[[1]] <- ???
dRn[[1]] <- ???

Un[[1]] <- Un[[1]] - learn_rate * dUn[[1]]
Un[[2]] <- Un[[2]] - learn_rate * dUn[[2]]

Rn[[1]] <- Rn[[1]] - learn_rate * dRn[[1]]
Rn[[2]] <- Rn[[2]] - learn_rate * dRn[[2]]

return(list(Un, Rn))
}
``````

How do I calculate the gradients of mu and rho? I can take partial derivatives of polynomial and exponential functions, but mu and rho are defined here only as parameters of probability distributions. The loss function is defined (in log space) as $$posterior - prior - likelihood$$, but how do I take its derivative w.r.t. mu and rho?

$$posterior = \log(q(w \mid \theta)) = \sum_i \log(\mathcal{N}(w_i \mid \mu, \sigma^2))$$
$$prior = \log(P(w)) = \sum_i \log(\mathcal{N}(w_i \mid 0, 1))$$
$$likelihood = \log(P(D \mid w)) = y \cdot \log(\mathrm{softmax}(\hat{y}))$$
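For what it’s worth, the Gaussian terms do have closed-form partials, and the chain rule through the reparametrisation $$w = \mu + \log(1+e^\rho)\,\varepsilon$$ gives the total derivatives. This is my sketch of the rule stated in the Bayes-by-backprop paper, so double-check it against the paper. Writing $$f$$ for the loss:

```latex
% Total derivatives through w = mu + log(1 + e^rho) * eps:
\begin{align*}
\frac{df}{d\mu}  &= \frac{\partial f}{\partial w} + \frac{\partial f}{\partial \mu},
&
\frac{df}{d\rho} &= \frac{\partial f}{\partial w}\,\frac{\varepsilon}{1+e^{-\rho}} + \frac{\partial f}{\partial \rho},
\end{align*}
% since d/drho [log(1 + e^rho)] = e^rho / (1 + e^rho) = 1 / (1 + e^{-rho}).
% The Gaussian log-density partials needed for the posterior/prior terms:
\begin{align*}
\frac{\partial}{\partial \mu}\log \mathcal{N}(w \mid \mu, \sigma^2) &= \frac{w-\mu}{\sigma^2},
&
\frac{\partial}{\partial \sigma}\log \mathcal{N}(w \mid \mu, \sigma^2) &= \frac{(w-\mu)^2}{\sigma^3} - \frac{1}{\sigma}.
\end{align*}
```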

For reference:

My main training function looks like this:

``````
train <- function(x, y, neurons = 32, layers = 3, rate = 0.01, iterations = 10000) {

d <- ncol(x) + 1

Un <- list()
Rn <- list()

Un[[1]] <- matrix(rnorm(d * neurons), d, neurons)        # generate mus for the gaussian distribution of each neuron. matrix d (rows) x neurons (columns)
Un[[2]] <- matrix(rnorm(neurons + 1), neurons + 1, 1)

Rn[[1]] <- matrix(-3, d, neurons)        # generate the rhos to calculate the standard deviation later
Rn[[2]] <- matrix(-3, neurons + 1, 1)

for (i in 1:iterations) {

Wn <- list()
sigma <- list()
eps <- list()

sigma[[1]] <- log(1 + exp(Rn[[1]]))     # calculate the standard deviation from rho
sigma[[2]] <- log(1 + exp(Rn[[2]]))

eps[[1]] <- matrix(rnorm(d * neurons, 0, 0.1), nrow = d, ncol = neurons)     # generate a random number epsilon for every neuron
eps[[2]] <- matrix(rnorm(neurons + 1, 0, 0.1), neurons + 1, 1)

Wn[[1]] <- Un[[1]] + sigma[[1]] * eps[[1]]     # take one sample of the posterior
Wn[[2]] <- Un[[2]] + sigma[[2]] * eps[[2]]

ff <- feedforward(x, Wn)
backlist <- backpropagate(y, Wn, Un, Rn, sigma, ff, eps, learn_rate = rate)

Un <- backlist[[1]]
Rn <- backlist[[2]]
}
return(Wn)
}
``````

And then my feedforward plus helper functions:

``````
feedforward <- function(x, Wn) {

ff <- list(x)
Zn <- cbind(1, ff[[1]]) %*% Wn[[1]]
ff[[2]] <- sigmoid::relu(Zn)
Zn <- cbind(1, ff[[2]]) %*% Wn[[2]]
ff[[3]] <- Zn

return(ff)
}

log_softmax <- function(y_hat, y){
y * log(exp(y_hat) / sum(exp(y_hat)))
}

log_gaussian <- function(x, mu, sigma){
return(-0.5 * log(2*pi) - log(sigma) - (x - mu)**2 / (2 * sigma**2))
}

loss_function <- function(y_hat, y, Wn, Un, sigma){

# sum the layer-wise log-densities (the original used an undefined index i)
posterior <- sum(mapply(function(w, u, s) sum(log_gaussian(w, u, s)), Wn, Un, sigma))

prior <- sum(sapply(Wn, function(w) sum(log_gaussian(w, 0, 1))))

likelihood <- sum(log_softmax(y_hat, y))

# in log space the terms combine additively: log q(w|theta) - log P(w) - log P(D|w)
return(posterior - prior - likelihood)
}
``````


## #StackBounty: #regression #bayesian #normal-distribution Linear regression dependencies of the evidence

### Bounty: 50

In linear regression, in order to calculate the posterior distribution $$p(\mathbf{w} \mid \mathcal{D})$$ we apply Bayes’ rule using the likelihood, prior and evidence as follows:

$$p(\mathbf{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathbf{w})\,p(\mathbf{w})}{p(\mathcal{D})}$$

Then, by modelling the prior and likelihood with Gaussians with precisions $$\alpha, \beta$$, we can write it as:

$$p(\mathbf{w} \mid \mathcal{D}) = \frac{\mathcal{N}\left( \mathbf{w} \mid \mathbf{0}, \frac{1}{\alpha} I \right) \mathcal{N}\left( \mathbf{t} \mid \mathbf{\Phi}\mathbf{w}, \frac{1}{\beta} I\right) }{ \int \mathcal{N}\left( \mathbf{w} \mid \mathbf{0}, \frac{1}{\alpha} I \right) \mathcal{N}\left( \mathbf{t} \mid \mathbf{\Phi}\mathbf{w}, \frac{1}{\beta} I \right) d\mathbf{w} }$$

So more compactly we can write the following form:
$$p(\mathbf{w} \mid \mathcal{D}) = \frac{\mathcal{N}\left( \mathbf{w} \mid \mathbf{0}, \frac{1}{\alpha} I \right) \mathcal{N}\left( \mathbf{t} \mid \mathbf{\Phi}\mathbf{w}, \frac{1}{\beta} I\right) }{ p(\mathbf{t} \mid \mathbf{\Phi}, \alpha, \beta)}$$

My question is: why is the evidence conditioned on the precision parameters $$\alpha, \beta$$? Is that actually correct?


## #StackBounty: #self-study #bayesian #correlation #posterior #sequential-pattern-mining Sequential update of Posterior distribution of a…

### Bounty: 50

Consider the following problem of sequentially estimating (in a Bayesian setting) the means of two Gaussian distributions $$\mathcal{N}(\mu_1,\sigma_0^2)$$ and $$\mathcal{N}(\mu_2,\sigma_0^2)$$, which have the same known variance $$\sigma_0^2$$ and a known correlation $$\rho$$.

Using conjugate priors, at any given time $$t$$ we have prior distributions for the means: $$\mu_1 \sim \mathcal{N}(\theta_1(t),\sigma_1^2(t))$$ and $$\mu_2 \sim \mathcal{N}(\theta_2(t),\sigma_2^2(t))$$. We observe a unidimensional sample $$x(t)$$ from one of the two distributions and update the parameters of our priors as follows:

WLOG, assume that $$x(t)$$ comes from the first distribution, i.e. $$x(t) = x_1(t)$$. The update equations are given by
\begin{align} \theta_{1}(t+1)&=\frac{\sigma_{1}^{2}(t)\, x_{1}(t)+\sigma_{0}^{2}\, \theta_{1}(t)}{\sigma_{1}^{2}(t)+\sigma_{0}^{2}} \qquad \text{(1)}\\ \sigma_{1}^{2}(t+1)&=\frac{\sigma_{1}^{2}(t)\, \sigma_{0}^{2}}{\sigma_{1}^{2}(t)+\sigma_{0}^{2}} \qquad \text{(2)}\\ \theta_{2}(t+1)&=\theta_{2}(t)+ \underbrace{\rho\, \sigma_1(t)\, \sigma_2(t)}_{\rho\text{-term}}\, \frac{x_{1}(t)-\theta_{1}(t)}{\sigma_{1}^{2}(t)+\sigma_{0}^{2}} \qquad \text{(3)}\\ \sigma_{2}^{2}(t+1)&=\sigma_{2}^{2}(t)-\frac{\sigma_{1,2}^{2}}{\sigma_{1}^{2}(t)+\sigma_{0}^{2}} \qquad \text{(4)} \end{align}

The above update equations follow from a straightforward application of conjugate priors.

Here is my question: let $$S = \{1,2,2,1,2,1, \dots, 1\}$$ be an arbitrary binary sequence of length $$T$$ which determines the distribution from which we observe the sample. E.g. $$S(3) = 2$$ says that $$x(3) = x_2(3)$$, i.e. at the third time instance I get the sample from distribution 2. So, based on $$S$$, I get a sample from one of the distributions at each $$t$$. I want to determine whether the sequences of updates $$\theta_1(t)$$ and $$\theta_2(t)$$ converge to $$\mu_1$$ and $$\mu_2$$, respectively.

Here is what I know:

I made some simplifying assumptions such as $$\theta_1(0)=\theta_2(0)=0$$ and $$\sigma_1(0)=\sigma_2(0)=\sigma_0=1$$, and proceed ahead:

For $$\rho = 0$$:

If the correlation $$\rho$$ were zero, samples from one distribution would not affect the updates of the $$\theta$$ parameter of the other distribution, because the $$\rho$$-term in equation (3) would be zero. So we can repeat the process of sampling the unidimensional observations $$x(t)$$ long enough that we get enough samples from both distributions, and then say that both $$\theta_1(t)$$ and $$\theta_2(t)$$ asymptotically converge to $$\mu_1$$ and $$\mu_2$$, respectively. We can also comment on the rates of convergence using the SLLN, etc.
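For concreteness, the $$\rho = 0$$ behaviour is easy to confirm by simulation. A pure-Python sketch iterating updates (1)–(2), with toy true means and an alternating sequence $$S$$ (all numbers my own choosing):

```python
import math
import random

random.seed(1)

# Toy check of the rho = 0 case (my own numbers): true means mu1 = 2, mu2 = -1,
# sigma0 = 1, priors theta_i(0) = 0 with sigma_i^2(0) = 1, and S = 1,2,1,2,...
mu = [2.0, -1.0]
theta = [0.0, 0.0]
var = [1.0, 1.0]      # sigma_i^2(t)
s0sq = 1.0            # sigma_0^2

for t in range(40_000):
    i = t % 2                                 # S(t): alternate the two distributions
    x = random.gauss(mu[i], math.sqrt(s0sq))
    # updates (1) and (2); with rho = 0 the other component is left untouched
    theta[i] = (var[i] * x + s0sq * theta[i]) / (var[i] + s0sq)
    var[i] = var[i] * s0sq / (var[i] + s0sq)

print(theta)   # each theta_i drifts toward the corresponding true mean mu_i
```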

For $$\rho \ne 0$$:

However, I am not sure any such asymptotic convergence results exist for $$\rho \ne 0$$. My intuition is that using the additional information of the correlation should give tighter bounds/faster convergence. However, I am clueless in this case, because both $$\theta_1$$ and $$\theta_2$$ get updated at every $$t$$. Any pointers to prior literature and your thoughts are welcome.

## #StackBounty: #bayesian #kernel-smoothing Bayesian updating of nonparametric estimate of distribution

### Bounty: 50

Is there any way to perform "Bayesian updating" of a nonparametric estimate of some distribution (say, a kernel density estimate) in light of a new set of observed values?


## #StackBounty: #bayesian #outliers #gaussian-mixture-distribution #gibbs #state-space-models Applying outlier adjustment using student’s…

### Bounty: 50

I’m exploring outlier adjustment in a state-space model using Student’s $$t$$ distribution. The gist of the problem is formulated as follows:

\begin{align*} y_t^* &= u_t + o_t - o_{t-1}, \\ u_t &= \rho_1 u_{t-1} + \rho_2 u_{t-2} + \eta_{t}, \\ o_t &\sim t_\nu (\sigma_o), \end{align*}
where $$y_t^*$$ is a time series of observables, modeled as the sum of $$u_t$$ and the one-period change in $$o_t$$, where $$o_t$$ is the additive outlier component. We let $$u_t$$ follow an $$AR(2)$$ process, and $$o_t$$ follows an i.i.d. $$t$$ distribution.

After consulting some literature, I decided to model the $$t$$-distributed variables using scale mixtures; i.e., the measurement equation above is equivalent to
\begin{align*} y_t^* & = u_t + \sqrt{\varphi_t}\, z_t - \sqrt{\varphi_{t-1}}\, z_{t-1}, \qquad z_t \sim N(0, \sigma_o^2) \text{ and } \varphi_t \sim IG. \end{align*}
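As a sanity check on the scale-mixture representation (not from the post itself): if $$1/\varphi_t \sim \mathrm{Gamma}(\nu/2, \text{rate}=\nu/2)$$, i.e. $$\varphi_t \sim IG(\nu/2, \nu/2)$$, then $$\sqrt{\varphi_t}\, z_t$$ with $$z_t \sim N(0, \sigma_o^2)$$ is $$t_\nu(\sigma_o)$$. A pure-Python sketch with toy values $$\nu = 10$$, $$\sigma_o = 2$$ of my own choosing, checking the implied variance $$\sigma_o^2\, \nu/(\nu-2)$$:

```python
import math
import random

random.seed(7)

nu, sigma_o, n = 10.0, 2.0, 200_000   # toy values of my own choosing

draws = []
for _ in range(n):
    # 1/phi ~ Gamma(shape = nu/2, scale = 2/nu)  <=>  phi ~ IG(nu/2, nu/2)
    phi = 1.0 / random.gammavariate(nu / 2.0, 2.0 / nu)
    z = random.gauss(0.0, sigma_o)            # z ~ N(0, sigma_o^2)
    draws.append(math.sqrt(phi) * z)          # o = sqrt(phi) * z  ~  t_nu(sigma_o)

m = sum(draws) / n
v = sum((d - m) ** 2 for d in draws) / n
print(v)   # near sigma_o^2 * nu / (nu - 2) = 4 * 10 / 8 = 5
```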

I then cast the above system in state-space format, with the following measurement equation:
$$y_t^* = \begin{pmatrix} \sqrt{\varphi_t} & -\sqrt{\varphi_{t-1}} & 1 & 0 \end{pmatrix} \begin{pmatrix} z_t \\ z_{t-1} \\ u_t \\ u_{t-1} \end{pmatrix},$$ and the following transition system
$$\begin{pmatrix} z_t \\ z_{t-1} \\ u_t \\ u_{t-1} \end{pmatrix} = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & \rho_1 & \rho_2 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} z_{t-1} \\ z_{t-2} \\ u_{t-1} \\ u_{t-2} \end{pmatrix} + \begin{pmatrix} \eta_z \\ 0 \\ \eta_u \\ 0 \end{pmatrix},$$
where $$\eta_z \sim N(0, \sigma_o^2)$$.

To draw the outlier component and the hyperparameters, I thought the following Gibbs sampling procedure might be appropriate:

1. Conditional on $$\varphi_t^{j-1}$$ and $$\sigma_o^{j-1}$$, use the simulation smoother to draw $$z_t$$ and $$u_t$$; then it’s trivial to compute $$o_t^j = \sqrt{\varphi_t}\, z_t$$;
2. Given $$o_t^j$$ and conditional on $$\varphi_t^{j-1}$$, draw $$\sigma_o^j$$.
3. Conditional on $$\sigma_o^j$$ and $$\nu^{j-1}$$, draw $$\varphi_t^j$$.
4. Draw $$\nu^j$$.
5. Increment $$j$$ by 1.

I’m somewhat new to Gibbs sampling and non-Gaussian state-space models. I’m wondering whether there are obvious theoretical issues with this setup.
