#StackBounty: #bayesian #monte-carlo #ab-test #bayes-factors Bayesian A/B testing and decision metrics

Bounty: 50

Say I need to test two different product features ({existing/control: blue} vs {new/treatment: red} font on a webpage, for example), and need to boil my analysis down to a single go/no-go criterion for launching the new feature.

I’m oversimplifying, but a frequentist would set up an experiment and use p-values to decide if the new feature was significantly better than the control (and whether the feature should be launched).

Bayesian analysts, by contrast, can generate full posteriors over the efficacy of the control and treatment features. But it seems there’s some debate about where to go from here. I’ve read that the Bayes Factor can produce a single number whose value can determine whether to launch the feature. However, I’ve also read that Andrew Gelman isn’t a huge enthusiast of the Bayes Factor, as it’s reductionist in nature.

(In this product feature context, assume a Beta-Binomial conjugate prior, as the math is fairly friendly and the posterior is tractable without the need for MCMC methods.)

In a video I watched, the author used Bayesian inference to estimate the control and treatment efficacy parameters, then used Monte Carlo simulation to estimate the overlap between the two posterior distributions. He claimed that this number was analogous to a p-value and could be used in conjunction with an alpha (say 0.05) to decide whether to launch the new feature.
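For concreteness, here is a minimal sketch of that Monte Carlo step under the Beta-Binomial model assumed above; the conversion counts and the Beta(1, 1) priors are made up purely for illustration.

set.seed(42)
n_draws <- 1e5

# hypothetical data: control (blue) 120/1000 conversions, treatment (red) 140/1000
p_blue <- rbeta(n_draws, 1 + 120, 1 + 880)   # posterior Beta(121, 881)
p_red  <- rbeta(n_draws, 1 + 140, 1 + 860)   # posterior Beta(141, 861)

prob_red_better <- mean(p_red > p_blue)      # posterior P(treatment > control)
launch <- prob_red_better > 0.95             # e.g. launch if this exceeds 1 - alpha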

I’ve also come across terms like BIC, AIC, and WAIC; I’m not sure if these are improvements upon the Bayes Factor or totally different metrics.

At any rate, I’d like to solicit the community’s recommendations on launch/no-launch decisions using Bayesian inference: What are your thoughts?


Get this bounty!!!

#StackBounty: #hypothesis-testing #bayesian Hypothesis testing via separate inference for each group and then combining

Bounty: 50

Suppose there are two groups, A and B, and we are interested in inferring a certain parameter for each one and also the difference between the two parameters. Here we can take a Bayesian perspective and strive for a posterior distribution in each case. I am wondering if the following is a sound way of doing this:

  1. estimate the posterior for group A,
  2. estimate the posterior for group B, and
  3. estimate the posterior of the difference by sampling extensively from the first two posteriors and taking the difference.

I am specifically unsure about this kind of divide-and-conquer approach where each group is treated separately and the results are then combined. Usually, this is done in one step where, perhaps, a linear model is fitted with an indicator for group membership.

Let me give a simple example. Say the outcome is binary. One can then use a Bernoulli–beta model to infer the posterior of the success probability, which will be a beta distribution for each group. As the last step, one can sample from the two betas and get a posterior for the difference.
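A minimal sketch of that last step, assuming Beta(1, 1) priors and made-up counts for the two groups:

set.seed(1)
n_draws <- 1e5

# hypothetical data: group A 40/100 successes, group B 55/100 successes
p_A <- rbeta(n_draws, 1 + 40, 1 + 60)     # posterior Beta(41, 61)
p_B <- rbeta(n_draws, 1 + 55, 1 + 45)     # posterior Beta(56, 46)

delta <- p_B - p_A                        # draws from the posterior of the difference
quantile(delta, c(0.025, 0.5, 0.975))     # posterior median and 95% credible interval
mean(delta > 0)                           # posterior probability that B exceeds A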


Get this bounty!!!

#StackBounty: #self-study #bayesian #normal-distribution #conjugate-prior Correlated belief update: Is this understanding of Bayesian posterior wrong

Bounty: 50

I am reading the paper Knowledge-Gradient Policy for Correlated Normal Beliefs for the ranking and selection problem. The idea is as follows: We have $M$ distinct alternatives, and samples from alternative $i$ are iid $\mathcal{N}(\theta_i,\lambda_i)$, where $\theta_i$ is unknown and $\lambda_i$ is the known variance. At every time step, we can sample from only one of the arms, and at the end of $N$ timesteps we pick the alternative that we think has the highest mean reward.

More concretely, using the notation and text in the paper:

Let $\boldsymbol{\theta} = (\theta_1, \dots, \theta_M)'$ be a column vector of the unknown means. We initially assume our belief about $\boldsymbol{\theta}$ is:

\begin{align}
\boldsymbol{\theta} \sim \mathcal{N}(\mu^0,\Sigma^0) \label{1} \tag{1}
\end{align}

Consider a sequence of $N$ sampling decisions, $x^{0}, x^{1}, \ldots, x^{N-1}$. The measurement decision $x^{n}$ selects an alternative to sample at time $n$ from the set $\{1, \ldots, M\}$. The measurement error $\varepsilon^{n+1} \sim \mathcal{N}\left(0, \lambda_{x^{n}}\right)$ is independent conditionally on $x^{n}$, and the resulting sample observation is $\hat{y}^{n+1}=\theta_{x^{n}}+\varepsilon^{n+1}$. Conditioned on $\theta$ and $x^{n}$, the sample has conditional distribution $\hat{y}^{n+1} \sim \mathcal{N}\left(\theta_{x^{n}}, \lambda_{x^{n}}\right)$. Note that our assumption that the errors $\varepsilon^{1}, \ldots, \varepsilon^{N}$ are independent differentiates our model from one that would be used for common random numbers. Instead, we introduce correlation by allowing a non-diagonal covariance matrix $\Sigma^{0}$.

We may think of $\theta$ as having been chosen randomly at the initial time 0, unknown to the experimenter but according to the prior distribution (1), and then fixed for the duration of the sampling sequence. Through sampling, the experimenter is given the opportunity to better learn what value $\theta$ has taken.

We define a filtration $\left(\mathcal{F}^{n}\right)$ wherein $\mathcal{F}^{n}$ is the sigma-algebra generated by the samples observed by time $n$ and the identities of their originating alternatives. That is, $\mathcal{F}^{n}$ is the sigma-algebra generated by $x^{0}, \hat{y}^{1}, x^{1}, \hat{y}^{2}, \ldots, x^{n-1}, \hat{y}^{n}$. We write $\mathbb{E}_{n}$ to indicate $\mathbb{E}\left[\cdot \mid \mathcal{F}^{n}\right]$, the conditional expectation taken with respect to $\mathcal{F}^{n}$, and then define $\mu^{n}:=\mathbb{E}_{n}[\theta]$ and $\Sigma^{n}:=\operatorname{Cov}\left[\theta \mid \mathcal{F}^{n}\right]$. Conditionally on $\mathcal{F}^{n}$, our posterior belief for $\theta$ is multivariate normal with mean vector $\mu^{n}$ and covariance matrix $\Sigma^{n}$.

We can obtain the updates of $\mu^{n+1}$ and $\Sigma^{n+1}$ as functions of $\mu^{n}$, $\Sigma^{n}$, $\hat{y}^{n+1}$, and $x^{n}$ (written below simply as $x$) as follows:

\begin{align}
\mu^{n+1} &= \mu^{n}+\frac{\hat{y}^{n+1}-\mu_{x}^{n}}{\lambda_{x}+\Sigma_{x x}^{n}}\, \Sigma^{n} e_{x} \tag{2} \label{2}\\
\Sigma^{n+1} &= \Sigma^{n}-\frac{\Sigma^{n} e_{x} e_{x}^{\prime} \Sigma^{n}}{\lambda_{x}+\Sigma_{x x}^{n}} \tag{3} \label{3}
\end{align}

where $e_{x}$ is a column $M$-vector of 0s with a single 1 at index $x$.
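To make equations (2) and (3) concrete, here is a minimal R sketch of a single update; the function name kg_update and all numbers in the example are made up for illustration.

kg_update <- function(mu, Sigma, lambda, x, y_obs) {
  M  <- length(mu)
  ex <- rep(0, M); ex[x] <- 1                   # indicator vector e_x
  denom <- lambda[x] + Sigma[x, x]              # lambda_x + Sigma_xx^n
  mu_new    <- mu + as.vector((y_obs - mu[x]) / denom * (Sigma %*% ex))
  Sigma_new <- Sigma - (Sigma %*% ex %*% t(ex) %*% Sigma) / denom
  list(mu = mu_new, Sigma = Sigma_new)
}

# example: prior belief over M = 3 alternatives, one observation from alternative 2
mu0    <- c(0, 0, 0)
Sigma0 <- matrix(c(1.0, 0.5, 0.2,
                   0.5, 1.0, 0.3,
                   0.2, 0.3, 1.0), nrow = 3)
lambda <- c(1, 1, 1)
kg_update(mu0, Sigma0, lambda, x = 2, y_obs = 0.7)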

My question:

The authors claim that in equation (2), $\hat{y}^{n+1}-\mu_{x}^{n}$, when conditioned on $\mathcal{F}^n$, has zero mean; this claim seems wrong to me. My understanding is that $\hat{y}^{n+1}$ still follows $\mathcal{N}(\theta^*_x,\lambda_x)$, where $\theta^*_x$ is some realisation sampled from $\mathcal{N}(\mu^0_x,\Sigma^0_{xx})$, and this true $\theta^{*}_x$ need not be the same as $\mu^n_{x}$.

On the basis of this claim, the authors design an algorithm and prove some theoretical results. This is a widely cited paper, so I think I am missing something here with respect to the Bayesian setting and posterior distributions.


Get this bounty!!!

#StackBounty: #hypothesis-testing #bayesian #estimation #inference Posterior calculation on binomial distribution using quadratic loss …

Bounty: 50

Question:

Let $x$ be a binomial variate with parameters $n$ and $p$ ($0<p<1$). Using a quadratic error loss function and a prior distribution of $p$ given by $\pi(p) = 1$, obtain the Bayes estimate for $p$.

Lately I have been teaching myself Bayes estimators (in relation to statistical inference). Here is my attempt. The likelihood is

$f(x \mid p) = \binom{n}{x} p^x (1-p)^{n-x}$

Since the prior distribution is 1, the joint distribution of $x$ and $p$ is

$f(x,p) = \binom{n}{x} p^x (1-p)^{n-x}$

Now the posterior distribution is directly proportional to the joint distribution of $x$ and $p$:

$f(p \mid x) \propto \binom{n}{x} p^x (1-p)^{n-x}$

$f(p \mid x) \propto p^x (1-p)^{n-x}$

$f(p \mid x) \propto p^{x+1-1} (1-p)^{n-x+1-1}$

so $p \mid x \sim \mathrm{Beta}(x+1,\ n-x+1)$.

As we know, the expected value of the posterior distribution is

$E(p \mid x) = \dfrac{x+1}{n+2}$
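For context, the standard argument for why the quadratic loss leads to the posterior mean is that the Bayes estimate minimises the posterior expected loss:

$$\hat{p} = \arg\min_{a}\, E\left[(p-a)^2 \mid x\right], \qquad \frac{d}{da}\, E\left[(p-a)^2 \mid x\right] = 2a - 2\,E[p \mid x] = 0 \;\Rightarrow\; \hat{p} = E[p \mid x] = \frac{x+1}{n+2}.$$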

Now, can someone help with calculating the Bayes risk using the quadratic loss function, because I have no idea how to proceed.


Get this bounty!!!

#StackBounty: #r #neural-networks #bayesian How do i calculate the gradients for the bayes-by-backprop algorithm in R?

Bounty: 100

I’m trying to create a neural network in R based on the Bayes-by-backprop model as described in this paper. As inspiration for the code I’m using the Python description on this website. However, in this paper they use an autograd function, and I’m trying to calculate the gradients manually. The important algorithm from the paper is this one:

[Image: the Bayes-by-backprop algorithm from the paper]
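Paraphrasing that algorithm from memory (please check the paper for the exact statement): sample $\varepsilon \sim \mathcal{N}(0, I)$, set $w = \mu + \log(1+\exp(\rho)) \circ \varepsilon$, write the per-sample objective as $f(w, \theta) = \log q(w \mid \theta) - \log P(w) - \log P(\mathcal{D} \mid w)$, and then compute

$$\Delta_\mu = \frac{\partial f}{\partial w} + \frac{\partial f}{\partial \mu}, \qquad \Delta_\rho = \frac{\partial f}{\partial w}\,\frac{\varepsilon}{1+\exp(-\rho)} + \frac{\partial f}{\partial \rho},$$

followed by the gradient steps $\mu \leftarrow \mu - \alpha \Delta_\mu$ and $\rho \leftarrow \rho - \alpha \Delta_\rho$.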

I think I’ve got most of the code ready, but I’m getting stuck on calculating the partial derivatives needed to compute the gradients for mu (the mean) and rho (the parameter of the standard deviation).

This is the backpropagation function that I’m trying to make. I’m trying to adapt it from a regular neural network that I created before. The parameters are:

  • y = ground truth
  • Wn = list, every entry represents a layer of the network, containing a sample of the probability distributions in that layer
  • Un = list similar to Wn, the mus (mean) of the probability distributions in the posterior
  • Rn = list similar to Wn, the rhos (parameter used to define the standard deviation)
  • sigma = list similar to Wn, the standard deviation
  • eps = list similar to Wn, the random variable that is used for the reparametrization trick
  • ff = a list of the outputs of each feedforward layer, the last output is the model prediction.

backpropagate <- function(y, Wn, Un, Rn, sigma, ff, eps, learn_rate = 0.01) {
  
  y_hat <- ff[[3]]
  
  loss <- list()
  dUn <- list()   # gradients of the loss w.r.t. the mus
  dRn <- list()   # gradients of the loss w.r.t. the rhos
  
  loss[[2]] <- loss_function(y_hat, y, Wn, Un, sigma)
  
  dUn[[2]] <- ???
  dRn[[2]] <- ???
  
  loss[[1]] <- ???
  
  dUn[[1]] <- ???
  dRn[[1]] <- ???
    
  Un[[1]] <- Un[[1]] - learn_rate * dUn[[1]]
  Un[[2]] <- Un[[2]] - learn_rate * dUn[[2]]
  
  Rn[[1]] <- Rn[[1]] - learn_rate * dRn[[1]]
  Rn[[2]] <- Rn[[2]] - learn_rate * dRn[[2]]
  
  return(list(Un, Rn))
}

How do I calculate the gradients of mu and rho? I can do partial derivatives of polynomial and exponential functions, but mu and rho are defined here only as parameters of probability distributions. The loss function is defined as $$\text{posterior} - \text{prior} - \text{likelihood}$$ but how do I take the derivative w.r.t. mu and rho of those?

$$\text{posterior} = \log(q(w \mid \theta)) = \sum_i \log(\mathcal{N}(w_i \mid \mu, \sigma^2))$$
$$\text{prior} = \log(P(w)) = \sum_i \log(\mathcal{N}(w_i \mid 0, 1))$$
$$\text{likelihood} = \log(P(D \mid w)) = y \cdot \log(\mathrm{softmax}(\hat{y}))$$

For reference:

My main training function looks like this:

train <- function(x, y, neurons = 32, layers = 3, rate = 0.01, iterations = 10000) {
  
  d <- ncol(x) + 1
  
  Un <- list()
  Rn <- list()
  
  Un[[1]] <- matrix(rnorm(d * neurons), d, neurons)        # generate mus for the gaussian distribution of each neuron. matrix d (rows) x neurons (columns)
  Un[[2]] <- matrix(rnorm(neurons + 1), neurons + 1, 1)
  
  Rn[[1]] <- matrix(-3, d, neurons)        # generate the rhos to calculate the standard deviation later
  Rn[[2]] <- matrix(-3, neurons + 1, 1)
  
  for (i in 1:iterations) {
    
    Wn <- list()
    sigma <- list()
    eps <- list()
    
    sigma[[1]] <- log(1 + exp(Rn[[1]]))     # calculate the standard deviation from rho
    sigma[[2]] <- log(1 + exp(Rn[[2]]))
    
    eps[[1]] <- matrix(rnorm(d * neurons, 0, 0.1), nrow = d, ncol = neurons)     # generate a random number epsilon for every neuron
    eps[[2]] <- matrix(rnorm(neurons + 1, 0, 0.1), neurons + 1, 1)
    
    Wn[[1]] <- Un[[1]] + sigma[[1]] * eps[[1]]     # take one sample of the posterior
    Wn[[2]] <- Un[[2]] + sigma[[2]] * eps[[2]]

    ff <- feedforward(x, Wn)
    backlist <- backpropagate(y, Wn, Un, Rn, sigma, ff, eps, learn_rate = rate)
    
    Un <- backlist[[1]]
    Rn <- backlist[[2]]
  }
  return(Wn)
}

And then my feedforward plus helper functions:

feedforward <- function(x, Wn) {
  
  ff <- list(x)
  Zn <- cbind(1, ff[[1]]) %*% Wn[[1]]
  ff[[2]] <- sigmoid::relu(Zn)
  Zn <- cbind(1, ff[[2]]) %*% Wn[[2]]
  ff[[3]] <- Zn

  return(ff)
}

log_softmax <- function(y_hat, y){
  y * log(exp(y_hat) / sum(exp(y_hat)))
}

log_gaussian <- function(x, mu, sigma){
  return(-0.5 * log(2*pi) - log(sigma) - (x - mu)**2 / (2 * sigma**2))
}

loss_function <- function(y_hat, y, Wn, Un, sigma){
  
  # variational posterior log-density log q(w | theta), summed over all layers
  posterior <- sum(sapply(seq_along(Wn), function(i)
    sum(log_gaussian(Wn[[i]], Un[[i]], sigma[[i]]))))
  
  # prior log-density log P(w) under a standard normal, summed over all layers
  prior <- sum(sapply(seq_along(Wn), function(i)
    sum(log_gaussian(Wn[[i]], 0, 1))))
  
  # log-likelihood log P(D | w)
  likelihood <- sum(log_softmax(y_hat, y))
  
  # variational free energy: log q(w|theta) - log P(w) - log P(D|w)
  return(posterior - prior - likelihood)
}


Get this bounty!!!

#StackBounty: #regression #bayesian #normal-distribution Linear regression dependencies of the evidence

Bounty: 50

In linear regression, in order to calculate the posterior distribution $p(\mathbf{w} \mid \mathcal{D})$ we need to apply Bayes’ rule using the likelihood, prior and evidence as follows:

$$p(\mathbf{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathbf{w})\,p(\mathbf{w})}{p(\mathcal{D})}$$

Then, by modelling the prior and likelihood with Gaussians with precisions $\alpha$ and $\beta$, we can write it as:

$$ p(\mathbf{w} \mid \mathcal{D}) = \frac{\mathcal{N}\left( \mathbf{w} \mid \mathbf{0}, \frac{1}{\alpha} I \right) \mathcal{N}\left( \mathbf{t} \mid \mathbf{\Phi}\mathbf{w}, \frac{1}{\beta} I\right) }{ \int \mathcal{N}\left( \mathbf{w} \mid \mathbf{0}, \frac{1}{\alpha} I \right) \mathcal{N}\left( \mathbf{t} \mid \mathbf{\Phi}\mathbf{w}, \frac{1}{\beta} I \right) d\mathbf{w} } $$

So more compactly we can write the following form:
$$ p(\mathbf{w} \mid \mathcal{D}) = \frac{\mathcal{N}\left( \mathbf{w} \mid \mathbf{0}, \frac{1}{\alpha} I \right) \mathcal{N}\left( \mathbf{t} \mid \mathbf{\Phi}\mathbf{w}, \frac{1}{\beta} I\right) }{ p(\mathbf{t} \mid \mathbf{\Phi}, \alpha, \beta)} $$
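As a side note on that denominator, here is a small numerical sketch using the standard result that the marginal of $\mathbf{t}$ is Gaussian with covariance $\frac{1}{\beta} I + \frac{1}{\alpha}\mathbf{\Phi}\mathbf{\Phi}^\top$; the design matrix, targets, $\alpha$ and $\beta$ below are made-up values.

library(mvtnorm)

set.seed(0)
Phi   <- cbind(1, seq(0, 1, length.out = 10))    # made-up 10 x 2 design matrix
t_vec <- rnorm(10)                               # made-up target vector
alpha <- 2
beta  <- 25

# p(t | Phi, alpha, beta) = N(t | 0, (1/beta) I + (1/alpha) Phi Phi')
S <- diag(10) / beta + Phi %*% t(Phi) / alpha
log_evidence <- dmvnorm(t_vec, mean = rep(0, 10), sigma = S, log = TRUE)
log_evidence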

My question is why the evidence is conditioned on the precision parameters $\alpha, \beta$. Is that actually correct?


Get this bounty!!!

#StackBounty: #self-study #bayesian #correlation #posterior #sequential-pattern-mining Sequential update of Posterior distribution of a…

Bounty: 50

Consider the following problem of sequentially estimating (in a Bayesian setting) the means of the two Gaussian distributions $\mathcal{N}(\mu_1,\sigma_0^2)$ and $\mathcal{N}(\mu_2,\sigma_0^2)$ which have the same known variance $\sigma_0^2$ and have a known correlation $\rho$.

Using conjugate priors, at any given time $t$, we have prior probabilities for the means as $\mu_1 \sim \mathcal{N}(\theta_1(t),\sigma_1^2(t))$, $\mu_2 \sim \mathcal{N}(\theta_2(t),\sigma_2^2(t))$. We observe a unidimensional sample $x(t)$ from one of the two distributions and we update the parameters of our prior probabilities as follows:


WLOG, assume that $x(t)$ comes from the first distribution, i.e. $x(t) = x_1(t)$. The update equations are given by
\begin{align}
\theta_{1}(t+1)&=\frac{\sigma_{1}^{2}(t) x_{1}(t)+\sigma_{0}^{2} \theta_{1}(t)}{\sigma_{1}^{2}(t)+\sigma_{0}^{2}} \qquad \text{(1)}\\
\sigma_{1}^{2}(t+1)&=\frac{\sigma_{1}^{2}(t) \sigma_{0}^{2}}{\sigma_{1}^{2}(t)+\sigma_{0}^{2}} \qquad \text{(2)}\\
\theta_{2}(t+1)&=\theta_{2}(t)+ \underbrace{\rho\, \sigma_1(t)\, \sigma_2(t)}_{\rho\text{-term}}\, \frac{x_{1}(t)-\theta_{1}(t)}{\sigma_{1}^{2}(t)+\sigma_{0}^{2}} \qquad \text{(3)}\\
\sigma_{2}^{2}(t+1)&=\sigma_{2}^{2}(t)-\frac{\sigma_{1,2}^{2}}{\sigma_{1}^{2}(t)+\sigma_{0}^{2}} \qquad \text{(4)}
\end{align}

The above update equations follow from straightforward application of conjugate priors.
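For concreteness, here is a minimal R sketch of one update with a sample from distribution 1 (a sample from distribution 2 just swaps the roles of the indices). It assumes $\sigma_{1,2} = \rho\,\sigma_1(t)\,\sigma_2(t)$, which is how I read the $\sigma_{1,2}^{2}$ in equation (4); all numbers are made up.

update_step <- function(theta, s2, x1, sigma0_sq, rho) {
  denom <- s2[1] + sigma0_sq
  cov12 <- rho * sqrt(s2[1]) * sqrt(s2[2])         # assumed sigma_{1,2}
  theta_new <- c(
    (s2[1] * x1 + sigma0_sq * theta[1]) / denom,   # eq. (1)
    theta[2] + cov12 * (x1 - theta[1]) / denom     # eq. (3)
  )
  s2_new <- c(
    s2[1] * sigma0_sq / denom,                     # eq. (2)
    s2[2] - cov12^2 / denom                        # eq. (4)
  )
  list(theta = theta_new, s2 = s2_new)
}

# one step with theta(0) = (0, 0), sigma_1^2(0) = sigma_2^2(0) = sigma_0^2 = 1
update_step(theta = c(0, 0), s2 = c(1, 1), x1 = 0.8, sigma0_sq = 1, rho = 0.5)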


Here is my question – Let $S = \{1,2,2,1,2,1, \dots, 1\}$ be an arbitrary binary sequence of length $T$ which determines the distribution from which we observe the sample. E.g. $S(3) = 2$ says that $x(3) = x_2(3)$, i.e. at the third time instance I get the sample from distribution 2. So, based on $S$, I get a sample from one of the distributions at each $t$. I want to determine whether the sequences of updates $\theta_1(t)$ and $\theta_2(t)$ converge to $\mu_1$ and $\mu_2$, respectively.

Here is what I know:

I made some simplifying assumptions such as $\theta_1(0)=\theta_2(0)=0,\; \sigma_1(0)=\sigma_2(0)=\sigma_0=1$ and proceeded as follows:

For $\rho = 0$:

If the correlation $\rho$ were zero, samples from one distribution would not affect the updates to the $\theta$ parameter of the other distribution, because the $\rho$-term in equation (3) would be zero. So we can repeat the process of sampling the unidimensional observations $x(t)$ enough times that we get sufficiently many samples from both distributions, and then say that both $\theta_1(t)$ and $\theta_2(t)$ asymptotically converge to $\mu_1$ and $\mu_2$, respectively. We can also comment on the rates of convergence using the SLLN, etc.

For $\rho \ne 0$:

However, I am not sure whether any such asymptotic convergence results exist for $\rho \ne 0$. My intuition is that using the additional information of the correlation should help give tighter bounds/faster convergence. However, I am clueless in this case, because both $\theta_1$ and $\theta_2$ get updated at every $t$. Any pointers to prior literature and your thoughts are welcome.


Get this bounty!!!
