#StackBounty: #bayesian #graphical-model How many parameters does the following Bayesian Network (Graphical model) contain?

Bounty: 50

[Image: the Bayesian network (DAG)]

These are the relations defined in the graph.

P(D),
P(E),
P(A),
P(C),
P(G|D, E, A),
P(I|E, A, C),
P(T|E, C),
P(F|E, A, C),
P(S|A),
P(J|G, I)

Hello, I have this network that shows a graph (graphical model). The directed edges represent conditional probabilities. I need to compute the number of parameters for the whole graph. I know the formulas K - 1, where K is the number of states, and K^M - 1 when there are M variables. The problem here is that the number of states each node can assume differs.

Here are the number of states each random variable can have.

[('Difficulty', 'D', 3), ('Effort', 'E', 3), ('Aptitute', 'A', 5), ('Confidence', 'C', 3), ('Grade', 'G', 5), ('Interview', 'I', 3), ('TuteAttendence', 'T', 3), ('ForumParticipation', 'F', 3), ('SAT', 'S', 3), ('Job', 'J', 2)]

How do I compute the total number of parameters this model needs when it is fully connected versus when the DAG structure is taken into account?

I have made a python function that computes it as follows.

import numpy as np

def calculate_params(n):
    # n is a node with attributes .states (list of possible states) and .parents (list of parent nodes)
    n_states = len(n.states)
    if len(n.parents) == 0:
        return n_states
    else:
        p_states = []
        for p in n.parents:
            p_states.append(calculate_params(p) - 1)  # recursive call on the parent, minus 1
        return (n_states - 1) * np.prod(np.asarray(p_states))

print(calculate_params(J))

It outputs 1923. Is this answer correct?
Also, how do I compute the number of parameters when the graph is fully connected? In my opinion it should be (3**7 - 1) + (5**2) + (2**1 - 1) = 2212, but I am not sure.

What would be an easy correct way to compute this?
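
For reference (not part of my original attempt), here is a minimal sketch of the usual counting rule: a node with K states whose parents have K_1, ..., K_m states contributes (K - 1) * K_1 * ... * K_m free parameters to the DAG, and a fully connected model over the same variables has prod(K_i) - 1 free parameters. The state counts and parent sets below are copied from the lists above.

import numpy as np

states = {'D': 3, 'E': 3, 'A': 5, 'C': 3, 'G': 5,
          'I': 3, 'T': 3, 'F': 3, 'S': 3, 'J': 2}
parents = {'D': [], 'E': [], 'A': [], 'C': [],
           'G': ['D', 'E', 'A'], 'I': ['E', 'A', 'C'], 'T': ['E', 'C'],
           'F': ['E', 'A', 'C'], 'S': ['A'], 'J': ['G', 'I']}

# each CPT P(X | parents) has (K_X - 1) free entries per parent configuration
dag_params = sum((states[v] - 1) * int(np.prod([states[p] for p in parents[v]]))
                 for v in states)
print(dag_params)

# a fully connected model is one big joint table over all variables
full_params = int(np.prod(list(states.values()))) - 1
print(full_params)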


Get this bounty!!!

#StackBounty: #bayesian #multiarmed-bandit #uninformative-prior Batches of bayesian updates for gaussian with unknown variance differen…

Bounty: 50

I’m working on a project where I continuously (in batches) update the pdf estimate of a normally distributed quantity. The variance is unknown, so I’m using the equations given in section 4.1.2 of this book:

[Image: the update equations from the book]

I would expect that multiple small updates would lead to approximately the same parameters as one single big update. This is obviously true for the updates of the mean, but not so clear for the standard deviation. If we simulate it:

from scipy.stats import norm
import numpy as np

np.random.seed(42)
n_0 = 0
m_0 = 0
s_0 = 0
cum_obs = np.array([])
batch_size = 10
n_batches = 100
for i in range(n_batches):
    batch_obs = norm.rvs(loc=10, scale=5, size=batch_size)
    cum_obs = np.concatenate([cum_obs, batch_obs])

    n = batch_size
    nu_0 = -1 + n_0
    nu = nu_0 + n
    x_bar = np.mean(batch_obs)
    s = np.std(batch_obs)

    var = (
            (1 / nu)
            * (
                    (s ** 2) * (n - 1)
                    + (s_0 ** 2) * nu_0
                    + n_0 * n * ((x_bar - m_0) ** 2) / (n_0 + n)
            )
    )

    mult = (1 / nu) * (1 / (n + n_0))

    m_0 = x_bar
    n_0 += n
    s0 = s  # the s0 vs s_0 typo mentioned in the Edit below

    s = np.sqrt(var)
    if i == n_batches - 1:
        print("...n")
    if i < 5 or i == n_batches - 1:
        print("Using update rules:tttt", s)
        print("Computing the std for all observed data:t", cum_obs.std())
        print("")
Using update rules:                          3.429529651472475
Computing the std for all observed data:     3.429529651472475

Using update rules:                          4.02186514878462
Computing the std for all observed data:     4.67859979446889

Using update rules:                          2.5465800067817965
Computing the std for all observed data:     4.424395957981525

Using update rules:                          2.6167707409764502
Computing the std for all observed data:     4.704117427614569

Using update rules:                          1.83449085034948
Computing the std for all observed data:     4.621424689362122

Using update rules:                          0.4917832154508146
Computing the std for all observed data:     4.893631038736771

I set the priors so that the first update is equal in the two computations, but as the batches go on, the two results get further and further apart, and the presented equations drift away from the expected value of 5.

I would like to know if there’s something fundamentally wrong in what I’m doing, or whether I should even expect similar results from the two computations.

Edit: I noticed a typo in my code (s0 vs s_0) and the results make more sense now, but the standard deviations computed with the update equations still seem less precise.
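
For reference, here is a minimal sketch of my own (the helper nix_update and the weak priors are assumptions, so the notation may differ from the book's) that carries all four posterior hyperparameters forward between batches; with conjugate updates the batched result should then match a single update on all the data.

from scipy.stats import norm
import numpy as np

def nix_update(m0, k0, nu0, s2_0, x):
    # One conjugate update of the Normal-Inverse-Chi-squared hyperparameters
    # for a batch of observations x (unknown mean and variance).
    n = len(x)
    xbar = x.mean()
    ss = ((x - xbar) ** 2).sum()
    k_n = k0 + n
    m_n = (k0 * m0 + n * xbar) / k_n
    nu_n = nu0 + n
    s2_n = (nu0 * s2_0 + ss + k0 * n * (xbar - m0) ** 2 / k_n) / nu_n
    return m_n, k_n, nu_n, s2_n

np.random.seed(42)
data = norm.rvs(loc=10, scale=5, size=1000)

# sequential updates in batches of 10, starting from a weak prior
m, k, nu, s2 = 0.0, 1e-6, 1e-6, 0.0
for batch in data.reshape(100, 10):
    m, k, nu, s2 = nix_update(m, k, nu, s2, batch)

# one single update on all 1000 observations with the same prior
m_all, k_all, nu_all, s2_all = nix_update(0.0, 1e-6, 1e-6, 0.0, data)

print(np.sqrt(s2), np.sqrt(s2_all))  # equal up to floating point, both close to 5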


Get this bounty!!!

#StackBounty: #bayesian #spatial #autoregressive #hierarchical-bayesian #gibbs Sampling Bayesian Intrinsic CAR Model

Bounty: 50

I’m having trouble understanding how to sample from a spatial model that includes an intrinsic conditional autoregressive (ICAR) component. There is a vast amount of material out there that defines the model, but much less in terms of computational details. Given is some spatial data $y$ and $x$ in areas $i$ and periods $t$. For simplicity, $y$ is modeled in a Gaussian framework such that

$y_{it} = x_{it}\beta_i + \epsilon_{it}, \quad \epsilon_{it} \sim N(0, \sigma^2)$.

I’d like to spatially smooth $\beta_i$ through a hierarchical prior. For this, I model $\beta_i$ as $\beta_i = \alpha + \phi_i$, where $\alpha$ is a common mean of the coefficients and $\phi_i$ is a spatial component that, conditional on the neighboring components of $i$ (denoted $\phi_{i\sim j}$), follows an ICAR process:

$\phi_i \mid \phi_{i\sim j} \sim N(\phi_i^N, \tau^2 / N_i)$

where $N_i$ denotes the number of neighbors of unit $i$ and $\phi_i^N$ is the average value of the spatial effects $\phi$ in the neighboring units of $i$.
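
As a small illustration of this notation (a binary, symmetric adjacency matrix $W$ and all values below are made-up placeholders), $N_i$ and $\phi_i^N$ can be computed as:

import numpy as np

# hypothetical 4-area example with a symmetric binary adjacency matrix W
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
phi = np.array([0.3, -0.1, 0.2, -0.4])  # current values of the spatial effects
tau2 = 1.0

N = W.sum(axis=1)        # N_i: number of neighbors of each area
phi_bar = W @ phi / N    # phi_i^N: average of phi over the neighbors of i

# conditional ICAR prior for area i: phi_i | phi_{i~j} ~ N(phi_bar[i], tau2 / N[i])
print(phi_bar, tau2 / N)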

This translates into a (conditional) prior $\beta_i \sim N(\alpha + \phi^N_i, \tau^2 / N_i)$. Given suitable priors for $\alpha$ and the variances $\tau^2$ and $\sigma^2$, is standard Gibbs sampling permissible here? All implementations I found use some variant of Metropolis-Hastings at some point.

Proposed sampler:

1.) update $\beta_i$ using the linear regression that models the data

1.a) compute $\phi^N_i$ as the average of $\phi_{i \sim j}$ over the neighbors of $i$, using $\phi_i = \beta_i - \alpha$

2.) update $\alpha$ using the regression model $\beta_i - \phi^N_i = \alpha + \nu_i$ where $\nu_i \sim N(0, \tau^2 / N_i)$

3.) update the variance $\tau^2$ using the residuals $\beta_i - \alpha - \phi^N_i$

4.) update the variance $\sigma^2$ using the residuals $y_{it} - x_{it}\beta_i$

I am a bit confused, as $\phi$ is never sampled directly here; it only appears as a conditioning variable that is computed as $\phi_i = \beta_i - \alpha$.


Get this bounty!!!

#StackBounty: #self-study #bayesian #mathematical-statistics #markov-chain-montecarlo #graphical-model Inference in Dirichlet process m…

Bounty: 50

I need to cluster some data $\{x_1, \ldots, x_n\}$ through a Dirichlet process mixture model.
Consider the following Dirichlet process mixture model, in which the base measure is a $NIW(\mu_0, \lambda_0, \nu_0, S)$ distribution and the random probability measure $G = \sum_{k=1}^{\infty} \pi_k \delta_{(\mu_k, \Sigma_k)}$ is constructed according to stick breaking:

$$\pi \sim GEM(\alpha)$$
$$\{\mu_k, \Sigma_k\} \sim NIW(\mu_0, \lambda_0, \nu_0, S)$$
$$z_i \mid \pi \sim \pi$$
$$x_i \mid z_i, \{\mu_k, \Sigma_k\} \sim N(\mu_{z_i}, \Sigma_{z_i})$$

I need to design a collapsed Gibbs sampler. I also need to estimate $\{\mu_k, \Sigma_k\}$ a posteriori for all clusters in the data.
It is clear that this requires integrating out $\pi$ and $\{\mu_k, \Sigma_k\}$, leaving only the $z_i$’s to be sampled. However, it is not clear to me how to formally obtain the $\{\mu_k, \Sigma_k\}$ after the collapsed Gibbs sampling process. And what about the $\pi_k$’s after the collapsed Gibbs sampling process?
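
For concreteness, here is a sketch of what I think such a draw would look like if one conditions on the final assignments $z$ and exploits NIW conjugacy (the helper below is my own guess, not taken from a reference implementation): each occupied cluster's $\{\mu_k, \Sigma_k\}$ would be drawn from an NIW posterior that depends only on the points currently assigned to that cluster.

import numpy as np
from scipy.stats import invwishart, multivariate_normal

def draw_niw_posterior(X_k, mu0, lam0, nu0, S0):
    # Draw (mu_k, Sigma_k) from the NIW posterior given the n x d data matrix
    # X_k of the points assigned to cluster k.
    n, d = X_k.shape
    xbar = X_k.mean(axis=0)
    C = (X_k - xbar).T @ (X_k - xbar)           # scatter around the cluster mean
    lam_n = lam0 + n
    nu_n = nu0 + n
    mu_n = (lam0 * mu0 + n * xbar) / lam_n
    S_n = S0 + C + (lam0 * n / lam_n) * np.outer(xbar - mu0, xbar - mu0)
    Sigma = invwishart.rvs(df=nu_n, scale=S_n)                   # Sigma_k | X_k
    mu = multivariate_normal.rvs(mean=mu_n, cov=Sigma / lam_n)   # mu_k | Sigma_k, X_k
    return mu, Sigma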


Get this bounty!!!

#StackBounty: #bayesian #markov-chain-montecarlo #missing-data #multiple-imputation Best way to combine MCMC inference with multiple im…

Bounty: 50

I can derive an MCMC algorithm for sampling from the posterior distribution of a parameter vector of interest, but only starting with a dataset that has no missing values. The actual dataset that I want to use for inference has substantial missingness in its covariates.

One approach would be to build a more complex MCMC algorithm that, for example, first fills in the missing data with draws from the missing values’ posterior predictive distribution. However, this feels intractably hard.

What I’d rather do is use an off-the-shelf method to generate multiple imputations of the dataset, then run my existing MCMC algorithm on each imputed, complete dataset and then recombine into final estimates of (for example) a posterior expectation or a posterior interval for a parameter of interest.

Is there a body of literature that attempts to solve problems in this way? Or is there a much better way to do this? Or is this approach wrong-headed or infeasible? Any pointers would be helpful.


Get this bounty!!!

#StackBounty: #bayesian #change-point Modelling a multivariate changepoint analysis

Bounty: 50

I’m trying to model a multivariate multiple change point analysis. The data I have is as follows:

Multiple sensors, positive count data, multiple change points and a time series.

For example, imagine I have a bucket of oranges with n unreliable camera sensors. Each day I want to measure how many oranges I have based on the readings from my cameras (I don’t know anything about the accuracy of the cameras – some may be good, some bad).

        Day 1   Day 2   Day 3   Day 4
Cam 1   0       1       2       2
Cam 2   0       1       2       2
Cam 3   0       1       3       2

I would want to identify Day 2 and Day 3 (with some uncertainty, given that Cam 3 disagrees).

Optional extra:
There is the additional and perhaps somewhat confusing rule that if multiple oranges are added they can only be removed one at a time. If the oranges were added one at a time, they can all be removed at once.

For example, the following is not permitted:

        Day 1   Day 2   Day 3   Day 4
$y$     0       2       0       0

Please help me describe the model in the usual notation, e.g.:

$\tau \sim DiscreteUniform(0, 5)$
$\lambda_1 \sim Exp()$
$\lambda_2 \sim Exp()$
$y = Poisson(\tau > \lambda_1)$


Get this bounty!!!

#StackBounty: #bayesian #jags #stan Example where the posterior from Jags and Stan are really different and have real impacts on decisi…

Bounty: 50

I have seen many people claiming Stan is "much better" than JAGS, meaning roughly this: "although JAGS is much faster, the quality of the samples is worse, so it’s worth waiting for the much longer time that Stan takes, because you 'better' compute the uncertainty."

Ok, so far so good. But… I have not seen any good examples backing up that practical claim. By that I mean examples in which the posterior distributions output by the two samplers are so different that they make a substantive difference (for instance, lead to a different decision), so that the burden of using Stan would be worth the time and pay off.

Could you provide such an example (with model and code)?


Get this bounty!!!

#StackBounty: #time-series #bayesian #econometrics #hierarchical-bayesian How would a Bayesian answer this question from Jeffrey Wooldr…

Bounty: 50

Jeffrey Wooldridge, a famous econometrician, posed the following question to Bayesians on twitter:

I think frequentists and Bayesians are not yet on the same page, and
it has little to do with philosophy. It seems some Bayesians think a
proper response to clustering standard errors is to specify an HLM.
But in the linear case, HLM leads to GLS, not OLS.

Moreover, a Bayesian would take the HLM structure seriously in all
respects: variance and correlation structure and distribution. I’m
happy to use an HLM to improve efficiency over pooled estimation, but
I would cluster my standard errors, anyway. A Bayesian would not.

There still seems to be a general confusion that fully specifying
everything and using a GLS or joint MLE is a costless alternative to
pooled methods that use few assumptions. And the Bayesian approach is
particularly unfair to pooled methods.

One only needs to think of something like a simple time series
regression with serial correlation. I think there are four common
things one might do.

  1. OLS with usual (nonrobust) SEs
  2. OLS with Newey-West SEs
  3. Prais-Winsten with usual SEs
  4. P-W with N-W SEs

In my view, choice (3) is almost as bad as (1). Choices (2) and (4)
make sense, with (4) requiring strict exogeneity. But at least we’re
then comparing apples with apples.

Again, what is the Bayesian version of (4) after priors and
distributional assumptions are imposed?

What is the answer to Wooldridge’s question?

Glossary of acronyms:

HLM = Hierarchical Linear Model, GLS = Generalized Least Squares, OLS = Ordinary Least Squares, SE = Standard Error, MLE = Maximum Likelihood Estimator.


Get this bounty!!!

#StackBounty: #regression #time-series #bayesian #latent-variable Latent Learning of Skill in Dance Dance Revolution

Bounty: 200

I’m trying to model and plot the rate of learning Dance Dance Revolution over the past year.

I have data of the format below, with the score received for a song on a specific date and difficulty level. Songs have been played many times over the year, at a variety of difficulty levels. I’d like to model the skill learned over the year.

Date Played:        March 12, 2021
Song:               Girls Just Wanna Have Fun
Difficulty Level:   Expert
Game Version:       DDR Supernova
Score:              8381742
Letter Grade:       A
Num Perfects:       153
Num Greats:         52
Num Goods:          4
Num Almosts:        0
Num Boos:           7
Num O.K.s:          23
Max Combo:          69

Each song is unique per game version, and the scores across game versions are not standardized (some are in the hundred-thousand range and some in the thousand range). There are a total of 8 game versions, 5 difficulty levels, and ~100 songs.

A trivial solution would be to do a linear regression of the score over days, but this would not take into account differences by song, difficulty level, or game version.

Any ideas how to model the learning rate in a less trivial way?


Get this bounty!!!

#StackBounty: #bayesian #mathematical-statistics #multiple-regression #lognormal-distribution #back-transformation Bayesian lognormal m…

Bounty: 50

I have a Bayesian model of the form:

$$
\begin{align}
y & \sim \text{logNormal}(\mu, \sigma)\\
\mu_n & = \alpha + \beta_0 c_n + \beta_1 d_n + \beta_2 c_n d_n
\end{align}
$$

Where:

  • $y$ is a variable measured in ms
  • $c$ is a sum-contrast coded variable equal to 1 or -1.
  • $d$ is a scaled and centered continuous variable

This omits the assumed prior distributions and the slope and intercept adjustments; I believe they are not important with respect to the main point.
Since the likelihood is logNormal, the results are on the log scale.

I want to back-transform the estimates from the model and compute the effect sizes of different variables. I think I know how to do this for $c$:

$$
\mu|_{c=1} - \mu|_{c=-1} = \exp(\alpha + \beta_0 + \beta_1 d_n + \beta_2 d_n) - \exp(\alpha - \beta_0 + \beta_1 d_n - \beta_2 d_n)
$$

For the interaction this is slightly more complicated. Let the interaction effect be
$IE = \frac{\partial \mu}{\partial d}\big|_{c = 1} - \frac{\partial \mu}{\partial d}\big|_{c = -1}$. I remove the log by exponentiation and then take the derivative, which results in:

$$
\frac{\partial \mu}{\partial d} = \exp(\alpha + \beta_0 c_n + \beta_1 d_n + \beta_2 c_n d_n) (\beta_1 + \beta_2 c_n)
$$
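
(As a quick sanity check on the algebra, not part of the model itself, symbolic differentiation with sympy reproduces the expression above:)

import sympy as sp

alpha, b0, b1, b2, c, d = sp.symbols('alpha beta_0 beta_1 beta_2 c d')
mu = alpha + b0 * c + b1 * d + b2 * c * d

deriv = sp.diff(sp.exp(mu), d)          # derivative of exp(mu) with respect to d
claimed = sp.exp(mu) * (b1 + b2 * c)    # the expression written above
print(sp.simplify(deriv - claimed))     # prints 0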

So the interaction effect is the difference of these two terms:

$$
\begin{align}
\frac{\partial \mu}{\partial d}\Big|_{c = 1} = & \exp(\alpha + \beta_0 + \beta_1 d_n + \beta_2 d_n) (\beta_1 + \beta_2) \\
\frac{\partial \mu}{\partial d}\Big|_{c = -1} = & \exp(\alpha - \beta_0 + \beta_1 d_n - \beta_2 d_n) (\beta_1 - \beta_2) \\
\end{align}
$$

My questions are:

  1. Are the derivations for the interaction effect and the effect of $c$ correct?
    If yes, this means that these effects are functions of $d$, but fitting the model returns the effects (on the log scale) as constants. Where does this difference come from?
  2. How do I derive and interpret the effect for $d$? I.e., what are the meaningful points at which to evaluate the function $\mu$? Or is there maybe a better way of doing this?


Get this bounty!!!