## #StackBounty: #bayesian #multivariate-analysis #psychometrics #irt #marketing Bayesian Multiple Item Response Models?

### Bounty: 50

I’m interested in extending the ideas of IRT models beyond binary outcomes, right/wrong, to multiple outcomes. The main interest area is for marketing: Suppose a survey asks participants, "What is the biggest deciding factor in whether or not you make a purchase?" From the answer set {price, durability, ease of use/setup}. Perhaps not all participants evaluate every single item, so data sparsity exists.

In a traditional IRT model, two classes of factors are learned: student ability and question difficulty (usually for cognitive modeling.) So these factors are one-dimensional, a smart student performs well on most questions, a hard question stumps most students, etc. The simplest form is `sigmoid(ability - difficulty)`. Somewhat like logistic regression, Bernoulli, Binomial, or Beta likelihoods can be used.

I’m interested in inferring a 3D vector for each individual and each product. It would be very useful for me to know, for example, a broomstick: `[price=0.5, durability=0.5, ease_of_use=0]` (and similar findings for individuals; it would help me understand how they make purchase decisions.)

My instinct is to take the element-wise difference between vectors for any given {individual:product} pair and pass this through a softmax function, then pass this transformed parameter to a Categorical, Multinomial, or Dirichlet likelihood. (Thoughts?)

Can anyone recommend a Bayesian modeling strategy? (What priors worked well, what was your likelihood function, etc.)

Get this bounty!!!

## Introduction

I am trying to learn Bayesian methods, and to that end, I picked up an application of interest to me to develop the concepts in practice.

## Context

Suppose I wrote an initial version of a performance-sensitive piece of software, and want to optimize its execution time. I may have a baseline version and an "improved" version (or at least, I suspect it may be an improvement — I need to measure).

I’m looking to quantify how likely it is that this new version is actually be an improvement (as opposed to being equivalent or possibly even worse than the baseline), as well as how much — is it 20% faster? 100% faster? 10% slower? Also I’d like to give credible intervals rather than only point estimates of the speedup.

To that end, I time a number of runs of the two version of the software, trying to keep all other factors the same (input data, hardware, OS, etc.) I also try to kill every running app and service, and even turn off networking, to make sure that, to the extent possible by modern feature-heavy code, these apps have the CPU all to themselves. I also disable Turbo Boost on my CPU to prevent CPU clock rate changes over time and temperature, and run my fans at maximum to minimize the change of CPU thermal throttling (and in practice my computer’s thermal solution is good enough that I’ve never seen this happen). I’ve tried to restrict the portion of the code being measured to the computational part only, so no I/O to add variability.

Despite my best efforts, this is not an embedded system with a single-core processor running on bare metal, so there is some variability, possibly due to OS processes that remain and take up a little bit of CPU, CPU affinity of processes, as well as microarchitectural sources of variability such as caches, out of order execution and hyperthreading.

## Current model and code

Currently I’m using the BEST model, implemented by the following code in Python using PyMC3 (heavily inspired by the linked document), in case it is of interest. The arguments are timings of the baseline version (`baseline`) and improved version (`opt`):

``````def statistical_analysis(baseline, opt):
# Inspired by https://docs.pymc.io/notebooks/BEST.html
y = pd.DataFrame(
dict(
value=np.r_[baseline, opt],
group=np.r_[['baseline']*len(baseline), ['opt']*len(opt)]
)
)

μ_m = y.value.mean()
μ_s = y.value.std()
σ_low = µ_s/1000
σ_high = µ_s*1000

with pm.Model() as model:
baseline_mean = pm.Normal('baseline_mean', mu=μ_m, sd=1000*μ_s)
opt_mean = pm.Normal('opt_mean', mu=μ_m, sd=1000*μ_s)
baseline_std = pm.Uniform('baseline_std', lower=µ_s/1000,
upper=1000*µ_s)
opt_std = pm.Uniform('opt_std', lower=µ_s/1000, upper=1000*µ_s)
ν = pm.Exponential('ν_minus_one', 1/29.) + 1
λ_baseline = baseline_std**-2
λ_opt = opt_std**-2

dist_baseline = pm.StudentT('baseline', nu=ν, mu=baseline_mean,
lam=λ_baseline, observed=baseline)
dist_opt = pm.StudentT('opt', nu=ν, mu=opt_mean,
lam=λ_opt, observed=opt)

diff_of_means = pm.Deterministic('difference of means',
baseline_mean - opt_mean)
ratio_of_means = pm.Deterministic('ratio of means',
baseline_mean/opt_mean)

trace = pm.sample(draws=3000,tune=2000)

baseline_hdi = az.hdi(trace['baseline_mean'])
baseline_out = (baseline_hdi[0],
trace['baseline_mean'].mean(),
baseline_hdi[1])

opt_hdi = az.hdi(trace['opt_mean'])
opt_out = (opt_hdi[0], trace['opt_mean'].mean(), opt_hdi[1])

speedup_hdi = az.hdi(trace['ratio of means'])
speedup = (speedup_hdi[0],
trace['ratio of means'].mean(),
speedup_hdi[1])

dif = trace['difference of means'] > 0
prob = (dif > 0).sum()/len(dif)

return (baseline_out, opt_out, speedup, prob)
``````

The `prob` variable indicates how likely it is that a difference exists, and `speedup` includes the mean as well as 95% HDI for the ratio of execution time of the baseline version to the improved version. The remaining variables are the mean as well as 95% HDI of the execution time of baseline and improved versions.

## Issues with the model

The BEST model assumes a Student t-distribution for the values of the execution time, but I have a hunch this is not an adequate modeling assumption.

Given a certain piece of code, one could in principle tally up every single instruction executed, and figure out exactly how fast an "undisturbed" CPU could run it, given the amount of execution resources like ALUs and load/store units, the latency of each instruction, etc. Therefore, there exists a minimum value, bounded by the CPU hardware capabilities, such that the code will never run faster than this. We cannot measure this minimum, though, because the measurements are contaminated by the sources of noise previously mentioned.

Thus, I’d like to think that my model should be the sum of a constant value (the minimum) and some distribution with positive values only, and probably a heavy tailed one, seeing as some outlier event may happen during the execution of the code (the system decides to update an app, or run a backup, or whatever).

## Edit: some data

To give an idea of the kind of distribution that may be found in practice, I measured 5000 executions of the serial and a parallel version of the same code, for the same input data, and generated histograms for both, with 250 bins each. I’m not claiming this is necessarily representative, but it shows how inadequate the Student t-distribution is for this problem.

First, the serial version:

And now for the parallel version:

## The question

This leads me to the question:

What are some distributions that might be a good fit to this model?

Get this bounty!!!

## #StackBounty: #bayesian #graphical-model #latent-variable #irt #stan How to incorporate multiple likelihoods in a probabilistic graphic…

### Bounty: 50

Data composition: In beta testing of a video game, users were assigned tasks in a many-to-many relationship. At the end of every day, users were asked to self evaluate (for each task) whether they progressed, stagnated, or atrophied on the task assigned. (In this game, if too long of a period elapsed, challenges, enemies, etc. repopulated.)

Note the current and ending classes are ‘one hot’:

``````user, task, curr_1, curr_2, curr_3, end_1, end_2, end_3, lag
1,    A,      1,      0,      0,     0,     1,     0,   0
1,    A,      0,      1,      0,     0,     0,     0,   1
...
``````

Analysis: I’m developing a Bayesian probabilistic graphical model that has two components: (A) Models the ending class given: {current class, number of days} where there are 3 unique starting/end classes: {progression, stagnation, atrophy} and (B) A latent factor model that learns the user and item associations with these classes taking inspiration from (multiple) item response theory.

Parameters:

• C (3,3) matrix of coefficients representing the relative strength of association between current class and ending class.
• B (3,3) matrix of coefficients representing the sensitivity to lag (time since first day) of the association between current and ending classes. (ie atrophy -> atrophy might have a strong relationship but not have a strong sensitivity to time elapsed whereas stagnation -> atrophy might have a weak relationship, but be very sensitive to time elapsed.)
• Theta (1,3) vector The predicted ending class given current class and time elapsed.
• z_user (N,3) matrix where each user has an association with each ending class
• z_task (K,3) matrix where each task has an association with each ending class

Model:

For each row in data (user, item, current_class, ending_class, time elapsed):

``````v = C*curr_class + B*C*time // vector, all but 1 element will be 0 due to one-hot nature
theta = softmax(v)
ending_class ~Dirichlet(theta, ending_class) //likelihood
theta ~Dirichlet(ending_class, z_user - z_task) //likelihood
z_user ~Normal(0,1) //prior
C ~Normal(0,1) //prior
B ~Normal(0,1) //prior
``````

My questions:

1. In my design, I’m using two likelihood functions where theta can be penalized by the ending class in two different ways: Either <C,B> OR <z_item,z_user> needs updating (or potentially all four.) Is the scheme/design valid? Should I alternate use likelihood functions, holding one "side of the problem" constant will evaluating the other? These are very generic priors, would you recommend others?
2. I’m struggling to implement this into Stan and would request some help (optional)
3. Any other advice you’d like to offer?

Edit1: Why this is useful to me: Without the time component, I’m able to learn task difficulty and user ability. This helps rank users and tasks in terms of their ability/difficulty. However, the problem is that it doesn’t give us the ability to quickly assess new users and tasks; with the time component, this could greatly speed up our ability to accurately infer where a new user and/or task stand.

For example, we don’t want to say "This brand new task is the hardest thing we’ve ever made!" Then find out that our model only believed it to be so because a handful of users tried it over a short period of time.)

Get this bounty!!!

## #StackBounty: #bayesian #anova #bayesian-network How to report the parameter learning results from dynamic Bayesian network for test an…

### Bounty: 50

I have a Dynamic Bayesian Network which I used in my research. Network is shown below:

It was employed in an educational video game and I ran the experiment for test and control groups separately. Parameter learning for the two groups resulted in different conditional probabilities and I could explain the differences that occurred. Below is an example of the learned probabilities for the test group:

I have a lot more tables for the param learning results and have two questions in this regard. First, is there a better way to report these results instead of using so many tables? Second, similar to what we do in classical ANOVA statistics where we compare the results from test and control group, is there a good way to compare the differences in Bayesian statistics? Is there a better way to interpret the results?

Get this bounty!!!

## #StackBounty: #bayesian #conditional-probability The probability of a sequence of conditional probabilities with a strict ordering

### Bounty: 50

Let’s assume that we are observing a sequence of choices by an individual. Each choice situation is indexed by $$t = 1,…, T$$ and ordered such that a choice made in $$t = 1$$ precedes a choice made in $$t = 2$$. Furthermore, we cannot observe a choice in $$t = 2$$ unless a particular choice was made in $$t = 1$$. As an observer, I am interested in calculating the probability of observing a particular sequence of choices made by an individual.

The problem can be structured as a decision tree like this:

It is obvious from the structure of the problem that observing a choice in $$t = 2$$ is conditional on choosing $$S_1$$ and that observing a choice in $$t = 3$$ is conditional on choosing $$S_1$$ and $$S_2$$ in $$t = 1$$ and $$t = 2$$ respectively. To solve this I thought of using Bayes’ theorem, which states that the conditional probability of $$A$$ given $$B$$ is:

$$P(A|B) = frac{P(B|A)P(A)}{P(B)}$$

To put this in the context of the current problem, let $$A$$ be the probability of observing a choice in $$t$$ and $$B$$ be the probability that you chose $$S$$ in $$t-1$$. Now, $$P(B|A) = 1$$ because the probability that you chose $$S$$ in $$t-1$$ conditional on us observing a choice in period $$t$$ is known with certainty given the strict ordering on $$t$$. This means that $$P(A|B)$$, i.e. the probability of observing a choice conditional on choosing $$S$$ in the previous period, reduces to the ratio $$P(A)/P(B)$$. However, there is no guarantee that $$P(A) < P(B)$$ which means that $$P(A|B)$$ is no longer bound by the unit interval. Now this creates obvious problems.

Let us look at a numerical example to illustrate the practical implications of this. In $$t=1$$ the probability of choosing $$S$$ is .57, in $$t=2$$ it’s 0.34 and in $$t=3$$ it is .73. The probabilities of the others vary, which can happen, but they are specifically chosen extreme here to illustrate the problem.

Now, applying Bayes’ theorem above, I get the following:

Where $$P(A|B)$$ in period $$t-1$$ is $$P(B)$$ in period $$t$$. Finally, the probability of observing the sequence of choices is the product over the conditional probabilities. However, given the likelihood of getting $$P(A|B) > 1$$, I am concerned that this may not be the correct application of the theorem or even if it is possible in sequence like this. In practice, the sequence of observed choices can be very long.

Get this bounty!!!

## #StackBounty: #r #bayesian #mcmc #prior #hierarchical-bayesian For Prior definition in bayesian regression with R package MCMCglmm, how…

### Bounty: 50

I understand the strength of the Prior is set via parameter nu however, I can not find information what nu expresses in statistical terms, e.g. how strong would a prior that is similar to the number of variables x be in this example?

``````#Inverse Wishart (multivariate, variables=x)

prior.miw<-list(R=list(V=diag(x), nu=x),G=list(G1=list(V=diag(x),
nu=x)))
``````

I also saw a lot of examples for weak priors with nu=0.01, does it mean we have a 1/100 degree of belief in the prior compared to the posterior?

Get this bounty!!!

## #StackBounty: #regression #bayesian #covariance Interpretation of multiple regressions posterior distribution

### Bounty: 50

I’m interested in how we evaluate performance of Bayesian regression (linear, multiple, logistic, etc.) The posterior distribution will capture the relative likeliness of any parameter combination. So a 2D heatmap, for example of B1 and B2 (coefficients) might give us some insight into their relationship.

Recently, a colleague of mine mentioned that the posterior’s covariance matrix is effectively "all you need." I want to ask, is this oversimplifying the matter (and even if so) what does the posterior covariance matrix tell you?

My guesses are:

(1) Along the diagonal you get single parameter’s variance. The lower the number, the more confidence we have in the estimate. Whereas high variance might indicate that we’re less confident in our estimate.

(2) Covariance between parameters might be trickier to interpret. The direction (+/-) of the covariance might give an indication of the nature of the relationship (is an increase in one parameter associated with an increase, decrease or neither in the other.)

(3) The magnitude of the covariance gives me pause. Does a small value imply high confidence in the relationship or little to no association? (Very different meanings!)

(4) I can imagine a situation where the variance of B1 is quite small, so perhaps we’re confident in the estimate, whereas the variance of B2 might be rather large, so less confident. I’m not sure how this would affect our understanding of covariance direction and magnitude.

*All the above assumes proper analysis, no multicollinearity, collider bias, etc.

Any thoughts?

Get this bounty!!!

## #StackBounty: #r #regression #logistic #bayesian log-odds and it's standard error as priors in logistic regression

### Bounty: 50

I’m attempting to complete a Bayesian logistic regression with the outcome of whether or not a crash occurred. I have various covariates in my model that are widely used to predict crash occurrence. As such, I’m using informed priors from prior publications that report the odds ratio and it’s 95% C.I for each covariate.

Here’s an example of a prior provided by the model I’m pulling from

crash at night (OR 13.1; 95% CI 5.0 to 31.5) : log-odds (1.12,.20) from $$frac{log(31.5-5)}{3.92}$$

I wanted to apply the log-odds of these results and their standard error in my updated model as priors. My first thought was to apply the log-odds and it’s a standard error on a normal prior. I’m using logic from the sources 1 & 2 listed at the end of the post.

My question, if my assumptions about applying these log-odds and SE’s on a normal prior are correct, can I simply transform the SE of the log odds to variance and implement?

a normal prior:

βk = (μβk2βk)

requires a variance rather than an SE. According to citation 3 log-odds SE and be transformed into log-odds VAR:

$$SE[log(OR)] = sqrt{VAR[log(OR)]} => SE^2 = VAR[log(OR)]$$

therefore, if I square the standard error x then I should be able to apply this as my final prior:

βk = (1.12,.04)

Is this assumption correct or am I way off? Is there a better way of implementing log-odd priors and their SE’s into a logistic regression model?

Thanks!

1. AdamO (https://stats.stackexchange.com/users/8013/adamo), Prior for Bayesian multiple logistic regression, URL (version: 2016-03-16): https://stats.stackexchange.com/q/202046

"Basically, you have the flexibility to parametrize estimation however
you see fit, but using a model which is linear on the log odds scale
makes sense for many reasons. Furthermore, using a normal prior for
log odds ratios should give you very approximately normal posteriors."

1. Sander Greenland, Bayesian perspectives for epidemiological research: I. Foundations and basic methods, International Journal of Epidemiology, Volume 35, Issue 3, June 2006, Pages 765–775, https://doi.org/10.1093/ije/dyi312

"To start, suppose we model these a priori ideas by placing 2:1 odds
on a relative risk (RR) between ½ and 2, and 95% probability on RR
between ¼ and 4. These bets would follow from a normal prior for the
log relative risk ln (RR) that satisfies…"

1. StatsStudent (https://stats.stackexchange.com/users/7962/statsstudent), How do I calculate the standard deviation of the log-odds?, URL (version: 2020-04-19): https://stats.stackexchange.com/q/266116

Get this bounty!!!

## #StackBounty: #r #regression #logistic #bayesian log-odds and it's standard error as priors in logistic regression

### Bounty: 50

I’m attempting to complete a Bayesian logistic regression with the outcome of whether or not a crash occurred. I have various covariates in my model that are widely used to predict crash occurrence. As such, I’m using informed priors from prior publications that report the odds ratio and it’s 95% C.I for each covariate.

Here’s an example of a prior provided by the model I’m pulling from

crash at night (OR 13.1; 95% CI 5.0 to 31.5) : log-odds (1.12,.20) from $$frac{log(31.5-5)}{3.92}$$

I wanted to apply the log-odds of these results and their standard error in my updated model as priors. My first thought was to apply the log-odds and it’s a standard error on a normal prior. I’m using logic from the sources 1 & 2 listed at the end of the post.

My question, if my assumptions about applying these log-odds and SE’s on a normal prior are correct, can I simply transform the SE of the log odds to variance and implement?

a normal prior:

βk = (μβk2βk)

requires a variance rather than an SE. According to citation 3 log-odds SE and be transformed into log-odds VAR:

$$SE[log(OR)] = sqrt{VAR[log(OR)]} => SE^2 = VAR[log(OR)]$$

therefore, if I square the standard error x then I should be able to apply this as my final prior:

βk = (1.12,.04)

Is this assumption correct or am I way off? Is there a better way of implementing log-odd priors and their SE’s into a logistic regression model?

Thanks!

1. AdamO (https://stats.stackexchange.com/users/8013/adamo), Prior for Bayesian multiple logistic regression, URL (version: 2016-03-16): https://stats.stackexchange.com/q/202046

"Basically, you have the flexibility to parametrize estimation however
you see fit, but using a model which is linear on the log odds scale
makes sense for many reasons. Furthermore, using a normal prior for
log odds ratios should give you very approximately normal posteriors."

1. Sander Greenland, Bayesian perspectives for epidemiological research: I. Foundations and basic methods, International Journal of Epidemiology, Volume 35, Issue 3, June 2006, Pages 765–775, https://doi.org/10.1093/ije/dyi312

"To start, suppose we model these a priori ideas by placing 2:1 odds
on a relative risk (RR) between ½ and 2, and 95% probability on RR
between ¼ and 4. These bets would follow from a normal prior for the
log relative risk ln (RR) that satisfies…"

1. StatsStudent (https://stats.stackexchange.com/users/7962/statsstudent), How do I calculate the standard deviation of the log-odds?, URL (version: 2020-04-19): https://stats.stackexchange.com/q/266116

Get this bounty!!!

## #StackBounty: #probability #bayesian #mathematical-statistics #posterior #conjugate-prior Help with the prior distribution

### Bounty: 50

The question is as follows:

Consider an SDOF mass-spring system. The value of the mass is known and is equal to 1 kg.
The value of the spring stiffness is unknow and based on the experience and judgement the following is assumed. Value of stiffness is in the following range [0.5, 1.5] N/m.

To have a more accurate estimate of the value of the stiffness an experiment is performed where in the natural frequency of the system is observed. The following observation are made:

``````  Observation 1     Freq = 1.021 rad/sec
Observation 2     Freq = 1.015 rad/sec
Observation 3     Freq = 0.994 rad/sec
Observation 4     Freq = 1.005 rad/sec
Observation 5     Freq = 0.989 rad/sec
``````
1. Based on the information provided write the functional form of prior PDF.
2. Plot the likelihood function with different number of observations.
3. Based on the information provided write the functional form of the posterior PDF.
4. Plot the posterior distribution.

My work so far:

spring constant $$k = sqrt{{w}/{m}}$$
m = 1kg, so $$w = k^{2}$$.

$$k sim Uniform(0.5, 1.5)$$,

so pdf of w = $$f(w) = 2w$$

where $$w epsilon [sqrt{0.5},sqrt{1.5}]$$

So prior distribution is linear in the range root(0.5), root(1.5).

$$Likelihood = L = 2^{5}(1.0211.015..0.989) approx 2.04772$$

This is what I have done so far. I am new to Bayesian inference and I am not sure how to proceed after this or if what I have done so far is correct. Pleas advice on how to find the posterior function.

Get this bounty!!!