#StackBounty: #bayesian #normal-distribution #marginal-distribution #integral marginalize over multiple Gaussian distributions

Bounty: 50

How can I marginalize over $H$ given the following distributions, where all of them are Gaussian?
$$\int P(X_1\mid\alpha H,\sigma_x^2)\,P(X_2\mid\beta H,\sigma_x^2)\,P(H\mid 0,\sigma^2_H)\,dH$$

For two Gaussians, the usual mathematical trick of "completing the square" could be used. I am wondering how to extend it to the integral above.
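
For what it's worth, here is a sketch of how completing the square extends, assuming the three factors are the densities $\mathcal{N}(X_1;\alpha H,\sigma_x^2)$, $\mathcal{N}(X_2;\beta H,\sigma_x^2)$ and $\mathcal{N}(H;0,\sigma_H^2)$. Collecting the terms quadratic and linear in $H$ reduces everything to a single Gaussian integral,
$$ \int \exp\!\Big(-\tfrac{a}{2}H^2 + bH\Big)\,dH = \sqrt{\frac{2\pi}{a}}\,\exp\!\Big(\frac{b^2}{2a}\Big), \qquad a = \frac{\alpha^2+\beta^2}{\sigma_x^2} + \frac{1}{\sigma_H^2}, \qquad b = \frac{\alpha X_1 + \beta X_2}{\sigma_x^2}, $$
which, after multiplying back the remaining constants, says that $(X_1, X_2)$ is jointly zero-mean Gaussian with covariance $\begin{pmatrix}\alpha^2\sigma_H^2+\sigma_x^2 & \alpha\beta\sigma_H^2\\ \alpha\beta\sigma_H^2 & \beta^2\sigma_H^2+\sigma_x^2\end{pmatrix}$.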


Get this bounty!!!

#StackBounty: #bayesian #estimation #bootstrap Is it acceptable to use Bootstrap/Jackknife to estimate the variance of a MAP estimator?

Bounty: 50

Suppose we obtain a point estimate using a maximum a posteriori estimator $\hat{\theta}_{MAP}$. Note that I'm aware Bayesian approaches generally do not seek point estimates, but suppose this is an example where a point estimator is specifically needed, and we want to incorporate some prior information using a prior distribution. Then we need some way to quantify the accuracy of our estimator $\hat{\theta}_{MAP}$, and suppose that an analytical approach to deriving it is not available.

My question is whether it is acceptable to use standard bootstrap or jackknife methods to estimate the variance of $\hat{\theta}_{MAP}$? For example, suppose we obtain thousands of bootstrap resamples, each with a bootstrap estimate $\hat{\theta}_{MAP}^b$, and estimate the standard error of our estimator based on these bootstrap sample estimates.

The reason I am confused is that all mentions of the standard bootstrap/jackknife that I have read consider it solely in the frequentist domain, and I can't really find any reference to the ordinary bootstrap/jackknife when it comes to Bayesian point estimates. Again, I guess this is because Bayesian approaches generally do not seek point estimates over descriptive statistics of the posterior.
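
To make the proposal concrete, here is a minimal sketch of the procedure described above, under an assumed toy model (normal likelihood with known variance and a normal prior on the mean, so the MAP has a closed form); the model, prior, and numbers are purely illustrative:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)   # toy data

def map_estimate(data, prior_mean=0.0, prior_var=10.0, noise_var=1.0):
    # MAP of a normal mean under a normal prior (here equal to the posterior mean)
    n = len(data)
    post_var = 1.0 / (n / noise_var + 1.0 / prior_var)
    return post_var * (data.sum() / noise_var + prior_mean / prior_var)

theta_map = map_estimate(x)

# Nonparametric bootstrap: recompute the MAP on each resample of the data
boot = np.array([map_estimate(rng.choice(x, size=len(x), replace=True))
                 for _ in range(2000)])
print(theta_map, boot.std(ddof=1))   # point estimate and its bootstrap standard error

Whether a standard error obtained this way is a meaningful summary for a MAP point estimate is exactly what is being asked.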


Get this bounty!!!

#StackBounty: #machine-learning #hypothesis-testing #bayesian #feature-selection #naive-bayes Confusion about 1- vs 2-tailed tests for …

Bounty: 50

I’m taking a course on pattern recognition and machine learning at the University of Qatar. I was given an old exam as a study guide, with answers marked by the TA. This is one of the questions:

Suppose $x_i\ (i=1,2,\dots,N)$ are attribute values for $N$ samples from
class $W_1$ with mean $\mu_1$, and $y_i\ (i=1,2,\dots,N)$ are attribute
values for $N$ samples from class $W_2$ with mean $\mu_2$.

For feature selection using hypothesis testing, how should the null
hypothesis $H_0$ be defined?

  • Option (A): $\qquad\mu_1 -\mu_2=0$

  • Option (B): $\qquad\mu_1 -\mu_2>0$

The correct answer is supposed to be (B).

Can someone explain this to me? I don't understand why (B) should be correct. What I learned from Theodoridis & Koutroumbas (2009), Pattern Recognition, 4th ed., section 5.3.2, pp. 273-274, implies the answer is (A):

[Image: excerpt from Theodoridis & Koutroumbas (2009), section 5.3.2, on hypothesis testing for feature selection]

Update: As one of the answers suggests, I think the problem setter has a classification view of this problem. So what is $H_0$ in this situation from a Bayesian perspective? I'm fairly sure that with this assumption (B) would be the answer, but I need the right assumption for this case.
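
For reference, the textbook's formulation (option (A)) corresponds to an ordinary two-sided, two-sample test on the attribute's class means; a minimal illustration with made-up data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100)   # attribute values for N samples from class W1
y = rng.normal(0.5, 1.0, size=100)   # attribute values for N samples from class W2

# Two-sided test of H0: mu_1 - mu_2 = 0 (option (A)); a small p-value
# suggests the attribute discriminates between the two classes.
t_stat, p_value = stats.ttest_ind(x, y, equal_var=True)
print(t_stat, p_value)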


Get this bounty!!!

#StackBounty: #bayesian #pymc #stan #marketing #customer-lifetime-value Why do Pareto/NBD models require custom likelihood functions in…

Bounty: 50

I'm interested in Bayesian modeling of customer lifetime value (CLV), preferably via PyMC3. I've found that research in this area started in the mid-to-late 1900s and has remained active since. It seems that some combination of Exponential, Poisson, Negative Binomial, Gamma, and Pareto distributions is frequently used. However, the likelihood function is not a "stock" distribution built into popular probabilistic programming tools (such as PyMC3 and Stan). As a consequence, various authors have derived custom likelihood functions to ensure successful posterior sampling.

Here are two such implementations:

  1. PyMC3 implementation
  2. R/Stan implementation

And this paper by Fader/Hardie seems to contribute some ideas central to the first link and, at the very least, inspirational to the second.

Here is my current understanding of the Bayesian CLV design:

  1. The number of purchases made by any given customer (regardless of dollar value) is Poisson distributed when (and only when) the customer is in an active state; so far, this reminds me of zero-inflated Poisson regression. Customers vary, but pooling shares information between customers, and the Gamma distribution is used to accomplish this effect in some way (see the sketch after this list).

  2. Each customer has a fixed but latent lifetime. The exponential distribution is used to model the time until the next period of activity. I'm not positive on the link between these ideas; perhaps an extremely long period of inactivity, associated with a very small likelihood under the exponential PDF, mitigates future events described by the Poisson distribution in some way.

  3. The PyMC3 implementation models the per-customer average purchase value, which seems to be an integral part of the model in virtually any situation, unless a given business sells only one product at a fixed price or variations in price are negligible.
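
To make the pooling idea in point (1) concrete, here is a minimal, hypothetical PyMC3 sketch of the Gamma-Poisson (negative binomial) piece only, with made-up purchase counts, no dropout/Pareto component, and none of the custom likelihoods used in the linked implementations:

import numpy as np
import pymc3 as pm

# Toy data: number of repeat purchases per customer over a fixed observation window
purchases = np.array([0, 2, 1, 5, 0, 3, 1, 0, 2, 4])

with pm.Model() as nbd_sketch:
    # Population-level Gamma hyperparameters shared by all customers (partial pooling)
    r = pm.Gamma("r", alpha=1.0, beta=1.0)
    s = pm.Gamma("s", alpha=1.0, beta=1.0)

    # Each customer's latent purchase rate is drawn from the shared Gamma
    lam = pm.Gamma("lam", alpha=r, beta=s, shape=len(purchases))

    # Purchase counts are Poisson given each customer's rate (while "active")
    y = pm.Poisson("y", mu=lam, observed=purchases)

    trace = pm.sample(1000, tune=1000, chains=2)

Marginalizing the per-customer rate out of this Gamma-Poisson pair gives the negative binomial purchase-count model; the Pareto/NBD papers add the latent dropout time on top of it, which is essentially what forces the custom likelihood in those implementations.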

My biggest points of confusion are (A) the link between inactivity and a customer never returning, (B) how parameters are pooled such that customers receive their own "parameters" but inter-parameter communication exists, (C) how a given dollar value, the customer's CLV, can be inferred by combining these elements, and (D) how/why a custom likelihood function is necessary to achieve this effect.

In this question, I’m soliciting answers that clarify any/all of the above.


Get this bounty!!!

#StackBounty: #regression #bayesian #optimization #gradient-descent #expectation-maximization Bayesian Regression- Expectation Maximiza…

Bounty: 50

In Bayesian regression, we have $y_i=x_i^{T}w+\epsilon_i$ where $w \sim \mathcal{N}(0,\alpha)$ and $\epsilon_i \sim \mathcal{N}(0,\frac{1}{\beta})$.
Inference of $\alpha$ and $\beta$ is done by maximizing the likelihood (or marginal likelihood) given the data. This paper (Appendix A.1) explains that the maximization can be done by expectation-maximization (EM).

My question is: why do we need EM to do the maximization? I understand that EM can be used by treating $w$ as the hidden variable. However, $w$ can be integrated out and we can apply gradient descent (GD). Why don't we use GD?
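
For concreteness, here is a sketch of the marginal likelihood being referred to, taking $\alpha$ as the prior variance of $w$ and $1/\beta$ as the noise variance (as written above), with the $x_i$ stacked into a design matrix $X$:
$$ p(\mathbf{y}\mid\alpha,\beta) = \mathcal{N}\big(\mathbf{y}\mid \mathbf{0},\, C\big), \qquad C = \alpha\, X X^{T} + \beta^{-1} I, $$
$$ \log p(\mathbf{y}\mid\alpha,\beta) = -\tfrac{1}{2}\log\lvert C\rvert - \tfrac{1}{2}\mathbf{y}^{T} C^{-1}\mathbf{y} - \tfrac{N}{2}\log 2\pi, $$
which has closed-form derivatives in $\alpha$ and $\beta$, so gradient ascent on it is at least possible in principle; whether it is preferable to EM is the question.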


Get this bounty!!!

#StackBounty: #bayesian #moments #density-estimation #maximum-entropy #invariance What determines the functional form of maximum entrop…

Bounty: 50

I'm familiar with the maximum entropy (ME) principle in statistical mechanics, where, for example, the Boltzmann distribution $p(\epsilon_i\mid\beta)$ is identified as the ME distribution constrained by normalizability and a given average energy $\langle E \rangle$, where the inverse temperature $\beta$ is a Lagrange multiplier.

E. T. Jaynes called this "no-data inference", i.e., from $p(\epsilon_i\mid\beta)$ we make a host of other predictions (e.g. calculate the average pressure $\langle P \rangle$) based only on the information ingested by ME, not on directly observed data.

But now suppose I have a data set of measured values $x_{1:N} \in \mathbb{R}^N$ and I want to determine a pdf $p(x\mid\cdot)$ that describes my knowledge about the $x_{1:N}$ using ME.

After I have determined an invariant measure $m(x)$ to describe my initial state of ignorance, one way to proceed would be moment density estimation. In other words, I could start calculating $I$ empirical moments of the $x_{1:N}$,
$$ \langle x^i \rangle, \quad i = 1, \dots, I, $$
use these to numerically determine the Lagrange multipliers $\lambda_{1:I}$, and finally end up with
$$ p(x\mid\lambda_{1:I}) = Z(\lambda_{1:I})^{-1}\, m(x)\, \exp\Big(\sum_{i=1}^I \lambda_i x^i \Big). $$
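
As an aside, here is a rough, self-contained sketch of that numerical step (recovering $\lambda_{1:I}$ from empirical moments) via the convex dual $\min_\lambda\,[\log Z(\lambda) - \lambda\cdot\hat{\mu}]$, assuming a uniform base measure on a bounded grid and made-up data; all names and settings are illustrative only:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = 10 * rng.beta(2, 5, size=1000)            # made-up data on [0, 10]
moments = np.array([data.mean(), (data ** 2).mean()])

grid = np.linspace(0.0, 10.0, 2001)
dx = grid[1] - grid[0]
feats = np.vstack([grid, grid ** 2])             # f_i(x) = x^i for i = 1, 2

def dual(lam):
    # log Z(lambda) - lambda . empirical moments, shifted for numerical stability
    logits = lam @ feats
    c = logits.max()
    logZ = c + np.log(np.exp(logits - c).sum() * dx)
    return logZ - lam @ moments

lam = minimize(dual, x0=np.zeros(2), method="Nelder-Mead").x

# Maximum-entropy density matching the two empirical moments (uniform m(x) on the grid)
p = np.exp(lam @ feats - (lam @ feats).max())
p /= p.sum() * dx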

Here's what's troubling me:

  • Since I actually have the $x_{1:N}$ at my disposal, I can calculate an empirical average of any function $f(x)$ over them. For example, $f(x) = x$ or $f(x) = \arctan(\log |x|)$.
  • The form of the function $f(x)$ directly determines the form of $p(x\mid\cdot)$, and consequently the sufficient statistic, through the constraint $\langle f(x) \rangle = \int f(x)\, p(x)\, dx$.
  • But what determines the functional form of $f(x)$? Here's what I think: in physics, it is presumably the measuring apparatus or "natural" expressions occurring frequently in physical theory (e.g. we talk about energy $E$ and not $\sqrt{E}$, so we supply $\langle E \rangle$ and not $\langle \sqrt{E} \rangle$). But what should we do when we actually have data, and we could choose any $f(x)$, and hence $p(x\mid\cdot)$, we like?
  • What does the invariant measure $m(x)$ actually accomplish in this regard? Suppose $x > 0$ and $y = x^2$. If I set up a ME pdf $p(x\mid\cdot)$ using $\langle x \rangle$ and $\langle x^2 \rangle$, is this equivalent to the ME pdf $p(y\mid\cdot)$ using $\langle \sqrt{y} \rangle$ and $\langle y \rangle$? (See the sketch after this list.)
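
A sketch of how the last point can be checked, assuming the measure is carried along under the change of variables $y = x^2$, $x > 0$ (so $|dx/dy| = \tfrac{1}{2} y^{-1/2}$): the constraints $\langle x \rangle, \langle x^2 \rangle$ and $\langle \sqrt{y} \rangle, \langle y \rangle$ are the same numbers, and the ME solution in $x$ transforms as
$$ p_Y(y\mid\cdot) = p_X\big(\sqrt{y}\mid\cdot\big)\,\Big|\frac{dx}{dy}\Big| = Z^{-1}\, m\big(\sqrt{y}\big)\,\Big|\frac{dx}{dy}\Big|\, \exp\!\big(\lambda_1 \sqrt{y} + \lambda_2\, y\big), $$
which is exactly the ME pdf in $y$ provided the measure is transformed as $m_Y(y) = m(\sqrt{y})\,|dx/dy|$. So the two constructions agree precisely when $m$ is treated as a measure (carried along by the change of variables) rather than re-chosen, e.g. as uniform, in the new variable.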


Get this bounty!!!

#StackBounty: #distributions #bayesian #gaussian-process #kernel-trick Meaning of "Distribution over functions" in Gaussian P…

Bounty: 50

I'm reviewing PyMC3's Gaussian Process documentation, and it has made me realize that I might have a flawed understanding of what "distribution over functions" actually means. Consider the code below:

import numpy as np
import pymc3 as pm

# A one dimensional column vector of inputs.
X = np.linspace(0, 1, 10)[:, None]

with pm.Model() as model:
    # Specify the covariance function.
    cov_func = pm.gp.cov.ExpQuad(1, ls=0.1)

    # Specify the GP.  The default mean function is `Zero`.
    gp = pm.gp.Marginal(cov_func=cov_func)

    # Place a GP prior over the function f.
    sigma = pm.HalfCauchy("sigma", beta=3)
    # y is assumed to be the vector of observed targets for X, defined elsewhere.
    y_ = gp.marginal_likelihood("y", X=X, y=y, noise=sigma)

...

# After fitting or sampling, specify the distribution
# at new points with .conditional
Xnew = np.linspace(-1, 2, 50)[:, None]

with model:
    fcond = gp.conditional("fcond", Xnew=Xnew)

As you can see, the (RBF) kernel hyperparameters are fixed; no priors were attached to either in pm.gp.cov.ExpQuad(1, ls=0.1). It was my understanding that Kernel(sigma_f=1, length=0.1) is a single function, not a distribution over functions.

But perhaps I'm thinking about it wrong: rather, a kernel with fixed hyperparameters generates a covariance function (often the mean function is assumed to be zero). When the mean and covariance functions are supplied as parameters to a multivariate Gaussian, which can be sampled from, each draw is a sample from the distribution over functions.
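
If that reading is right, the following minimal NumPy sketch (independent of PyMC3, with the RBF kernel hard-coded purely for illustration) is one way to picture it: fixing the hyperparameters fixes the covariance matrix, but every multivariate-normal draw from it is a different function.

import numpy as np

# Evaluation points and a fixed RBF kernel (unit variance, lengthscale 0.1)
X = np.linspace(0, 1, 10)[:, None]
ls = 0.1
K = np.exp(-0.5 * ((X - X.T) / ls) ** 2) + 1e-9 * np.eye(len(X))  # jitter for stability

# A zero mean plus this covariance defines the GP prior at the points X;
# each row drawn below is one sampled function evaluated at those points.
rng = np.random.default_rng(0)
f_samples = rng.multivariate_normal(np.zeros(len(X)), K, size=5)
print(f_samples.shape)  # (5, 10): five different functions from the same fixed kernel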

Any thoughts?


Get this bounty!!!

#StackBounty: #bayesian #autoencoders #kullback-leibler Why are we checking the difference between q(z|x), and p(z|x) in variational en…

Bounty: 50

I'm trying to understand how VAEs work, because I didn't understand how the cross entropy between $x$ (the input fed into the encoder) and $p(x\mid z)$ (the output of the decoder), minus the KL divergence between $q(z\mid x)$ and $p(z)$, results in the latent space being so continuous. Although I still don't understand that, the first step is to understand why we are calculating the loss as $KL(q(z\mid x)\,\|\,p(z\mid x))$, because, as I read, this is the first "idea"; but because $p(x)$ is intractable, we use the fact that maximizing the ELBO minimizes the KL, and break the ELBO down into an equation where we can calculate each piece.
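
For reference, the identity this paragraph circles around (writing $q$ for the encoder and $p$ for the decoder/model) is
$$ \log p(x) = \underbrace{\mathbb{E}_{q(z\mid x)}\big[\log p(x\mid z)\big] - KL\big(q(z\mid x)\,\|\,p(z)\big)}_{\text{ELBO}} + KL\big(q(z\mid x)\,\|\,p(z\mid x)\big), $$
so, because $\log p(x)$ does not depend on $q$, maximizing the ELBO is equivalent to pushing $q(z\mid x)$ towards the intractable true posterior $p(z\mid x)$.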

The first question in my mind was why we aren't comparing $p(x\mid z)$ and $q(x\mid z)$. At the end of the day we care about how well $x$ was reconstructed from $z$. The encoder produces $q(z\mid x)$, but where does $p(z\mid x)$ come from, and what do we know about it? We're calculating $p(x\mid z)$ with the decoder, not $p(z\mid x)$. Or does it mean that if we can transform $p(x\mid z)$ into $p(z\mid x)$, and the difference between $q(z\mid x)$ and $p(z\mid x)$ is small, then $p$ is predicting $x$ from $z$ accurately, as $p(x\mid z)$?

In this understanding, we're trying to build two networks mirroring each other. But looking at the current equations, I just don't understand how this performs well. We're comparing two distributions that may be similar to each other but, from a neural-network perspective, can still be wrong (poor reconstruction, or the wrong place in the latent space). So it shouldn't be enough to build networks that are exact opposites of each other. Our only pillar is the $x$ fed into the encoder; if it meant $q(x\mid z)$, it would make sense, but as I've read so far, the input fed into the encoder is simply $x$. On the other hand, if the input fed into the encoder were $q(x\mid z)$, it would show that a small difference between $q(z\mid x)$ and $p(z\mid x)$ doesn't mean $p$ is predicting $x$ from $z$ accurately, as $p(x\mid z)$, at least in the early stages of training.

So the question is: does comparing $q(z\mid x)$ with $p(z\mid x)$ show how effectively the encoder and decoder mirror each other? In my head they could mirror each other effectively while placing the latent variables without the continuity they actually exhibit, so how does the loss we're calculating ensure that? Also, regardless of the tractability of the breakdown, could comparing $q(x\mid z)$ with $p(x\mid z)$ measure the loss just as effectively as comparing $q(z\mid x)$ with $p(z\mid x)$?

Are my thoughts and intuition right, or am I missing something important?

Sorry for asking questions that may be stupid or obvious, but I don't have much background in probability; I have only built simple neural networks, and it is very hard to picture this from a programmer's perspective.


Get this bounty!!!

#StackBounty: #bayesian #python #mcmc #poisson-regression #marketing Poisson regression for market media modeling?

Bounty: 50

I put together a related question, linked for added context; it generally overviews a model proposed by Google in 2017 for (Marketing) Media Mix Modelling (MMM). The model extends the usual MMM premise in that it accounts for delay (advertising spend today might have its peak influential effect on customers 2-3 days later) and saturation (diminishing returns in the target variable after a certain amount of spend).

As detailed in that question, the model is extremely sensitive to variable scale; finding appropriate priors is a non-trivial task. In the best-case scenario, the effect of the inputs is minimized and the intercept is effectively predicted for all observations. At worst, the sampler fails to converge, or fails to compute gradients and quits.

In response, I've created an architecture of my own that has promising results. Namely, I still use the decay function Google proposed; however, I multiply its output per channel by 0.5^power, where power is an inferred per-channel parameter. This mitigates the effect of scale disparity. In PyMC3 code:

import arviz as az
import pymc3 as pm

# Assumes `train` is a DataFrame of per-channel spend (X is its array of values),
# `train_y` is the observed target series, and `geometric_adstock` is the
# delayed-decay (adstock) transform mentioned above, defined elsewhere.
with pm.Model() as train_model:
    #var,      dist, pm.name,          params,  shape
    p     = pm.Gamma('power',          2 , 2,   shape=X.shape[1]) # raises shrinkage to power

    alpha = pm.Beta('alpha',           3 , 3,   shape=X.shape[1]) # retain rate in adstock 
    theta = pm.Uniform('theta',        0 , 12,  shape=X.shape[1]) # delay in adstock

    tau    = pm.Normal('intercept',    0,  5                    ) # model intercept
    noise = pm.Gamma('noise',          3,  1                    ) # sd of observation noise about y
    
    
    computations = []
    for idx,col in enumerate(train.columns):
        delay = geometric_adstock(x=train[col].values, alpha=alpha[idx], theta=theta[idx], L=12)
        comp = 0.5**p[idx] * delay
        computations.append(comp)

    
    y_hat = pm.Normal('y_hat', mu= tau + sum(computations),
                  sigma=noise, 
                  observed=train_y)
    
    trace_train = pm.sample(chains=4)

And the trained model’s performance on unseen test data:

[Plot: trained model's predictions against unseen test data]

So far, I’ve just built up background context for my actual question.

This approach works well in this situation; however, I would like to find a way to capture saturation in this model by other means. Of note, the scale of my data is such that the inputs (marketing spend) are much higher than the output (new customers gained in that interval). Because my output is a count of events per interval, would Poisson regression be appropriate?

It makes little sense for my output to take on a negative value (it is very unlikely that a marketing campaign would cause customers to leave, though I suppose it's possible). Likewise, I think that Poisson regression has a saturation effect of sorts built into it, whereas linear regression inherently doesn't taper off expectations at higher input levels. (Could someone verify my thinking?)

Lastly, due to the 0.5^power transformation, all inputs are drawn towards the same scale as the target variable. I might not be able to infer a saturation rate per channel (as Google's approach tried to do), but once all channels are similarly scaled, a single saturation effect, captured by the Poisson likelihood, should work well regardless. (Is this line of thought reasonable?)

Finally, I have not implemented this yet, as I wanted to explore its theoretical justifications/shortcomings first. However, I'm wondering whether the exponential nature of the Poisson PDF and the power parameters would "fight" each other; it might produce bizarre curvature and tricky geometry for a gradient-based sampler. Any thoughts?
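
For what it's worth, here is a minimal, hypothetical sketch of the change being considered: the same 0.5^power shrinkage feeding a Poisson likelihood through an exp() link, which keeps the expected count positive and is where the implicit saturation/curvature would interact with the power parameters. The adstock transform is omitted and the data are made up, so this illustrates only the structure, not the question's actual model:

import numpy as np
import pymc3 as pm

rng = np.random.default_rng(0)
spend = rng.gamma(2.0, 50.0, size=(200, 3))   # stand-in for per-channel spend
y_obs = rng.poisson(20, size=200)             # stand-in for new-customer counts

with pm.Model() as poisson_sketch:
    p   = pm.Gamma('power', 2, 2, shape=spend.shape[1])  # per-channel shrinkage exponent
    tau = pm.Normal('intercept', 0, 5)

    # exp() link keeps the Poisson rate positive; no separate noise term is needed
    rate  = pm.math.exp(tau + pm.math.sum(0.5 ** p * spend, axis=1))
    y_hat = pm.Poisson('y_hat', mu=rate, observed=y_obs)

    trace = pm.sample(chains=4)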

Edit:

See the distribution of sales data below. Note there are only 200 examples in my dataset, so my data might be normally distributed with skewness by chance, or Poisson distributed and properly represented.

[Plot: distribution of the sales data]


Get this bounty!!!