# StackBounty: #probability #modeling #fitting #mixture Categorical mixture model when mixture components are not PDFs (don't sum to…

Bounty: 50

I constructed a model that behaves the way I want, it successfully recovers parameters from simulated data, etc. However, I get the feeling that I re-invented the wheel, so to speak – surely someone has come across this problem before, solved it, there is someone I can cite, some name for the technique, some better way to do it, etc.

I have observations $Y=\{y_{is}\}$, where $s \in \{1,\ldots,S\}$ indicates a particular site, and $i$ indexes observations within site $s$. Each $y_{is}$ takes one of $C$ possible labels: $y_{is} \in \{1,\ldots,C\}$.

The probability that $y_{is}=c$ is influenced by $K$ different categorical predictors, where each predictor $k$ gives, for each site $s$, a distribution over the $C$ labels, i.e. $\theta_{k,s}=(\theta_{k,s,1},\ldots,\theta_{k,s,C})$ is a probability distribution over the labels at site $s$. All $\theta$ are known; the only unknown is how likely it is that $y_{is}$ was drawn from $\theta_k$.

At this point, it sounds like a typical mixture distribution, in which $\alpha_k$ is the mixture proportion (i.e., the probability that you draw from $\theta_k$):

$$
P(y_{is}=c\mid\Theta) = \sum_{k=1}^K \alpha_k\,\theta_{k,s,c}
$$

$$\sum_{k=1}^K\alpha_k=1$$

However, for a mixture distribution to work, each $\theta_{k,s}$ must be a proper probability distribution, with $\sum_{c=1}^C\theta_{k,s,c}=1$. In my case $\theta_{k,s}$ is not a real probability distribution; instead $\sum_{c=1}^C\theta_{k,s,c} \in [0,1]$. Since $y_{is}\in\{1,\ldots,C\}$ but it is possible that $\sum_{c=1}^C P(y_{is}=c\mid\Theta)<1$, this model clearly does not work.

EDIT #2 (updated the model specification, following comments):

The model I have come up with that works as intended is to create a normalized version of $\theta$, $\phi_{k,s,c}=\frac{\theta_{k,s,c}}{\sum_{c'=1}^C\theta_{k,s,c'}}$, and a new quantity $P(k \mid s)=\sum_{c=1}^C\theta_{k,s,c}$. Notice that since $\theta_{k,s,c}=P(k \mid s)\,\phi_{k,s,c}$, it is now possible to formulate my problem as a sort of mixture model:

$$
P(y_{is}=c\mid\Theta) = \sum_{k=1}^K \alpha_{k,s}\,\phi_{k,s,c}
$$

$$\sum_{c=1}^C\phi_{k,s,c}=1$$
$$
\alpha_{k,s}=\frac{\beta_k\,P(k \mid s)}{\sum_{k'=1}^K \beta_{k'}\,P(k' \mid s)}
$$

$$P(k \mid s) \in [0,1]$$

In this model, all $\phi$ and $P(k \mid s)$ are known, and we seek to find the parameters $(\beta_1,\ldots,\beta_K)$. To fit the model, we can set $\beta_1=1$ and fit the remaining $\beta$s using MLE or grid search.
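For concreteness, here is a minimal sketch of that fitting procedure, assuming NumPy and SciPy are available; the simulated data, the optimizer choice, and the $\theta$ values (borrowed from the toy example later in this post) are illustrative, not part of the model itself. It fixes $\beta_1=1$ and maximizes the likelihood over the remaining $\beta$s on the log scale:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

K, S, C = 2, 2, 3                      # components (pencil, pen), sites (wall, table), labels
# Known normalized components phi[k, s, :] and masses P(k | s), from the toy example.
phi = np.array([[[0.2, 0.4, 0.4],      # pencil at wall
                 [0.0, 0.0, 1.0]],     # pencil at table (theta renormalized)
                [[0.8, 0.1, 0.1],      # pen at wall
                 [0.3, 0.3, 0.4]]])    # pen at table
p_k_s = np.array([[1.0, 0.1],          # P(pencil | wall), P(pencil | table)
                  [1.0, 1.0]])         # P(pen | wall),    P(pen | table)

def label_probs(betas):
    """P(y = c | site s) for given beta weights; returns shape (S, C)."""
    w = betas[:, None] * p_k_s                      # unnormalized alpha_{k,s}
    alpha = w / w.sum(axis=0, keepdims=True)        # mixture proportions per site
    return np.einsum('ks,ksc->sc', alpha, phi)

# Simulate data from true betas (only ratios are identifiable; fix beta_pencil = 1).
true_betas = np.array([1.0, 0.5])
n_per_site = 20000
p_true = label_probs(true_betas)
y = np.concatenate([rng.choice(C, size=n_per_site, p=p_true[s]) for s in range(S)])
site = np.repeat(np.arange(S), n_per_site)

def neg_log_lik(free):
    # beta_1 fixed at 1; remaining betas parameterized on the log scale to stay positive.
    betas = np.concatenate(([1.0], np.exp(free)))
    p = label_probs(betas)
    return -np.log(p[site, y]).sum()

fit = minimize(neg_log_lik, x0=np.zeros(K - 1))
beta_hat = np.exp(fit.x)               # estimated beta_pen (relative to beta_pencil = 1)
```

The log-scale parameterization handles the positivity constraint on the $\beta$s automatically, and only ratios of $\beta$s are estimated, which sidesteps the scale non-identifiability.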

What is different about this model, compared to the multinomial mixture models I’ve seen before, is that the mixture proportions $\alpha_{k,s}$ come from a combination of a known quantity ($P(k \mid s)$) and an unknown parameter ($\beta_k$). So, really, I seek to fit a hyperparameter on the mixture proportion, rather than the mixture proportion itself.

Does this type of model have a name/literature behind it? Alternatively, can my problem be solved using some other technique (e.g. some sort of Dirichlet-multinomial regression or something…) that is citeable/has been well-characterized/etc.?

EDIT #3: Here is an alternate model: the model itself is very typical, but the manner in which I would have to fit it is atypical. As far as I can tell, it yields exactly the same information as the model in EDIT #2.

Let:
$$D \supset C$$
$$D \setminus C=\{\text{other}\}$$
$$\phi_{k,s}=\left(\theta_{k,s,1},\ldots,\theta_{k,s,C},\;1-\sum_{c=1}^C\theta_{k,s,c}\right)$$

Note that we’ve constructed a proper probability distribution $\phi$ by adding a missing category, "other", to $C$. Now let:

$$Z = \{z_{js}\}$$
$$P(z_{js} = d \mid \Theta)=\sum_{k=1}^K\alpha_k\,\phi_{k,s,d}$$
$$Y = \{z \mid z \in Z,\ z \neq \text{other}\}$$

Then:
$$P(y_{is} = c \mid \Theta) = \frac{P(z_{js} = c \mid \Theta)}{1-P(z_{js} = \text{other} \mid \Theta)}$$

In other words, $Y$ is constructed by making some unknown number of draws from $D$ and removing all the draws with the value "other"; alternatively, each $y$ is the result of drawing repeatedly from $D$ until you get a value besides "other". In this model, we know each $\phi$ and we know $Y$ (but not $Z$), and we want to find the $\alpha$s knowing just this subset of the data.

Notice that in the Edit #2 model there are infinitely many fits, so long as the ratios of the $\beta$s are held constant. When those $\beta$s are normalized to sum to 1, they equal the $\alpha$s in the Edit #3 model.
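This claimed equivalence is easy to check numerically. The sketch below (assuming NumPy; the $\theta$ values are the ones from the toy example that follows) implements both parameterizations side by side:

```python
import numpy as np

# theta[k, s, :] for k in {pencil, pen} and s in {wall, table} (toy example values).
theta = np.array([[[0.2, 0.4, 0.4], [0.0, 0.0, 0.1]],
                  [[0.8, 0.1, 0.1], [0.3, 0.3, 0.4]]])
mass = theta.sum(axis=2)              # P(k | s) = sum_c theta_{k,s,c}
phi = theta / mass[..., None]         # normalized components (all masses are > 0 here)

def edit2(betas):
    """Edit #2: mix the normalized phi with alpha_{k,s} proportional to beta_k * P(k|s)."""
    w = betas[:, None] * mass
    alpha = w / w.sum(axis=0)
    return np.einsum('ks,ksc->sc', alpha, phi)         # P(y = c | s), shape (S, C)

def edit3(alphas):
    """Edit #3: mix the raw theta plus an implicit 'other' cell, condition on y != other."""
    p_c = np.einsum('k,ksc->sc', alphas, theta)        # P(z = c) for c in C
    p_other = np.einsum('k,ks->s', alphas, 1 - mass)   # P(z = other)
    return p_c / (1 - p_other)[:, None]
```

Multiplying all the $\beta$s by a constant leaves `edit2` unchanged, and normalizing them to sum to 1 and passing them to `edit3` reproduces the same site-level label distributions.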

Here’s a toy example with numbers:

Let:
$$C = \{\text{red},\text{green},\text{blue}\}$$
$$K = \{\text{pencil},\text{pen}\}$$
$$S = \{\text{wall}, \text{table}\}$$
$$
\theta_{\text{wall},\text{pencil}}=[0.2,0.4,0.4],\quad
\theta_{\text{wall},\text{pen}}=[0.8,0.1,0.1],
$$

$$
\theta_{\text{table},\text{pencil}}=[0,0,0.1],\quad
\theta_{\text{table},\text{pen}}=[0.3,0.3,0.4]
$$

Notice that $\theta_{\text{table},\text{pencil}}$ does not sum to 1; imagine that e.g. $\theta_{\text{table},\text{pencil},\text{orange}}=0.9$, but orange is not in $C$. In my desired model, when the "mixture-like" parameters for $\theta$ are equal (i.e. when $\beta_{\text{pencil}}=\beta_{\text{pen}}$), I want the distribution of labels at $\text{wall}$ to be $[0.5,0.25,0.25]$, and the distribution at $\text{table}$ to be $[3/11,3/11,5/11]\approx[0.2727,0.2727,0.4545]$. If $\beta_{\text{pencil}}=2\beta_{\text{pen}}$, then I want the distribution at $\text{wall}$ to be $[0.4,0.3,0.3]$, and the distribution at $\text{table}$ to be $[0.25,0.25,0.5]$.

Using the model in Edit #2, at $s=\text{table}$ we get:

$$P(y_{is}=\text{blue} \mid \beta_{\text{pencil}}=2/3,\,\beta_{\text{pen}}=1/3)=\frac{(2/3)(0.1)}{(2/3)(0.1)+(1/3)(1)}(1)+\frac{(1/3)(1)}{(2/3)(0.1)+(1/3)(1)}(0.4)=0.5$$

Using the model in Edit #3, we can get the same result:

$$P(y_{is}=\text{blue} \mid \alpha_{\text{pencil}}=2/3,\,\alpha_{\text{pen}}=1/3)=\frac{(2/3)(0.1)+(1/3)(0.4)}{1-\big((2/3)(0.9)+(1/3)(0)\big)}=0.5$$
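As a quick arithmetic check, a few lines of Python (plain floats, no dependencies) confirm that both expressions evaluate to the same number:

```python
# Worked example at s = table: beta_pencil = 2/3, beta_pen = 1/3.
b_pencil, b_pen = 2 / 3, 1 / 3

# Edit #2: weight each normalized component by alpha_{k,table} ∝ beta_k * P(k | table);
# phi_{pencil,table,blue} = 1 and phi_{pen,table,blue} = 0.4.
denom = b_pencil * 0.1 + b_pen * 1.0
p_edit2 = (b_pencil * 0.1 / denom) * 1.0 + (b_pen * 1.0 / denom) * 0.4

# Edit #3: mix the raw thetas, then condition on z != "other".
p_edit3 = (b_pencil * 0.1 + b_pen * 0.4) / (1 - (b_pencil * 0.9 + b_pen * 0.0))
```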

Edit #2 is a non-standard model because it’s a mixture model where I’m fitting a hyperparameter on the mixture proportion instead of the mixture proportion itself; it’s also unusual because multiplying every $\beta$ by the same constant yields an identical model (so there are infinitely many fits, and you have to either fix the value of one $\beta$ term or care only about the ratios of the $\beta$s).

Edit #3 is a non-standard model because I’m fitting a typical mixture model, but the mixture model generates a superset of the data that I’m fitting it to.

For my application, I think both models are theoretically defensible. And, as far as I can tell, the models are equivalent in that, given some data $Y$, they both recover the same parameters (that is, $\beta$ in Edit #2 and $\alpha$ in Edit #3 are the same once normalized). However, I don’t know whether one model will be easier to fit, or whether e.g. it might be easier to estimate the standard errors on the coefficients of one vs. the other model. Is there a good reason to use the model in Edit #2 vs. Edit #3?


