*Bounty: 50*

I have a binary dependent variable $t$ and categorical features. The problem can even be simplified to binary features, since I can one-hot encode the categorical variables.

The purpose is to estimate the probability of $t=1$.

In principle I can use a logistic regression.

However, because the inputs are categorical, the $D$ binary features define a table of $2^D$ cells. So I could instead simply estimate the proportion of $t=1$ samples in each cell.
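To make the table-based estimate concrete, here is a minimal sketch of what I have in mind. The function name `cell_proportions` and the toy data are made up for illustration; each distinct combination of feature values is treated as one cell.

```python
from collections import defaultdict

def cell_proportions(X, t):
    """Estimate P(t=1 | cell) as the fraction of t=1 samples in each cell.

    X: list of tuples of 0/1 features (each distinct tuple is one cell).
    t: list of 0/1 labels, same length as X.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [n_samples, n_positive]
    for x, label in zip(X, t):
        cell = tuple(x)
        counts[cell][0] += 1
        counts[cell][1] += label
    # Only cells that were actually observed get an estimate.
    return {cell: pos / n for cell, (n, pos) in counts.items()}

# Toy data with D = 2 binary features (hypothetical, for illustration only).
X = [(0, 0), (0, 0), (0, 1), (1, 0), (1, 0), (1, 0), (1, 1), (1, 1)]
t = [0, 1, 1, 0, 0, 1, 1, 1]
props = cell_proportions(X, t)
```

Cells with no samples simply do not appear in the result, which is the empty-cell problem in miniature.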

I think this should be similar to logistic regression in that both approaches assume a binomial likelihood. However, logistic regression assumes that the log odds are a linear function of the input variables, which the table-based (density estimation) procedure does not assume. I think this assumption is not critical here, given the binary nature of the inputs.
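To spell out the linearity assumption for the simplest case $D=2$ (a sketch in my own notation, where $x_1, x_2 \in \{0,1\}$ are the features and the $\beta$'s are the regression coefficients), the usual main-effects logistic regression is

$$\log\frac{P(t=1\mid x)}{1-P(t=1\mid x)} = \beta_0 + \beta_1 x_1 + \beta_2 x_2,$$

which has 3 parameters for 4 cells, whereas adding the interaction term

$$\log\frac{P(t=1\mid x)}{1-P(t=1\mid x)} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2$$

gives one free parameter per cell. As far as I can tell, this second model can represent any set of cell probabilities strictly between 0 and 1, while the main-effects model in general has only $D+1$ parameters against $2^D$ cells.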

So, the question is, are the two approaches different? If yes, in what aspect are they different?

One difference, of course, is that logistic regression is fitted iteratively, so in some cases there might be convergence issues.

One would be tempted to say that as $D$ increases, many cells in the table will be empty or nearly empty. But I think logistic regression would suffer in those cases as well.

As additional questions (connected to the first one): is there anything wrong with my line of reasoning? Which of the two approaches should perform better?