I ran a Monte Carlo simulation of $10^5$ logistic regressions (GLM with logit link) in R. The setup was:
- The outcomes ($y$) were repeatedly sampled from a Bernoulli distribution ($N=1000$)
- The single explanatory variable ($x$) was sampled independently of $y$ from a half-normal distribution ($x \geq 0$)
- I then calculated the accuracy of the predictions (with cutoff 0.5) as (true positives + true negatives) / $N$
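
For concreteness, the setup above can be sketched roughly as follows. This is a minimal sketch and not my exact script: it assumes a Bernoulli success probability of `p = 0.5`, a standard half-normal for `x`, and accuracy computed in-sample on the same data used for fitting. The names `n_sims`, `p` and `one_run` are illustrative, and `n_sims` is kept far below $10^5$ so that it runs quickly.

```r
set.seed(1)

n_sims <- 200   # illustrative; the actual simulation used 10^5 replications
N      <- 1000  # sample size per regression
p      <- 0.5   # ASSUMPTION: Bernoulli success probability

one_run <- function() {
  y <- rbinom(N, size = 1, prob = p)  # Bernoulli outcomes
  x <- abs(rnorm(N))                  # half-normal predictor, independent of y
  fit  <- glm(y ~ x, family = binomial(link = "logit"))
  pred <- as.integer(fitted(fit) > 0.5)  # classify with cutoff 0.5
  mean(pred == y)                        # (TP + TN) / N, in-sample
}

acc <- replicate(n_sims, one_run())
mean(acc)    # average accuracy across replications
median(acc)  # median accuracy across replications
```
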
Naïvely, I was expecting a mean/median accuracy of 0.5, but that is not what I found: the average accuracy was around 0.515 (51.5%). Is there a good intuition or theoretical result for this?