*Bounty: 50*

Suppose I have some observations $(x_i,y_i)$ from some population

```
x y
1 1
1 0
nan 1
nan 0
...
4 1
```

I would like to build a model to predict $y \mid x$.

Assume all cases have $y$ observed, and that I remove every case with missing $x$ (say, 50% of the data).

We then have a predictive model $m$ for $y \mid x, \text{$x$ not missing}$. This could be extremely useful. For example, suppose that we are trying to predict some disease; any time someone presents with $x$ not missing, we can use our model. It’s too bad that we cannot say anything for those with $x$ missing, but we have generally improved the world for the subpopulation for which $x$ is collected.
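To make the restricted scope of $m$ concrete, here is a minimal sketch of complete-case analysis. The toy data and the trivial per-value frequency "model" are my own illustration, not part of the question:

```python
# Toy data mirroring the table above; None marks missing x.
data = [(1, 1), (1, 0), (None, 1), (None, 0), (2, 1), (2, 1), (4, 1), (4, 0)]

# Complete-case analysis: keep only rows where x is observed.
complete = [(x, y) for x, y in data if x is not None]

def fit(rows):
    """Deliberately simple model m: estimate P(y=1 | x) per observed x value."""
    counts = {}
    for x, y in rows:
        n, s = counts.get(x, (0, 0))
        counts[x] = (n + 1, s + y)
    return {x: s / n for x, (n, s) in counts.items()}

m = fit(complete)

def predict(model, x):
    # m is only defined on the subpopulation with x observed;
    # being honest about that means refusing to predict otherwise.
    if x is None:
        raise ValueError("m does not apply to cases with missing x")
    return model[x]
```

The point of the explicit `ValueError` is exactly the honesty argument: the model refuses, rather than silently extrapolates, when $x$ is missing.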

If, however, we remove cases with missing $x$ and then (1) use $m$ on cases with missing $x$, or (2) make some statement about the whole population based on our estimated coefficients, this is clearly not correct. For (1), we would be applying the model to a population different from the one on which it was trained. For (2), we would be ignoring the bias we may have introduced by removing cases with missing $x$.

I think it is especially for the second reason that removing missing data gets a bad rap. However, using $m$ as originally described, *if you are honest about the model not applying to cases with missing $x$*, seems like a good idea (although not the *best* idea, which would be to impute and obtain a model $m'$ for the whole population), and it is not incorrect in the way that (1) or (2) are. In this sense, does removing missing data introduce not ‘bias’ but rather constraints on the usability of the model?
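For contrast, the impute-then-fit alternative mentioned above might look like the following sketch. The mean-imputation scheme is my own (crude) choice for illustration; any serious analysis would consider the missingness mechanism before picking one:

```python
# Toy data as before; None marks missing x.
data = [(1, 1), (1, 0), (None, 1), (None, 0), (2, 1), (2, 1), (4, 1), (4, 0)]

# Mean imputation: replace each missing x with the mean of the observed x.
observed = [x for x, _ in data if x is not None]
x_mean = sum(observed) / len(observed)

imputed = [(x if x is not None else x_mean, y) for x, y in data]
# A model m' fit on `imputed` covers the whole population, at the cost of
# whatever bias the imputation scheme itself introduces.
```

Unlike the complete-case model $m$, a model fit on `imputed` can be applied to every case, which is the trade-off the question describes.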