
I am wondering what it means, within the context of causal inference, to “non-parametrically” identify a causal effect under the super-population perspective. For example, consider the Hernán/Robins *Causal Inference* book draft:

https://cdn1.sph.harvard.edu/wp-content/uploads/sites/1268/2019/02/hernanrobins_v1.10.38.pdf

It defines non-parametric identification on pg. 43 and 123 as:

> …identification that does not require any modeling assumptions when the size of the study population is quasi-infinite. By acting as if we could obtain an unlimited number of individuals for our studies, we could ignore random fluctuations and could focus our attention on systematic biases due to confounding, selection, and measurement. Statisticians have a name for problems in which we can assume the size of the study population is effectively infinite: identification problems.

I understand the **identification** part to mean that, under the strong ignorability assumption, the observed data correspond to exactly ONE value of the causal effect estimand. What confuses me is why we need to assume the size of the study population is quasi-infinite.
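To make concrete what I understand identification to mean, here is a toy sketch (the proportions are my own made-up numbers, not from the book). If the super-population is quasi-infinite, the joint proportions of a covariate `L`, treatment `A`, and outcome `Y` are exact population quantities, and standardization pins down the counterfactual mean with no statistical estimation involved:

```python
# Hypothetical toy super-population (my own construction, not the book's example).
# With a quasi-infinite population, the joint proportions P(L, A, Y) are known
# exactly, so E[Y^a] = sum_l E[Y | A=a, L=l] P(L=l) is a deterministic computation.

# joint distribution P(L, A, Y) as exact proportions; keys are (L, A, Y)
p = {
    (0, 0, 0): 0.20, (0, 0, 1): 0.10,
    (0, 1, 0): 0.05, (0, 1, 1): 0.15,
    (1, 0, 0): 0.10, (1, 0, 1): 0.10,
    (1, 1, 0): 0.10, (1, 1, 1): 0.20,
}

def marginal_L(l):
    # P(L = l)
    return sum(v for (li, a, y), v in p.items() if li == l)

def cond_mean_Y(a, l):
    # E[Y | A = a, L = l], an exact population ratio, not a sample estimate
    num = sum(v for (li, ai, y), v in p.items() if li == l and ai == a and y == 1)
    den = sum(v for (li, ai, y), v in p.items() if li == l and ai == a)
    return num / den

def standardized_mean(a):
    # standardization formula, valid under exchangeability given L
    return sum(cond_mean_Y(a, l) * marginal_L(l) for l in (0, 1))

ate = standardized_mean(1) - standardized_mean(0)
print(ate)
```

The way I read it, this computation is exact precisely because the proportions are population quantities rather than sample frequencies, which seems to be what “ignore random fluctuations” is pointing at.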

For example, the book gives an example of a 20-person study in which **each** subject is representative of 1 billion identical subjects, so that the hypothetical super-population consists of 20 billion people. Specifically, on pg. 13 it states that:

> … we will assume that counterfactual outcomes are deterministic and that we have recorded data on every subject in a very large (perhaps hypothetical) super-population. This is equivalent to viewing our population of 20 subjects as a population of 20 billion subjects in which 1 billion subjects are identical to the 1st subject, 1 billion subjects are identical to the 2nd subject, and so on.

My confusion here is what it means to assume a single person is representative of 1 billion identical individuals. Are the 1 billion identical only with respect to their outcomes and treatment, while differing in their covariates? Or is the observed individual a summary measure of the 1 billion? My instinct is that the 1 billion copies guarantee we can draw as many individuals as we like without ever running out of samples; i.e., small sample sizes produce more unstable estimates.

Essentially, what is so crucial about assuming there are many identical individuals in the “background”, if they are just going to be the same as a patient you observe? What happens or breaks down if instead of the 1 billion, we only had 2 identical individuals?
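Here is a small simulation of my intuition (my own hypothetical numbers, not the book's): when we draw ever-larger studies from a super-population made of clones of 20 subjects with deterministic counterfactual outcomes, the random fluctuation in the estimated average treatment effect shrinks toward zero. With only 2 copies of each subject the whole population is 40 people, so the study size is capped and this limit can never be taken:

```python
# Sketch of my intuition (hypothetical setup, not from the book): 20 subjects
# with deterministic counterfactual outcomes (Y^0, Y^1); the super-population
# consists of clones of them. Larger studies drawn from it fluctuate less.
import random
import statistics

random.seed(1)

# 20 toy subjects; y0 = outcome under control, y1 = outcome under treatment
subjects = [(i % 2, 0 if i % 3 == 0 else 1) for i in range(20)]
true_ate = sum(y1 - y0 for y0, y1 in subjects) / len(subjects)

def estimate_ate(n):
    """Draw n individuals from the quasi-infinite clone population
    (sampling with replacement), randomize treatment, difference the means."""
    treated, control = [], []
    for _ in range(n):
        y0, y1 = random.choice(subjects)  # a clone of a randomly chosen subject
        if random.random() < 0.5:
            treated.append(y1)            # we observe Y^1 in the treated arm
        else:
            control.append(y0)            # we observe Y^0 in the control arm
    if not treated or not control:
        return None                       # an arm can be empty in tiny studies
    return sum(treated) / len(treated) - sum(control) / len(control)

for n in (20, 200, 20000):
    ests = [e for e in (estimate_ate(n) for _ in range(200)) if e is not None]
    print(n, round(statistics.pstdev(ests), 4))
```

The spread of the estimates shrinks roughly like 1/sqrt(n), which is why, as I understand it, a quasi-infinite background of identical copies lets us push n up until random fluctuation is negligible, whereas with only 2 copies per subject there is an irreducible fluctuation floor.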

Thank you for any insight.
