I have a very specific question regarding how the causal tree in the causal forest/generalized random forest optimizes for heterogeneity in treatment effects.
This question comes from the Athey & Imbens (2016) paper “Recursive partitioning for heterogeneous causal effects” in PNAS. A related paper is Wager & Athey (2018), “Estimation and inference of heterogeneous treatment effects using random forests,” in JASA (also available on arXiv). I know that the answer to my question is in those papers, but unfortunately I can’t parse some of the equations to extract it. I know I understand an algorithm well when I can express it in words, so it has been irking me that I can’t do so here.
In my understanding, an honest causal tree is generally constructed by:
Given a dataset with an outcome $Y$, covariates $X$, and a randomized condition $W$ that takes on the value of 0 for control and 1 for treatment:
1. Split the data into subsample $I$ and subsample $J$.
2. Train a decision tree on subsample $I$ predicting $Y$ from $X$, with the requirement that each terminal node has at least $k$ observations from each condition in subsample $J$.
3. Apply the decision tree constructed on subsample $I$ to subsample $J$.
4. At each terminal node, get the mean of predictions for the $W = 1$ cases from subsample $J$ and subtract the mean of predictions for the $W = 0$ cases from subsample $J$; the resulting difference is the estimated treatment effect.
Any future, out-of-sample cases (such as those scored after deploying the model) are dropped down the tree and assigned the predicted treatment effect of the terminal node in which they land.
This is called “honest,” because the actual training and estimation are done on completely different data. Athey and colleagues have a nice asymptotic theory showing that you can derive variance estimates for these treatment effects, which is part of the motivation behind making them “honest.”
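To make my description concrete, here is a minimal Python sketch of the estimation part as I understand it. Everything in it is my own assumption rather than anything taken from the papers: the toy data-generating process, the names `simulate`, `leaf_of`, `honest_leaf_effects`, and `predict_effect`, and above all the single fixed split at 0.5, which merely stands in for a tree that would really be grown on subsample $I$.

```python
import random
from statistics import mean


def simulate(n, seed=0):
    """Toy randomized experiment (assumed): true effect is 2.0 if x > 0.5, else 0.0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.random()                     # single covariate X
        w = rng.randint(0, 1)                # randomized condition W (0 = control, 1 = treatment)
        tau = 2.0 if x > 0.5 else 0.0        # individual effect, never observed in real data
        y = x + w * tau + rng.gauss(0, 0.1)  # observed outcome Y
        rows.append((x, w, y))
    return rows


def leaf_of(x, threshold):
    """Assign a case to one of two terminal nodes of a one-split 'tree'."""
    return "right" if x > threshold else "left"


def honest_leaf_effects(est_rows, threshold):
    """Per leaf: mean outcome of treated minus mean outcome of control, using est_rows only."""
    effects = {}
    for leaf in ("left", "right"):
        in_leaf = [(w, y) for x, w, y in est_rows if leaf_of(x, threshold) == leaf]
        treated = [y for w, y in in_leaf if w == 1]
        control = [y for w, y in in_leaf if w == 0]
        effects[leaf] = mean(treated) - mean(control)
    return effects


rows = simulate(2000)
rows_I, rows_J = rows[:1000], rows[1000:]  # subsample I (grow the tree), subsample J (estimate)
threshold = 0.5                            # placeholder for a split that would be learned on rows_I
effects = honest_leaf_effects(rows_J, threshold)


def predict_effect(x):
    """Drop a new case down the 'tree' and return its leaf's estimated effect."""
    return effects[leaf_of(x, threshold)]


print(effects)
```

The point of the sketch is only that the leaf estimates come from `rows_J`, which played no role in choosing where to split.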
This is then extended to a causal forest by growing many such trees on bootstrapped or subsampled data and averaging their estimates (bagging).
Now, Athey & Imbens (2016) note that this procedure uses a modified mean squared error criterion for splitting, which rewards “a partition for finding strong heterogeneity in treatment effects and penalize[s] a partition that creates variance in leaf estimates” (p. 7357).
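As far as I can tell, the criterion has two pieces: a reward equal to the average squared leaf-level treatment effect estimate, and a penalty built from the within-leaf outcome variances of the treated and control cases. Below is a rough Python sketch of how I read it, scoring one candidate split at a time on the training subsample. The toy data, the function names (`simulate`, `leaf_stats`, `split_score`), the `min_per_arm` guard, and the exact form of the penalty are my assumptions and may not match the paper's equation.

```python
import random
from statistics import mean, variance


def simulate(n, seed=1):
    """Toy randomized experiment (assumed): true effect is 2.0 if x > 0.5, else 0.0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.random()
        w = rng.randint(0, 1)                # treated with probability p = 0.5
        tau = 2.0 if x > 0.5 else 0.0
        rows.append((x, w, x + w * tau + rng.gauss(0, 0.1)))
    return rows


def leaf_stats(rows):
    """Difference in mean outcomes plus the sample variance of Y in each arm."""
    treated = [y for _, w, y in rows if w == 1]
    control = [y for _, w, y in rows if w == 0]
    return mean(treated) - mean(control), variance(treated), variance(control)


def split_score(train_rows, threshold, n_est, p=0.5, min_per_arm=5):
    """Higher is better: heterogeneity reward minus leaf-variance penalty (my reading)."""
    n_tr = len(train_rows)
    leaves = [[r for r in train_rows if r[0] <= threshold],
              [r for r in train_rows if r[0] > threshold]]
    reward, penalty = 0.0, 0.0
    for leaf in leaves:
        n_treated = sum(1 for _, w, _ in leaf if w == 1)
        if min(n_treated, len(leaf) - n_treated) < min_per_arm:
            return float("-inf")             # reject splits with too few cases per condition
        tau_hat, var_t, var_c = leaf_stats(leaf)
        reward += len(leaf) * tau_hat ** 2   # sum of tau_hat^2 over the cases in this leaf
        penalty += var_t / p + var_c / (1 - p)
    return reward / n_tr - (1 / n_tr + 1 / n_est) * penalty


train_rows = simulate(4000)
candidates = [i / 10 for i in range(1, 10)]  # thresholds 0.1, 0.2, ..., 0.9
scores = {t: split_score(train_rows, t, n_est=4000) for t in candidates}
best = max(scores, key=scores.get)
print(best, scores[best])
```

In this toy setup the reward is largest when the split separates the cases with an effect from those without one, which is the sense in which I take the criterion to “reward heterogeneity.”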
My question is: Can you explain how this is the case, using words?
In the two sections preceding this quotation, Modifying Conventional CART for Treatment Effects and Modifying the Honest Approach, the authors use the Rubin causal model/potential outcomes framework to derive an estimator of the treatment effect.
They note that we are not trying to predict $Y$, as in most machine learning problems, but the difference between the expectation of $Y$ in the two conditions, given some covariates $X$. In line with the potential outcomes framework, evaluating such predictions directly is “infeasible”: we can only ever observe a person’s outcome in one of the two conditions, so the individual treatment effect itself is never observed.
In a series of equations, they show how we can use a modified splitting criterion that predicts the treatment effect. They say: “…the treatment effect analog is infeasible, but we can use an unbiased estimate of it, which leads to…” (p. 7357) and they show the equation for it using observed data. As someone who has a background in social science and applied statistics, I can’t connect the dots between what they have set up and how we can estimate it from the data.
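For reference, this is my best attempt at transcribing the step in question; the simplified notation is mine and I may well be misreading it. Write $\tau_i$ for the unobserved individual treatment effect, $\Pi$ for a partition (the tree's leaves $\ell$), $\hat\tau(X_i;\Pi)$ for the estimated effect of the leaf containing case $i$, and $p$ for the share of treated cases. The infeasible criterion appears to be

$$\text{MSE}_\tau(\Pi) = \frac{1}{N}\sum_{i}\left\{\left(\tau_i - \hat\tau(X_i;\Pi)\right)^2 - \tau_i^2\right\} = \frac{1}{N}\sum_{i}\left\{\hat\tau(X_i;\Pi)^2 - 2\,\tau_i\,\hat\tau(X_i;\Pi)\right\}.$$

Because $\hat\tau$ is constant within a leaf and, under randomization, is an (approximately) unbiased estimate of the leaf average of $\tau_i$, the cross term can be replaced using observed data:

$$\frac{1}{N}\sum_{i}\tau_i\,\hat\tau(X_i;\Pi) \approx \frac{1}{N}\sum_{i}\hat\tau(X_i;\Pi)^2 \quad\Longrightarrow\quad -\widehat{\text{MSE}}_\tau(\Pi) = \frac{1}{N}\sum_{i}\hat\tau(X_i;\Pi)^2.$$

The honest version then seems to subtract a penalty for noisy leaf estimates, with $N^{tr}$ and $N^{est}$ the training and estimation sample sizes and $S^2_{treat}(\ell)$, $S^2_{control}(\ell)$ the within-leaf sample variances of $Y$:

$$-\widehat{\text{EMSE}}_\tau(\Pi) = \frac{1}{N^{tr}}\sum_{i \in S^{tr}}\hat\tau(X_i;\Pi)^2 - \left(\frac{1}{N^{tr}} + \frac{1}{N^{est}}\right)\sum_{\ell \in \Pi}\left(\frac{S^2_{treat}(\ell)}{p} + \frac{S^2_{control}(\ell)}{1-p}\right).$$

If that is right, the first term is an average of squared leaf effects, and since the leaf-size-weighted average of $\hat\tau$ stays roughly equal to the overall average effect whatever the partition, making it large would amount to making the variance of $\hat\tau$ across leaves large.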
Any help in explaining how this criterion maximizes the variance in treatment effects (i.e., the heterogeneity of causal effects), or any correction to my description of how to build a causal tree that might be causing my confusion, would be greatly appreciated.