Chapter 11 Propensity scores

The probability that an individual with covariates \(X=x\) is treated, \(\pi(x) := P(A=1 \mid X=x)\), is known as the propensity score. It has some very useful properties: first, it is a balancing score, which means that \(A\mathbin{\perp\hspace{-3.2mm}\perp}X\mid \pi(X)\), and so it can be used for covariate adjustment (see Section 13).

The term "the propensity score" is misleading, since the score is defined only relative to a particular set of covariates \(X\); if we considered a subset of these covariates, or some other information, then the relevant propensity score would generally be different. This fact is extremely useful for efficient adjustment; see Section 14.
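In practice \(\pi\) is unknown and must be estimated from data. As a minimal sketch (in Python, with simulated data and illustrative variable names that are not part of this chapter), a logistic regression of treatment on covariates is a common choice:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1_000
X = rng.normal(size=(n, 2))                         # two illustrative covariates
true_pi = 1 / (1 + np.exp(-(0.5 * X[:, 0] - X[:, 1])))
A = rng.binomial(1, true_pi)                        # treatment assignment

# Estimate pi(x) = P(A = 1 | X = x) by logistic regression
model = LogisticRegression().fit(X, A)
pi_hat = model.predict_proba(X)[:, 1]               # estimated propensity scores
```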

11.1 Balancing scores

A function \(b\) of the covariates \(X\) is said to be a balancing score if \(A\mathbin{\perp\hspace{-3.2mm}\perp}X\mid b(X)\). Any propensity score \(\pi(x)\) is a balancing score with respect to the covariates \(X\), so indeed if \(X\) is a sufficient adjustment set then \[\begin{align*} p(y \,|\,do(a)) &= \sum_{x} p(x) \cdot p(y \mid a, x)\\ &= \sum_{x} p(x) \cdot \frac{p(a\mid x)}{p(a\mid x)} \cdot p(y \mid a, x)\\ &= \sum_{x} \frac{p(x, a, y)}{p(a\mid x)}, \end{align*}\] where the last step uses \(p(x)\,p(a\mid x)\,p(y\mid a,x) = p(x,a,y)\). This suggests that we can compute expectations with respect to \(do(a)\) simply by estimating the propensity score and then reweighting the observations by \(\pi(x)^{-1}\) when \(A=1\) and \(\{1-\pi(x)\}^{-1}\) when \(A=0\). Such an approach gives a Horvitz-Thompson estimator; these estimators were originally developed for survey sampling with unequal inclusion probabilities, and the same reweighting idea is widely used for estimation under missing data.
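As an illustrative sanity check of the identity above (the joint distribution below is arbitrary and not from the text), we can verify numerically that \(\sum_x p(x)\,p(y\mid a,x)\) and \(\sum_x p(x,a,y)/p(a\mid x)\) agree:

```python
import numpy as np

# Arbitrary joint distribution p(x, a, y) over binary X, A, Y (entries sum to 1)
p = np.zeros((2, 2, 2))
p[0, 0] = [0.15, 0.05]   # p(x=0, a=0, y=0), p(x=0, a=0, y=1)
p[0, 1] = [0.05, 0.15]
p[1, 0] = [0.20, 0.10]
p[1, 1] = [0.10, 0.20]

p_x = p.sum(axis=(1, 2))                              # p(x)
p_a_given_x = p.sum(axis=2) / p_x[:, None]            # p(a | x)
p_y_given_ax = p / p.sum(axis=2, keepdims=True)       # p(y | a, x)

a = 1
adjustment = (p_x[:, None] * p_y_given_ax[:, a, :]).sum(axis=0)   # sum_x p(x) p(y | a, x)
reweighting = (p[:, a, :] / p_a_given_x[:, a, None]).sum(axis=0)  # sum_x p(x, a, y) / p(a | x)

print(np.allclose(adjustment, reweighting))           # True: both equal p(y | do(a))
```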

11.2 Horvitz-Thompson estimator

The Horvitz-Thompson estimator was originally developed for survey sampling with unequal inclusion probabilities (Horvitz and Thompson 1952), and the same reweighting idea is central to inference under missing data. The idea is that we can estimate the mean of the potential outcome under a treatment value \(A=a\) by taking the observations for which this was the factual treatment and reweighting them by the inverse of the propensity score. For \(a=1\), \[\begin{align} \hat\mu^1 &= \frac{1}{n} \sum_{i=1}^n \frac{A_i Y_i}{\pi(X_i)}; \tag{11.1} \end{align}\] to see that this is unbiased, note that \[\begin{align*} \mathbb{E}\left[\frac{AY}{\pi(X)} \right] &= \mathbb{E}\left[ \mathbb{E}\left[ \frac{AY(1)}{\pi(X)} \,\middle|\, Y(1), X\right] \right]\\ &= \mathbb{E}\left[\frac{Y(1)}{\pi(X)} \mathbb{E}[A\mid Y(1), X] \right]\\ &= \mathbb{E}\left[\frac{Y(1)}{\pi(X)} \mathbb{E}[A\mid X] \right]\\ &= \mathbb{E}\left[\frac{Y(1)}{\pi(X)} \, \pi(X) \right] = \mathbb{E}Y(1), \end{align*}\] and hence (11.1) is an unbiased estimator of \(\mathbb{E}Y(1)\) under the usual causal assumptions. In the above, the first equality uses consistency (so that \(AY = AY(1)\)) together with iterated expectations, the third uses causal sufficiency (\(A\mathbin{\perp\hspace{-3.2mm}\perp}Y(1)\mid X\)), the fourth the definition of \(\pi\), while the final one relies on positivity, so that we may divide by \(\pi(X)\). This illustrates why all three of these assumptions (consistency, causal sufficiency and positivity) are extremely important for causal inference.
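A minimal sketch of (11.1) in code, assuming the propensity scores have already been estimated (for instance as in the logistic-regression sketch above); the function names are illustrative rather than standard:

```python
import numpy as np

def ht_mean_treated(A, Y, pi_hat):
    """Horvitz-Thompson estimate of E[Y(1)], as in (11.1)."""
    A, Y, pi_hat = (np.asarray(v, dtype=float) for v in (A, Y, pi_hat))
    return np.mean(A * Y / pi_hat)

def ht_mean_untreated(A, Y, pi_hat):
    """Analogous estimate of E[Y(0)], reweighting the untreated by 1/(1 - pi_hat)."""
    A, Y, pi_hat = (np.asarray(v, dtype=float) for v in (A, Y, pi_hat))
    return np.mean((1 - A) * Y / (1 - pi_hat))

# An estimate of the average treatment effect is then the difference:
# ate_hat = ht_mean_treated(A, Y, pi_hat) - ht_mean_untreated(A, Y, pi_hat)
```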

One disadvantage of the Horvitz-Thompson estimator is that it is not guaranteed to respect the range of the outcome. In practice we must estimate the function \(\pi\), by \(\hat\pi\) say, and plug this into (11.1). Supposing that \(Y\) is binary, each non-zero term in the sum equals \(1/\hat\pi(X_i)\), which is strictly larger than 1; if \(\hat\pi(X_i)\) is very small for some unit, it is quite possible that this single term is large enough to drag the entire average outside the range \([0,1]\), within which the estimand must logically lie.
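A toy illustration of this failure (the numbers are invented for the example): with a single very small estimated propensity score, the average in (11.1) can exceed 1 even though every \(Y_i \in \{0,1\}\).

```python
import numpy as np

A      = np.array([1,   1,   1,    0,   0])
Y      = np.array([1,   1,   1,    0,   1])      # binary outcomes
pi_hat = np.array([0.5, 0.5, 0.02, 0.5, 0.5])    # one very small estimated score

mu1_hat = np.mean(A * Y / pi_hat)
print(mu1_hat)    # (2 + 2 + 50 + 0 + 0) / 5 = 10.8, well outside [0, 1]
```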

References

Horvitz, Daniel G., and Donovan J. Thompson. 1952. “A Generalization of Sampling Without Replacement from a Finite Universe.” Journal of the American Statistical Association 47 (260): 663–85.