Chapter 12 Doubly Robust Estimation

We have seen that there is effectively a duality between the outcome regression and the propensity score: we can estimate either of these, and use the estimate to obtain a consistent and asymptotically normal estimator of the causal quantity of interest. An obvious question then arises: is there an estimator that uses both, and is guaranteed to be consistent provided that at least one of the two models is correctly specified?

The answer turns out to be yes!

12.1 Estimating equations

An estimating equation consists of a score—that is, a function of the data and parameters—averaged over independent and identically distributed observations. A score has expectation zero at the true value of the parameter (and not otherwise); thus, finding a zero of the empirical estimating equation gives an approximate zero of the true underlying equation, and hence the solution is a consistent estimator of the true parameter. Further, under some smoothness conditions the estimator is also asymptotically normal, with a standard error determined by the derivative and the variance of the score at the true value.
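In symbols, writing \(Z\) for a generic observation and \(\psi(\theta; Z)\) for the score (notation used only for this summary), the estimator \(\hat\theta\) solves \[\begin{align*} \frac{1}{n}\sum_{i=1}^n \psi(\hat\theta; Z_i) = 0, \end{align*}\] and under standard regularity conditions \[\begin{align*} \sqrt{n}\,(\hat\theta - \theta_0) \to^d N\bigl(0,\; M^{-1} \operatorname{Var}\{\psi(\theta_0; Z)\}\, M^{-\top}\bigr), \qquad M = \mathbb{E}\left[-\frac{\partial \psi(\theta; Z)}{\partial \theta}\bigg|_{\theta=\theta_0}\right], \end{align*}\] which is the familiar sandwich formula from M-estimation theory.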

The most common example of a score is the derivative of the log-likelihood; we know that this is zero at the true parameter, and we consider its sum over our data. As another example, in Chapter 11 we saw the Horvitz-Thompson estimator, which uses the score \[\begin{align*} \psi_1(\theta, \pi; {\boldsymbol X},A,Y) = \frac{YA}{\pi({\boldsymbol X})} - \theta, \end{align*}\] where \(\theta\) is the expectation of \(Y(1)\). In this case note that we have a nuisance function \(\pi\) that needs to be estimated. We will still obtain a consistent estimator for \(\theta\) provided that we use a consistent estimator of \(\pi\).
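Setting the empirical average of \(\psi_1\) to zero and solving for \(\theta\) makes the connection to Chapter 11 explicit: the solution is \[\begin{align*} \hat\theta = \frac{1}{n}\sum_{i=1}^n \frac{A_i Y_i}{\hat\pi({\boldsymbol X}_i)}, \end{align*}\] which is exactly the Horvitz-Thompson (inverse probability weighted) estimator.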

We can add any function of the form \(g({\boldsymbol X},A)\{Y(1) - Q_1({\boldsymbol X})\}\) to \(\psi_1\) and still have an unbiased estimating equation, where \(Q_a({\boldsymbol X}) = \mathbb{E}[Y(a) \mid {\boldsymbol X}]\). Within this family of estimating equations there is one which is semiparametric efficient, meaning that it minimizes the asymptotic variance of the resulting estimator. This turns out to be \[\begin{align} \psi_1^*(\theta, \pi; {\boldsymbol X},A,Y) = \frac{A(Y - Q_1({\boldsymbol X}))}{\pi({\boldsymbol X})} + Q_1({\boldsymbol X}) - \theta; \tag{12.1} \end{align}\] see Chapter 13 of Tsiatis (2006) for a proof of this.
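To see why adding such a term preserves unbiasedness, note that under conditional ignorability, \(Y(1) \perp\!\!\!\perp A \mid {\boldsymbol X}\), the added term has mean zero whatever \(g\) we choose: \[\begin{align*} \mathbb{E}\bigl[g({\boldsymbol X},A)\{Y(1) - Q_1({\boldsymbol X})\}\bigr] = \mathbb{E}\Bigl[g({\boldsymbol X},A)\,\mathbb{E}\bigl[Y(1) - Q_1({\boldsymbol X}) \mid {\boldsymbol X}, A\bigr]\Bigr] = \mathbb{E}\bigl[g({\boldsymbol X},A) \cdot 0\bigr] = 0. \end{align*}\]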

12.2 Augmented inverse probability weighting

Solving the estimating equation (12.1) yields what is commonly called the augmented inverse probability weighted (AIPW) estimator of \(\mathbb{E}Y(1)\), so named because it augments the usual IPW estimator with an outcome regression term: \[\begin{align} \hat\mu_1^{dr} &= \frac{1}{n} \sum_{i=1}^n \left\{ \frac{A_i (Y_i - \hat Q_1({\boldsymbol X}_i))}{\hat\pi({\boldsymbol X}_i)} + \hat Q_1({\boldsymbol X}_i) \right\}. \end{align}\] First derived by Robins and Rotnitzky (1995), it is very widely used in modern causal inference. The corresponding estimator of \(\mathbb{E}Y(0)\) is \[\begin{align*} \hat\mu_0^{dr} &= \frac{1}{n} \sum_{i=1}^n \left\{ \frac{(1-A_i) (Y_i - \hat Q_0({\boldsymbol X}_i))}{1-\hat\pi({\boldsymbol X}_i)} + \hat Q_0({\boldsymbol X}_i) \right\}. \end{align*}\]
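As a concrete illustration, here is a minimal sketch of the AIPW computation in Python (not code from the original text). It assumes the covariates, treatment indicator and outcome are held in NumPy arrays X, A and Y, and uses a logistic regression for \(\hat\pi\) and arm-specific linear regressions for \(\hat Q_1\) and \(\hat Q_0\) purely as illustrative working models; any other regression fits could be substituted.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def aipw_means(X, A, Y):
    """Doubly robust (AIPW) estimates of E[Y(1)] and E[Y(0)].

    X: (n, p) covariates; A: (n,) binary treatment; Y: (n,) outcome.
    The working models below (logistic / linear regression) are illustrative
    choices only; the AIPW formula itself does not depend on them.
    """
    # Propensity score model: pi(X) = P(A = 1 | X)
    pi_hat = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]

    # Outcome models Q_a(X) = E[Y | X, A = a], fitted within each treatment arm
    Q1_hat = LinearRegression().fit(X[A == 1], Y[A == 1]).predict(X)
    Q0_hat = LinearRegression().fit(X[A == 0], Y[A == 0]).predict(X)

    # Sample averages of the AIPW scores
    mu1 = np.mean(A * (Y - Q1_hat) / pi_hat + Q1_hat)
    mu0 = np.mean((1 - A) * (Y - Q0_hat) / (1 - pi_hat) + Q0_hat)
    return mu1, mu0
```

The difference of the two averages, \(\hat\mu_1^{dr} - \hat\mu_0^{dr}\), is then the corresponding doubly robust estimate of the average treatment effect.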

The AIPW estimator has the property of being doubly robust: \(\hat\mu_a^{dr}\) remains consistent for \(\mathbb{E}Y(a)\) provided that at least one of the outcome model (\(Q_a\)) and the propensity score model (\(\pi\)) is correctly specified, even if the other is not.

To see this for \(a=1\), we write: \[\begin{align} \hat\mu_1^{dr} &= \frac{1}{n}\sum_{i=1}^n \hat Q_1({\boldsymbol X}_i) + \frac{1}{n}\sum_{i=1}^n \frac{A_i}{\hat\pi({\boldsymbol X}_i)} (Y_i - \hat Q_1({\boldsymbol X}_i)) \tag{12.2}\\ &= \frac{1}{n}\sum_{i=1}^n \frac{A_i Y_i}{\hat\pi({\boldsymbol X}_i)} + \frac{1}{n}\sum_{i=1}^n \left\{1 - \frac{A_i}{\hat\pi({\boldsymbol X}_i)} \right\} \hat Q_1({\boldsymbol X}_i). \tag{12.3} \end{align}\] Suppose first that \(\hat{Q}_1({\boldsymbol X}) \to^p \mathbb{E}[Y(1) \mid {\boldsymbol X}] = Q_1({\boldsymbol X})\). Then the first term in (12.2) tends to \(\mathbb{E}Y(1)\) by the law of large numbers, and the second term (treating the fitted functions as fixed) has expectation \[\begin{align*} \mathbb{E}\left[\frac{A}{\hat\pi({\boldsymbol X})} (Y - \hat Q_1({\boldsymbol X}))\right] &= \mathbb{E}\left[ \frac{A}{\hat\pi({\boldsymbol X})} \, \mathbb{E}[Y - \hat Q_1({\boldsymbol X}) \mid {\boldsymbol X}, A=1] \right]. \end{align*}\] If \(\hat Q_1\) is consistent then the inner conditional expectation tends to \(\mathbb{E}[Y \mid {\boldsymbol X}, A=1] - Q_1({\boldsymbol X}) = 0\), so the whole expression tends to zero. Hence consistency of \(\hat{Q}_1\) alone is enough for consistency of \(\hat\mu_1^{dr}\), however badly \(\hat\pi\) is specified (provided it is bounded away from zero). Conversely, suppose that \(\hat\pi({\boldsymbol X}) \to^p \pi({\boldsymbol X}) = \mathbb{E}[A \mid {\boldsymbol X}]\). Then the first term in (12.3) is just the Horvitz-Thompson estimator, which is consistent for \(\mathbb{E}Y(1)\), while each summand in the second term has expectation \(\mathbb{E}\bigl[\{1 - \frac{A}{\pi({\boldsymbol X})}\} \hat Q_1({\boldsymbol X})\bigr] = 0\), since \(\mathbb{E}[A \mid {\boldsymbol X}] = \pi({\boldsymbol X})\); so this term tends to zero whatever \(\hat Q_1\) converges to, leaving only the Horvitz-Thompson estimator.
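The argument can also be checked numerically. Below is a small simulation sketch (the data-generating process is invented purely for illustration) in which the true value is \(\mathbb{E}Y(1) = 1 + e^{1/2} \approx 2.65\): in the first fit the propensity model is correct but the outcome model is a misspecified linear fit, and in the second the outcome model is correct but the propensity score is (wrongly) taken to be constant. In both cases the AIPW estimate should come out close to 2.65.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical data-generating process (illustration only):
#   pi(X) = expit(X),  Y = A + exp(X) + noise,  so E[Y(1)] = 1 + e^0.5 ~ 2.65.
rng = np.random.default_rng(1)
n = 100_000
X = rng.normal(size=(n, 1))
pi_true = 1 / (1 + np.exp(-X[:, 0]))
A = rng.binomial(1, pi_true)
Y = A + np.exp(X[:, 0]) + rng.normal(size=n)

def aipw_mu1(Q1_hat, pi_hat):
    """AIPW estimate of E[Y(1)] for given fitted nuisance values."""
    return np.mean(A * (Y - Q1_hat) / pi_hat + Q1_hat)

# (i) Correct propensity model, misspecified (linear-in-X) outcome model
pi_ok = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]
Q1_bad = LinearRegression().fit(X[A == 1], Y[A == 1]).predict(X)

# (ii) Correct outcome model (linear in exp(X)), misspecified constant propensity
Z = np.exp(X)
Q1_ok = LinearRegression().fit(Z[A == 1], Y[A == 1]).predict(Z)
pi_bad = np.full(n, A.mean())

print("correct pi, wrong Q1:", aipw_mu1(Q1_bad, pi_ok))
print("wrong pi, correct Q1:", aipw_mu1(Q1_ok, pi_bad))
print("true value          :", 1 + np.exp(0.5))
```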

References

Robins, James M., and Andrea Rotnitzky. 1995. “Semiparametric Efficiency in Multivariate Regression Models with Missing Data.” Journal of the American Statistical Association 90 (429): 122–29.
Tsiatis, Anastasios A. 2006. Semiparametric Theory and Missing Data. Springer.