Chapter 20 Double Machine Learning
Double (or debiased) machine learning is an increasingly common approach to estimating causal effects (Chernozhukov et al. 2018). The basic idea is the same as the approach of Belloni, Chernozhukov, and Hansen (2014).
We estimate separate high-dimensional models for the treatment and the outcome. The methods make extensive use of cross-fitting: the data are split into folds, and the nuisance models are fitted on one part of the data and used to form predictions (and residuals) on the other. This allows for flexible, high-dimensional estimation while preventing over-fitting.
Mathematically speaking, much more complicated models can be used for these nuisance components while still giving an approximately unbiased, \(\sqrt{n}\)-consistent estimator of a (low-dimensional) causal effect.
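To make the idea concrete, here is a minimal sketch of 2-fold cross-fitting for the partially linear model \(Y = \beta A + g({\boldsymbol X}) + \varepsilon\), using the lasso (via cv.glmnet) for the nuisance fits. The simulated data, variable names and fold scheme below are our own illustrative choices, not part of any package.

library(glmnet)
set.seed(42)
n <- 1000; p <- 50
X <- matrix(rnorm(n * p), n, p)
A <- X[, 1] + rnorm(n)              ## treatment depends on X
Y <- 0.5 * A + X[, 1] + rnorm(n)    ## true effect beta = 0.5

folds <- sample(rep(1:2, length.out = n))
res_Y <- res_A <- numeric(n)
for (k in 1:2) {
  tr <- folds != k; te <- folds == k
  ## fit the nuisance models on one fold...
  fit_Y <- cv.glmnet(X[tr, ], Y[tr])
  fit_A <- cv.glmnet(X[tr, ], A[tr])
  ## ...and compute residuals on the held-out fold
  res_Y[te] <- Y[te] - as.numeric(predict(fit_Y, X[te, ], s = "lambda.min"))
  res_A[te] <- A[te] - as.numeric(predict(fit_A, X[te, ], s = "lambda.min"))
}
## final stage: regress outcome residuals on treatment residuals
coef(lm(res_Y ~ res_A - 1))

Because each observation's residuals come from models fitted on the other fold, over-fitting in the nuisance models does not contaminate the final estimate of \(\beta\).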
20.1 Conditions for Double ML
A crucial condition for double ML to work is Neyman orthogonality, which says that the derivative of the expected estimating equation with respect to the nuisance parameters, evaluated at the true parameters, should be zero.
Suppose our score function is \(\psi(W; \theta, \eta)\), with parameters of interest \(\theta\) and nuisance parameters \(\eta\). Then we need: \[\begin{align*} \left. \frac{\partial}{\partial \eta} \mathbb{E}\psi(W; \theta_0, \eta) \right|_{\eta=\eta_0} = 0, \end{align*}\] where \((\theta_0,\eta_0)\) are the true parameters.
If we are given a score function that is not Neyman orthogonal, we can often modify it to become so. Consider the linear model example, where the usual scores for \(\beta\) and \(\gamma\) are \[\begin{align*} \tilde{\psi}_\beta(W; \beta, \gamma) &= (Y - \beta A - \gamma^T {\boldsymbol X}) \cdot A,\\ \tilde{\psi}_\gamma(W; \beta, \gamma) &= (Y - \beta A - \gamma^T {\boldsymbol X}) \cdot {\boldsymbol X}. \end{align*}\] Taking a directional derivative in the direction \(h \in \mathbb{R}^{|{\boldsymbol X}|}\), we have \[\begin{align*} \left. \frac{\partial}{\partial \delta} \mathbb{E}\, \tilde{\psi}_\beta(W; \beta_0, \gamma_0 + \delta h) \right|_{\delta = 0} = - h^T\, \mathbb{E}[{\boldsymbol X} A]. \end{align*}\] In general this is not zero, so \(\tilde{\psi}_\beta\) is not Neyman orthogonal.
Now, we can reparametrize the nuisance parameter \(\gamma\) as \(\eta = (\gamma, \mu)\), where we choose \(\mu\) so that the new score for \(\beta\) is \[\begin{align*} \psi_\beta(W; \beta, \eta) &= \tilde{\psi}_\beta(W; \beta, \gamma) - \mu^T \tilde{\psi}_\gamma(W; \beta, \gamma)\\ &= (Y - \beta A - \gamma^T {\boldsymbol X}) (A - \mu^T {\boldsymbol X}). \end{align*}\] If we pick \(\mu = \alpha\), the coefficient from the linear projection of \(A\) on \({\boldsymbol X}\) (so that \(A = \alpha^T {\boldsymbol X} + \nu\) with \(\mathbb{E}[\nu {\boldsymbol X}] = 0\)), then the second factor is the projection residual \(\nu\), which has mean zero and is uncorrelated with \({\boldsymbol X}\).
Hence, small errors in the estimation of \(\gamma\) and \(\alpha\) will not affect the estimate of \(\beta\). In particular: \[\begin{align*} \frac{\partial}{\partial \gamma} \psi_\beta(W; \beta, \gamma, \alpha) &= - {\boldsymbol X}(A - \alpha^T {\boldsymbol X}) \\ \frac{\partial}{\partial \alpha} \psi_\beta(W; \beta, \gamma, \alpha) &= - {\boldsymbol X}(Y - \beta A - \gamma^T {\boldsymbol X}), \end{align*}\] and both of these have expectation zero at the true parameter values.
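We can check this numerically: solve each score empirically after perturbing the nuisance parameters by a small amount \(\delta\). The data-generating process and function names below are our own illustrative choices.

set.seed(1)
n <- 1e5
X <- rnorm(n)
A <- 0.8 * X + rnorm(n)             ## alpha = 0.8
Y <- A + 2 * X + rnorm(n)           ## beta = 1, gamma = 2

## beta solving the original (non-orthogonal) score, given gamma
beta_tilde <- function(g) sum((Y - g * X) * A) / sum(A^2)
## beta solving the orthogonalized score, given gamma and alpha
beta_psi <- function(g, a) {
  R <- A - a * X
  sum((Y - g * X) * R) / sum(A * R)
}

delta <- 0.1   ## perturb both nuisance parameters
c(nonorth = beta_tilde(2 + delta) - 1,
  orth = beta_psi(2 + delta, 0.8 + delta) - 1)

The error of the non-orthogonal estimator is of order \(\delta\), while that of the orthogonalized estimator is of order \(\delta^2\): an order of magnitude smaller here.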
Moral: Neyman orthogonality buys robustness to small errors in the estimated nuisance parameters.
20.2 Eligibility for 401(k) Example
Chernozhukov et al. (2018) analyse data on 401(k) savings plans, asking whether eligibility to enroll leads to an increase in net financial assets (net_tfa). They consider a dataset of 9,915 individuals, measuring:
- age: age in years;
- inc: income;
- educ: years of education;
- fsize: family size;
- marr: indicator of being married;
- twoearn: two earners in household;
- db: member of defined benefit pension scheme;
- pira: participates in an Individual Retirement Account (IRA);
- hown: homeowner.
We can now implement this in R.
library(DoubleML)
library(mlr3)
library(mlr3learners)  ## provides the cv_glmnet learners used below
library(data.table)
library(dplyr)
## note that the DoubleML package uses data.table objects
dat <- fetch_401k(return_type = "data.table", instrument = TRUE)
# Initialize DoubleMLData (data-backend of DoubleML)
dml <- DoubleMLData$new(dat,
                        y_col = "net_tfa",
                        d_cols = "e401",
                        x_cols = c("age", "inc", "educ", "fsize",
                                   "marr", "twoearn", "db", "pira", "hown"))
mod <- DoubleMLIRM$new(dml,
                       ml_m = lrn("classif.cv_glmnet", s = "lambda.min"),
                       ml_g = lrn("regr.cv_glmnet", s = "lambda.min"),
                       n_folds = 10, n_rep = 10)
mod$fit()  ## fit the model
c(beta = mod$coef, se = mod$se)
## beta.e401   se.e401
##      1752      3653
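Having fitted the model, we can inspect the object further; the fitted DoubleML object provides summary() and confint() methods:

mod$summary()             ## coefficient, standard error, t-value, p-value
mod$confint(level = 0.95) ## 95% confidence interval for the e401 effect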
We can also try using a more flexible set of covariates.
## add quadratic terms in age, income, education and family size
formula_flex <- formula("~ -1 + poly(age, 2, raw = TRUE) +
    poly(inc, 2, raw = TRUE) + poly(educ, 2, raw = TRUE) +
    poly(fsize, 2, raw = TRUE) + marr + twoearn + db + pira + hown")
features_flex <- data.frame(model.matrix(formula_flex, dat))
model_data <- data.table("net_tfa" = dat[, net_tfa],
                         "e401" = dat[, e401], features_flex)
## initialize and fit model with the flexible covariates
dml_f <- DoubleMLData$new(model_data, y_col = "net_tfa",
                          d_cols = "e401")
mod_f <- DoubleMLIRM$new(dml_f,
                         ml_m = lrn("classif.cv_glmnet", s = "lambda.min"),
                         ml_g = lrn("regr.cv_glmnet", s = "lambda.min"),
                         n_folds = 10, n_rep = 5)
mod_f$fit()
c(beta = mod_f$coef, se = mod_f$se)
## beta.e401   se.e401
##      8722      1299

We obtain a much smaller standard error.
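The lasso-based learners are not essential: any suitable mlr3 learner can be substituted. As a sketch (this assumes the ranger package is installed, and reuses the dml data backend from above), we could instead use random forests:

mod_rf <- DoubleMLIRM$new(dml,
                          ml_m = lrn("classif.ranger", num.trees = 500),
                          ml_g = lrn("regr.ranger", num.trees = 500),
                          n_folds = 10, n_rep = 5)
mod_rf$fit()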