Chapter 19 Post-double selection
Post-double selection is quite an old idea now (2014!), but it gives good intuition as to how other methods—such as double machine learning—are able to control the type I error while still learning very complex functions.
Suppose we have the following setup, where \({\boldsymbol X}\) is high-dimensional (say \(|{\boldsymbol X}| = p\)).
Figure 19.1: A SWIG.
It is clear that we can identify the causal effect of \(A\) on \(Y\), since assuming independent observations and the model implied by the SWIG: \[\begin{align*} \mathbb{E}Y(a)\;=\; \sum_{\boldsymbol x}P({\boldsymbol x}) \cdot \mathbb{E}[Y \,|\,a, {\boldsymbol x}] \; = \; \mathbb{E}\left[ \frac{Y \Ind_{\{A=a\}}}{P(A=a\,|\,{\boldsymbol X})} \right]; \end{align*}\] however, statistically we may still have difficulties.
We do not know what form the expressions for \(\mathbb{E}[Y \,|\,a, {\boldsymbol x}]\), \(P({\boldsymbol x})\), or \(P(a\,|\,{\boldsymbol x})\) should take.
Even if we knew the families, actually estimating the parameters may be infeasible with a finite dataset of reasonable size.
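As a toy numerical check of the display above (a minimal sketch with one binary covariate and a binary treatment; all numbers are invented), the adjustment and weighting formulas give essentially the same answer:
## toy check that the adjustment and weighting formulas agree
set.seed(1)
n <- 1e5
x <- rbinom(n, 1, 0.4)                # single binary confounder
a <- rbinom(n, 1, plogis(-0.5 + x))   # treatment depends on x
y <- rnorm(n, mean = 1 + 2*a + x)     # outcome depends on a and x
## adjustment formula: sum over x of P(x) * E[Y | A = 1, x]
mu1_adj <- mean(x == 0) * mean(y[a == 1 & x == 0]) +
  mean(x == 1) * mean(y[a == 1 & x == 1])
## weighting formula: E[ Y * 1{A = 1} / P(A = 1 | X) ]
p1 <- ifelse(x == 1, mean(a[x == 1]), mean(a[x == 0]))
mu1_ipw <- mean(y * (a == 1) / p1)
c(adjustment = mu1_adj, ipw = mu1_ipw)  # both close to E[Y(1)] = 3.4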
19.1 Frisch-Waugh-Lovell Theorem
Suppose we have \(n\) i.i.d. observations \(({\boldsymbol X}_i,A_i,Y_i)\) such that \[\begin{align*} A_i &= \alpha^T {\boldsymbol X}_i + \delta_i & Y_i &= \beta A_i + \gamma^T {\boldsymbol X}_i + \varepsilon_i, \end{align*}\] where \({\boldsymbol X}_i\) has fewer than \(n-1\) entries.
Consider two different ways of obtaining an estimate of \(\beta\):
1. regress \(Y\) on \({\boldsymbol X}\) and \(A\) using OLS, and look at \(\hat\beta\);
2. regress \(Y\) on \({\boldsymbol X}\) to obtain the residual \(r_Y\), and \(A\) on \({\boldsymbol X}\) to obtain \(r_A\); then regress \(r_Y\) on \(r_A\) and take the linear coefficient \(\tilde\beta\).
Theorem 19.1 (Frisch and Waugh 1933; Lovell 1963) The estimates for \(\beta\) from methods 1 and 2 are the same.
Proof. Note that \(r_A = A - \hat{\alpha}^T {\boldsymbol X}\), so \(r_A \mathbin{\perp\hspace{-3.2mm}\perp} {\boldsymbol X}\).
Then \[\begin{align*} \mathbb{E}[Y \mid {\boldsymbol X},A] &= \beta A + \gamma^T {\boldsymbol X}\\ &= \beta (r_A + \hat\alpha^T {\boldsymbol X}) + \gamma^T {\boldsymbol X}\\ &= \beta r_A + (\beta\hat\alpha + \gamma)^T {\boldsymbol X}. \end{align*}\] Then, since \({\boldsymbol X}\mathbin{\perp\hspace{-3.2mm}\perp}r_A\), regressing \(Y\) on \({\boldsymbol X}\) alone gives an estimate of \(\beta\hat\alpha + \gamma\), so that \(r_Y = Y - (\beta\hat\alpha + \gamma)^T {\boldsymbol X}\).
Hence \[\begin{align*} \mathbb{E}[r_Y \mid r_A] &= \beta\, r_A, \end{align*}\] and regressing \(r_Y\) on \(r_A\) recovers the same \(\beta\), giving the result.
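As a quick numerical check of Theorem 19.1, the following sketch (with made-up coefficients and a small simulated dataset) compares the two procedures:
## numerical check of the Frisch-Waugh-Lovell theorem
set.seed(1)
n <- 50
X <- matrix(rnorm(n * 3), n, 3)
A <- drop(X %*% c(1, -1, 0.5)) + rnorm(n)
Y <- 2 * A + drop(X %*% c(0.3, 0.3, 0.3)) + rnorm(n)
## method 1: joint OLS of Y on A and X
beta_hat <- coef(lm(Y ~ A + X))["A"]
## method 2: residualise Y and A on X, then regress the residuals
r_Y <- resid(lm(Y ~ X))
r_A <- resid(lm(A ~ X))
beta_tilde <- coef(lm(r_Y ~ r_A))["r_A"]
all.equal(unname(beta_hat), unname(beta_tilde))  # TRUE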
19.2 Sparsity
Suppose that we have \[\begin{align*} \mathbb{E}[A\,|\,{\boldsymbol X}={\boldsymbol x}] &= \alpha^T {\boldsymbol x}\\ \mathbb{E}[Y \,|\,A=a, {\boldsymbol X}={\boldsymbol x}] &= \beta a+ \gamma^T {\boldsymbol x}. \end{align*}\] Assume also that \(\log p = o(n^{1/3})\), and that there exist subsets \({\boldsymbol B}\) and \({\boldsymbol D}\) of the covariates, each of size at most \(s_n \ll n\), such that \[\begin{align*} \mathbb{E}[A\,|\,{\boldsymbol x}] &= \alpha_{{\boldsymbol B}}^T {\boldsymbol x}+ r_n\\ \mathbb{E}[Y \,|\,A=a, {\boldsymbol X}={\boldsymbol x}] &= \beta a+ \gamma_{\boldsymbol D}^T {\boldsymbol x}+ t_n, \end{align*}\] where the approximation error is stochastically smaller than the estimation error; that is, \[ \mathbb{E}\| r_n \|_2 \lesssim \sqrt{\frac{s_n}{n}} \qquad\text{and}\qquad \mathbb{E}\| t_n \|_2 \lesssim \sqrt{\frac{s_n}{n}}. \]
In other words, a much smaller subset of covariates is sufficient to approximately make \(A\) and \(Y\) unconfounded. Figure 19.2 contains a graphical representation.
Figure 19.2: Illustration of how strong effects can render two variables approximately unconfounded.
The idea is that if we account for variables in both \({\boldsymbol B}\) and \({\boldsymbol D}\), then we will be guaranteed to have good control of the bias in estimating \(\beta\). In principle we can use any consistent selection method to choose \({\boldsymbol B}\) and \({\boldsymbol D}\). In practice, Belloni, Chernozhukov, and Hansen (2014) recommend a version of the lasso.
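To fix ideas, here is a minimal sketch of that recipe using the cross-validated lasso from the glmnet package. This is illustrative only: the function name is made up, and Belloni, Chernozhukov, and Hansen (2014) use a particular plug-in penalty level rather than cross-validation.
library(glmnet)

post_double_selection <- function(A, Y, X) {
  ## step 1: lasso of the treatment on the covariates; selected set B
  B <- which(as.numeric(coef(cv.glmnet(X, A)))[-1] != 0)
  ## step 2: lasso of the outcome on the covariates; selected set D
  D <- which(as.numeric(coef(cv.glmnet(X, Y)))[-1] != 0)
  ## step 3: OLS of the outcome on the treatment and the union of B and D
  keep <- union(B, D)
  fit <- lm(Y ~ A + X[, keep, drop = FALSE])
  coef(summary(fit))["A", ]
}
Refitting on the union of the two selected sets is what protects the estimate of \(\beta\): a covariate missed in one step is still controlled for if it is picked up in the other.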
19.2.1 Simulation
Here we work through a simulated example. Suppose that \[\begin{align*} A_i &= \alpha \sum_{j=1}^7 X_{ij} + \delta_i\\ Y_i &= \beta A_i + \gamma \sum_{j=4}^{10} X_{ij} + \varepsilon_i \end{align*}\] where \(\delta_i, \varepsilon_i \sim N(0,1)\) (independently), and we are given \(p = 1000\) covariates in \({\boldsymbol X}\), where each \(X_{ij} \sim N(0,1)\) independently.
Set \(\beta = \gamma = 2\) and \(\alpha = 1\), and pick \(n=100\).
alpha <- 1
gamma <- beta <- 2
n <- 100; p <- 1000
## simulate data
set.seed(123)
Z <- matrix(rnorm(n*p), n, p)                          # covariates (the X's above)
X <- alpha * rowSums(Z[, 1:7]) + rnorm(n)              # treatment (A above)
Y <- beta * X + gamma * rowSums(Z[, 4:10]) + rnorm(n)  # outcome
dat <- data.frame(Y=Y, X=X, Z)
names(dat) <- c("Y","X",paste0("Z",seq_len(p)))
head(dat[,1:9])
##        Y      X     Z1     Z2      Z3      Z4      Z5     Z6     Z7
## 1 -0.5605 -0.710 2.199 -0.715 -0.0736 -0.6019 1.0740 -0.728 0.356
## 2 -0.2302 0.257 1.312 -0.753 -1.1687 -0.9937 -0.0273 -1.540 -0.658
## 3 1.5587 -0.247 -0.265 -0.939 -0.6347 1.0268 -0.0333 -0.693 0.855
## 4 0.0705 -0.348 0.543 -1.053 -0.0288 0.7511 -1.5161 0.119 1.153
## 5 0.1293 -0.952 -0.414 -0.437 0.6707 -1.5092 0.7904 -1.365 0.276
## 6 1.7151 -0.045 -0.476 0.331 -1.6505 -0.0951 -0.2107 0.590 0.144
We can try a naïve model, regressing the outcome on the treatment alone, and obtain the wrong answer.
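The fit below was presumably produced by ordinary least squares of the outcome on the treatment alone, along the lines of:
## naive fit: outcome on treatment only, ignoring the covariates Z
summary(lm(Y ~ X, data = dat))$coefficients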
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0854 0.0922 0.926 0.357
## X -0.0468 0.0952 -0.491 0.625
Notice that the estimate \(\hat\beta = -0.05\) is not within 2 s.e.s (0.19) of \(\beta = 2\).
Then we can try using the R package hdm, which implements
double selection.
library(hdm)  ## library for implementation
lasso_out <- rlassoEffect(y = dat[, "Y", drop = FALSE],
                          d = dat[, "X", drop = FALSE],
                          x = Z, method = "double selection")
sum_out <- summary(lasso_out)
sum_out
## [1] "Estimates and significance testing of the effect of target variables"
## Estimate. Std. Error t value Pr(>|t|)
## X -1.15e-17 3.21e-01 0 1
Note that this solution, \(\tilde\beta = 0\), is (well) within two s.e.s (0.64) of \(\beta = 2\).
We can also apply the method to real data: here we use the 401(k) dataset from the DoubleML package and estimate the average effect of 401(k) eligibility (e401) on net total financial assets (net_tfa).
d401k <- DoubleML::fetch_401k(return_type = "data.frame")
## covariates: all main effects and two-way interactions, no intercept
X <- model.matrix(~ -1 + (age + inc + educ + fsize + marr + twoearn + db
                          + pira + hown)^2, data = d401k)
X <- X[, apply(X, 2, var) != 0]  # exclude all constant variables
y <- d401k$net_tfa
effects_401 <- rlassoATE(x = X, d = d401k$e401, y = y)
summary(effects_401)
## Estimation and significance testing of the treatment effect
## Type: ATE
## Bootstrap: not applicable
## coeff. se. t-value p-value
## TE 550 4094 0.13 0.89
This estimate (550) is plausible, but the standard error (4094) is very large!
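For instance, an approximate 95% confidence interval built from the printed estimate and standard error is far too wide to be informative:
## approximate 95% confidence interval: roughly (-7474, 8574)
550 + c(-1, 1) * qnorm(0.975) * 4094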