Chapter 19 Post-double selection
Post-double selection is quite an old idea now (2014!), but it gives good intuition as to how other methods—such as double machine learning—are able to control the type I error while still learning very complex functions.
Suppose we have the following setup, where \({\boldsymbol X}\) is high-dimensional (say \(|{\boldsymbol X}| = p\)).
 
Figure 19.1: A SWIG.
It is clear that we can identify the causal effect of \(A\) on \(Y\): assuming independent observations and the model implied by the SWIG, \[\begin{align*} \mathbb{E}Y(a)\;=\; \sum_{\boldsymbol x}P({\boldsymbol x}) \cdot \mathbb{E}[Y \,|\,a, {\boldsymbol x}] \; = \; \mathbb{E}\left[ \frac{Y \Ind_{\{A=a\}}}{P(A=a\,|\,{\boldsymbol X})} \right]. \end{align*}\] However, statistically we may still have difficulties:
- We do not know what form the expressions for \(\mathbb{E}[Y \,|\,a, {\boldsymbol x}]\), \(P({\boldsymbol x})\), or \(P(a\,|\,{\boldsymbol x})\) should take. 
- Even if we knew the families, actually estimating the parameters may be infeasible with a finite dataset of reasonable size. 
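To make the second expression above concrete, here is a small plug-in inverse-probability-weighting estimate of \(\mathbb{E}Y(1)\) in a low-dimensional simulation (a sketch: the data-generating process and all variable names here are ours). With \(p\) comparable to \(n\), the logistic fit for the propensity score below becomes unreliable, which is precisely the difficulty this chapter addresses.
## sketch: plug-in IPW estimate of E[Y(1)] with a low-dimensional X
set.seed(42)
n <- 500
X <- matrix(rnorm(n * 3), n, 3)                     # three covariates
A <- rbinom(n, 1, plogis(X %*% c(0.5, -0.5, 0.3)))  # binary treatment
Y <- 1.5 * A + drop(X %*% c(1, 1, -1)) + rnorm(n)   # outcome; E[Y(1)] = 1.5
phat <- fitted(glm(A ~ X, family = binomial))       # estimated P(A = 1 | X)
mean(Y * (A == 1) / phat)                           # IPW estimate of E[Y(1)]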
19.1 Frisch-Waugh-Lovell Theorem
Suppose we have \(n\) i.i.d. observations \(({\boldsymbol X}_i,A_i,Y_i)\) such that \[\begin{align*} A_i &= \alpha^T {\boldsymbol X}_i + \delta_i & Y_i &= \beta A_i + \gamma^T {\boldsymbol X}_i + \varepsilon_i, \end{align*}\] where \({\boldsymbol X}_i\) has fewer than \(n-1\) entries.
Consider two different ways of obtaining an estimate of \(\beta\):
- regress \(Y\) on \({\boldsymbol X}\) and \(A\) using OLS, and look at \(\hat\beta\); 
- regress \(Y\) on \({\boldsymbol X}\) to obtain residuals \(r_Y\), and \(A\) on \({\boldsymbol X}\) to obtain residuals \(r_A\); then regress \(r_Y\) on \(r_A\) and take the linear coefficient \(\tilde\beta\). 
Theorem 19.1 (Frisch and Waugh 1933; Lovell 1963) The estimates for \(\beta\) from methods 1 and 2 are the same.
Proof. Note that \(r_A= A- \hat{\alpha}^T {\boldsymbol X}\), so \(r_A\mathbin{\perp\hspace{-3.2mm}\perp}{\boldsymbol X}\).
Then \[\begin{align*} \mathbb{E}[Y \mid {\boldsymbol X},A] &= \beta A+ \gamma^T {\boldsymbol X}\\ &= \beta (r_A+ \alpha^T {\boldsymbol X}) + \gamma^T {\boldsymbol X}\\ &= \beta r_A+ (\beta\alpha + \gamma)^T {\boldsymbol X}. \end{align*}\] Then, since \({\boldsymbol X}\mathbin{\perp\hspace{-3.2mm}\perp}r_A\), regressing \(Y\) on \({\boldsymbol X}\) alone gives an estimate of \(\beta\alpha + \gamma\).
Hence \[\begin{align*} r_Y &= Y - (\beta\alpha + \gamma)^T {\boldsymbol X}= \beta r_A+ \varepsilon, \end{align*}\] so regressing \(r_Y\) on \(r_A\) returns the same coefficient \(\beta\), giving the result.
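As a quick numerical check of Theorem 19.1 (a small simulation of our own; any linear model will do):
## numerical check of the Frisch-Waugh-Lovell theorem
set.seed(1)
n <- 200
X1 <- rnorm(n); X2 <- rnorm(n)        # covariates
A <- X1 - 0.5 * X2 + rnorm(n)         # treatment
Y <- 2 * A + X1 + X2 + rnorm(n)       # outcome, with beta = 2
coef(lm(Y ~ A + X1 + X2))["A"]        # method 1: joint OLS
r_Y <- resid(lm(Y ~ X1 + X2))         # residuals of Y on the covariates
r_A <- resid(lm(A ~ X1 + X2))         # residuals of A on the covariates
coef(lm(r_Y ~ r_A))["r_A"]            # method 2: residual-on-residual regression
The two printed coefficients agree up to numerical precision.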
19.2 Sparsity
Suppose that we have \[\begin{align*} \mathbb{E}[A\,|\,{\boldsymbol X}={\boldsymbol x}] &= \alpha^T {\boldsymbol x}\\ \mathbb{E}[Y \,|\,A=a, {\boldsymbol X}={\boldsymbol x}] &= \beta a+ \gamma^T {\boldsymbol x}. \end{align*}\] Assume also that \(\log p = o(n^{1/3})\) and there exist subsets \({\boldsymbol B}\) and \({\boldsymbol D}\) of size at most \(s_n \ll n\) such that: \[\begin{align*} \mathbb{E}[A\,|\,{\boldsymbol x}] &= \alpha_{{\boldsymbol B}}^T {\boldsymbol x}+ r_n\\ \mathbb{E}[Y \,|\,A=a, {\boldsymbol X}={\boldsymbol x}] &= \beta a+ \gamma_{\boldsymbol D}^T {\boldsymbol x}+ t_n, \end{align*}\] where the approximation error is stochastically smaller than the estimation error: i.e. \[\begin{align*} &&&&\mathbb{E}\| r_n \|_2 &\lesssim \sqrt{\frac{s_n}{n}} \quad &&\text{and} \quad &\mathbb{E}\| t_n \|_2 &\lesssim \sqrt{\frac{s_n}{n}}.&&&& \end{align*}\]
In other words, a much smaller subset of covariates is sufficient to make \(A\) and \(Y\) approximately unconfounded. Figure 19.2 contains a graphical representation.
 
Figure 19.2: Illustration of how strong effects can render two variables approximately unconfounded.
The idea is that if we account for variables in both \({\boldsymbol B}\) and \({\boldsymbol D}\), then we will be guaranteed to have good control of the bias in estimating \(\beta\). In principle we can use any consistent selection method to choose \({\boldsymbol B}\) and \({\boldsymbol D}\). In practice, Belloni, Chernozhukov, and Hansen (2014) recommend a version of the lasso.
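For intuition, here is a minimal sketch of that recipe using cross-validated lasso fits from the glmnet package (the function pds and the use of cv.glmnet are our own illustration; the hdm implementation used below chooses its penalty levels by theory rather than by cross-validation):
library(glmnet)    # assumed available; provides cv.glmnet

## post-double selection sketch: Y outcome, A treatment, X matrix of candidate covariates
pds <- function(Y, A, X) {
  sel_A <- which(as.vector(coef(cv.glmnet(X, A)))[-1] != 0)  # covariates predicting A
  sel_Y <- which(as.vector(coef(cv.glmnet(X, Y)))[-1] != 0)  # covariates predicting Y
  keep <- union(sel_A, sel_Y)                                # union of the two sets
  W <- X[, keep, drop = FALSE]
  fit <- if (length(keep) > 0) lm(Y ~ A + W) else lm(Y ~ A)  # OLS with selected covariates
  summary(fit)$coefficients["A", ]                           # estimate, s.e., t value, p value
}
With the simulated data constructed in the next section, this could be called as pds(dat$Y, dat$X, Z).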
19.2.1 Simulation
Here we perform a simulated example. Suppose that \[\begin{align*} A_i &= \alpha \sum_{j=1}^7 X_{ij} + \delta_i\\ Y_i &= \beta A_i + \gamma \sum_{j=4}^{10} X_{ij} + \varepsilon_i \end{align*}\] where \(\delta_i, \varepsilon_i \sim N(0,1)\) (independently), and we are given 1000 covariates in \({\boldsymbol X}\), where each \(X_{ij} \sim N(0,1)\) independently.
Set \(\beta = \gamma = 2\) and \(\alpha = 1\), and pick \(n=100\). (In the code below the covariates are stored in the matrix Z, while the treatment \(A\) appears as the variable X and the outcome as Y.)
alpha <- 1
gamma <- beta <- 2
n <- 100; p <- 1000
## simulate data
set.seed(123)
Z <- matrix(rnorm(n*p), n, p)
X <- alpha * rowSums(Z[, 1:7]) + rnorm(n)               # treatment (A in the model above)
Y <- beta * X + gamma * rowSums(Z[, 4:10]) + rnorm(n)   # outcome
dat <- data.frame(Y=Y, X=X, Z)
names(dat) <- c("Y","X",paste0("Z",seq_len(p)))
head(dat[,1:9])
##         Y      X     Z1     Z2      Z3      Z4      Z5     Z6     Z7
## 1 -0.5605 -0.710  2.199 -0.715 -0.0736 -0.6019  1.0740 -0.728  0.356
## 2 -0.2302  0.257  1.312 -0.753 -1.1687 -0.9937 -0.0273 -1.540 -0.658
## 3  1.5587 -0.247 -0.265 -0.939 -0.6347  1.0268 -0.0333 -0.693  0.855
## 4  0.0705 -0.348  0.543 -1.053 -0.0288  0.7511 -1.5161  0.119  1.153
## 5  0.1293 -0.952 -0.414 -0.437  0.6707 -1.5092  0.7904 -1.365  0.276
## 6  1.7151 -0.045 -0.476  0.331 -1.6505 -0.0951 -0.2107  0.590  0.144
We can try a naïve model, and obtain the wrong answer.
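For example, a naive fit of this kind regresses \(Y\) on the treatment \(X\) alone, ignoring all of the covariates; a minimal version (our own call, one way to obtain a coefficient table like the one below) is:
## naive OLS of the outcome on the treatment only, covariates omitted
summary(lm(Y ~ X, data = dat))$coefficients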
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.0854     0.0922   0.926    0.357
## X            -0.0468     0.0952  -0.491    0.625
Notice that the estimate \(\hat\beta = -0.05\) is not within 2 s.e.s (0.19) of \(\beta = 2\).
Then we can try using the R package hdm, which implements double selection.
library(hdm) ## library for implementation
lasso_out <- rlassoEffect(y = dat[, "Y", drop = FALSE],
                          d = dat[, "X", drop = FALSE],
                          x = Z, method = "double selection")
sum_out <- summary(lasso_out)
sum_out
## [1] "Estimates and significance testing of the effect of target variables"
##   Estimate. Std. Error t value Pr(>|t|)
## X -1.04e-17   3.03e-01       0        1
Note that this solution, \(\tilde\beta = 0\), is (well) within two s.e.s (0.61) of \(\beta = 2\).
Finally, we can try these methods on real data: the 401(k) eligibility dataset available via the DoubleML package, in which the treatment (e401) is eligibility for a 401(k) pension plan and the outcome (net_tfa) is net total financial assets.
d401k <- DoubleML::fetch_401k(return_type = "data.frame")
X <- model.matrix(~ -1 + (age + inc + educ + fsize + marr + twoearn + db
                       + pira + hown)^2, data = d401k)
X <- X[, apply(X, 2, var) != 0] # exclude all constant variables
y <- d401k$net_tfa
effects_401 <- rlassoATE(x = X, d=d401k$e401, y = y)
summary(effects_401)
## Estimation and significance testing of the treatment effect 
## Type: ATE 
## Bootstrap: not applicable 
##    coeff.  se. t-value p-value
## TE    550 4094    0.13    0.89
This estimate (550) is plausible, but the standard error (4094) is very large!
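For comparison, the plain double-selection estimator used in the simulation above can also be run on the same design matrix (a sketch; its output is not reproduced here):
## double selection on the 401(k) data, for comparison with the ATE fit above
ds_401 <- rlassoEffect(x = X, y = y, d = d401k$e401,
                       method = "double selection")
summary(ds_401)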