Chapter 19 Post-double selection

Post-double selection is quite an old idea now (2014!), but it gives good intuition as to how other methods—such as double machine learning—are able to control the type I error while still learning very complex functions.

Suppose we have the following setup, where \({\boldsymbol X}\) is high-dimensional (say \(|{\boldsymbol X}| = p\)).


Figure 19.1: A SWIG.

It is clear that we can identify the causal effect of \(A\) on \(Y\): assuming independent observations and the model implied by the SWIG, \[\begin{align*} \mathbb{E}Y(a)\;=\; \sum_{\boldsymbol x}P({\boldsymbol x}) \cdot \mathbb{E}[Y \,|\,a, {\boldsymbol x}] \; = \; \mathbb{E}\left[ \frac{Y \mathbb{1}_{\{A=a\}}}{P(A=a\,|\,{\boldsymbol X})} \right]. \end{align*}\] Statistically, however, we may still face difficulties:

  • We do not know what form the expressions for \(\mathbb{E}[Y \,|\,a, {\boldsymbol x}]\), \(P({\boldsymbol x})\), or \(P(a\,|\,{\boldsymbol x})\) should take.

  • Even if we knew the families, actually estimating the parameters may be infeasible with a finite dataset of reasonable size.
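
Setting these difficulties aside for a moment, the identity above is easy to verify numerically in a toy model. The following sketch uses an assumed binary covariate and binary treatment (chosen purely for illustration; none of these values come from the text) and computes both expressions for \(\mathbb{E}Y(1)\):

## toy check of the identification formula (illustrative only)
set.seed(1)
n <- 1e5
X <- rbinom(n, 1, 0.4)               # single binary covariate
A <- rbinom(n, 1, plogis(-0.5 + X))  # treatment depends on X
Y <- 1 + 2*A + X + rnorm(n)          # outcome depends on A and X

## g-formula: sum_x P(x) E[Y | A = 1, x]
gform <- sum(sapply(0:1, function(x) mean(X == x) * mean(Y[A == 1 & X == x])))

## inverse probability weighting: E[ Y 1{A = 1} / P(A = 1 | X) ]
ipw <- mean(Y * (A == 1) / plogis(-0.5 + X))

c(gform = gform, ipw = ipw)          # both approximate E[Y(1)] = 3.4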

19.1 Frisch-Waugh-Lovell Theorem

Suppose we have \(n\) i.i.d. observations \(({\boldsymbol X}_i,A_i,Y_i)\) such that \[\begin{align*} A_i &= \alpha^T {\boldsymbol X}_i + \delta_i & Y_i &= \beta A_i + \gamma^T {\boldsymbol X}_i + \varepsilon_i, \end{align*}\] where \({\boldsymbol X}_i\) has fewer than \(n-1\) entries.

Consider two different ways of obtaining an estimate of \(\beta\):

  1. regress \(Y\) on \({\boldsymbol X}\) and \(A\) using OLS, and look at \(\hat\beta\);

  2. regress \(Y\) on \({\boldsymbol X}\) to obtain residuals \(r_Y\), and \(A\) on \({\boldsymbol X}\) to obtain residuals \(r_A\); then regress \(r_Y\) on \(r_A\) and take the coefficient \(\tilde\beta\).

Theorem 19.1 (Frisch and Waugh 1933; Lovell 1963) The estimates for \(\beta\) from methods 1 and 2 are the same.

Proof. Note that \(r_A= A- \hat{\alpha}^T {\boldsymbol X}\), so \(r_A\mathbin{\perp\hspace{-3.2mm}\perp}{\boldsymbol X}\).

Then \[\begin{align*} \mathbb{E}[Y \mid {\boldsymbol X},A] &= \beta A + \gamma^T {\boldsymbol X}\\ &= \beta (r_A + \alpha^T {\boldsymbol X}) + \gamma^T {\boldsymbol X}\\ &= \beta r_A + (\beta\alpha + \gamma)^T {\boldsymbol X}. \end{align*}\] Now, since \({\boldsymbol X}\mathbin{\perp\hspace{-3.2mm}\perp}r_A\), regressing \(Y\) on \({\boldsymbol X}\) gives an estimate of \(\beta\alpha + \gamma\).

Hence the residual \(r_Y = Y - (\beta\alpha + \gamma)^T {\boldsymbol X}\) satisfies \[\begin{align*} \mathbb{E}[r_Y \mid r_A] &= \beta r_A, \end{align*}\] so regressing \(r_Y\) on \(r_A\) recovers \(\beta\), giving the result.
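
As a quick sanity check of Theorem 19.1, here is a minimal R sketch on simulated data; the data-generating values are arbitrary and purely illustrative.

## verify that the two FWL estimates coincide (illustrative simulation)
set.seed(42)
n <- 200; p <- 5
X <- matrix(rnorm(n*p), n, p)
A <- as.vector(X %*% rep(0.5, p) + rnorm(n))
Y <- as.vector(2*A + X %*% rep(1, p) + rnorm(n))

## Method 1: OLS of Y on A and X jointly
beta_hat <- coef(lm(Y ~ A + X))["A"]

## Method 2: residualise Y and A on X, then regress the residuals
r_Y <- resid(lm(Y ~ X))
r_A <- resid(lm(A ~ X))
beta_tilde <- coef(lm(r_Y ~ r_A))["r_A"]

c(beta_hat, beta_tilde)   # equal up to numerical error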

19.2 Sparsity

Suppose that we have \[\begin{align*} \mathbb{E}[A\,|\,{\boldsymbol X}={\boldsymbol x}] &= \alpha^T {\boldsymbol x}\\ \mathbb{E}[Y \,|\,A=a, {\boldsymbol X}={\boldsymbol x}] &= \beta a + \gamma^T {\boldsymbol x}. \end{align*}\] Assume also that \(\log p = o(n^{1/3})\), and that there exist subsets \({\boldsymbol B}\) and \({\boldsymbol D}\) of the covariates, each of size at most \(s_n \ll n\), such that \[\begin{align*} \mathbb{E}[A\,|\,{\boldsymbol X}={\boldsymbol x}] &= \alpha_{{\boldsymbol B}}^T {\boldsymbol x} + r_n\\ \mathbb{E}[Y \,|\,A=a, {\boldsymbol X}={\boldsymbol x}] &= \beta a + \gamma_{\boldsymbol D}^T {\boldsymbol x} + t_n, \end{align*}\] where the approximation error is stochastically smaller than the estimation error, i.e. \[\begin{align*} \mathbb{E}\| r_n \|_2 \lesssim \sqrt{\frac{s_n}{n}} \qquad\text{and}\qquad \mathbb{E}\| t_n \|_2 \lesssim \sqrt{\frac{s_n}{n}}. \end{align*}\]

In other words, a much smaller subset of covariates is sufficient to render \(A\) and \(Y\) approximately unconfounded. Figure 19.2 contains a graphical representation.


Figure 19.2: Illustration of how strong effects can render two variables approximately unconfounded.

The idea is that if we account for variables in both \({\boldsymbol B}\) and \({\boldsymbol D}\), then we will be guaranteed to have good control of the bias in estimating \(\beta\). In principle we can use any consistent selection method to choose \({\boldsymbol B}\) and \({\boldsymbol D}\). In practice, Belloni, Chernozhukov, and Hansen (2014) recommend a version of the lasso.
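
To make the recipe concrete, here is a minimal hand-written sketch of the double-selection steps. It assumes the glmnet package (not otherwise used in this chapter) and uses a cross-validated lasso for simplicity, whereas Belloni, Chernozhukov, and Hansen (2014) recommend a specific plug-in penalty; the hdm package used in the simulation below automates all of this.

## hand-written post-double selection (illustrative sketch; assumes glmnet)
library(glmnet)

post_double_selection <- function(Y, A, X) {
  sel_A <- which(as.vector(coef(cv.glmnet(X, A)))[-1] != 0)  # lasso of A on X
  sel_Y <- which(as.vector(coef(cv.glmnet(X, Y)))[-1] != 0)  # lasso of Y on X
  keep <- union(sel_A, sel_Y)                                # union of selected sets
  ## final low-dimensional OLS of Y on A and the selected covariates
  if (length(keep) > 0) {
    fit <- lm(Y ~ A + X[, keep, drop = FALSE])
  } else {
    fit <- lm(Y ~ A)
  }
  summary(fit)$coef["A", ]
}
## e.g. post_double_selection(dat$Y, dat$X, Z) for the simulation below

Selecting variables from both regressions is what protects the final OLS step: a covariate that strongly predicts \(A\) but only weakly predicts \(Y\) (or vice versa) is still included, so the omitted-variable bias in \(\hat\beta\) remains small.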

19.2.1 Simulation

Here we perform a simulated example. Suppose that \[\begin{align*} A_i &= \alpha \sum_{j=1}^7 X_{ij} + \delta_i\\ Y_i &= \beta A_i + \gamma \sum_{j=4}^{10} X_{ij} + \varepsilon_i, \end{align*}\] where \(\delta_i, \varepsilon_i \sim N(0,1)\) (independently), and we are given \(p = 1000\) covariates in \({\boldsymbol X}\), with each \(X_{ij} \sim N(0,1)\) independently.

Set \(\beta = \gamma = 2\) and \(\alpha = 1\), and pick \(n=100\). (In the code below the treatment \(A\) is called X, and the covariates are Z1, ..., Z1000.)

alpha <- 1
gamma <- beta <- 2
n <- 100; p <- 1000

## simulate data
set.seed(123)
Z <- matrix(rnorm(n*p), n, p)
X <- alpha*rowSums(Z[, 1:7]) + rnorm(n)             # treatment (A in the notation above)
Y <- beta*X + gamma*rowSums(Z[, 4:10]) + rnorm(n)   # outcome
dat <- data.frame(Y=Y, X=X, Z)
names(dat) <- c("Y","X",paste0("Z",seq_len(p)))

head(dat[,1:9])
##         Y      X     Z1     Z2      Z3      Z4      Z5     Z6     Z7
## 1 -0.5605 -0.710  2.199 -0.715 -0.0736 -0.6019  1.0740 -0.728  0.356
## 2 -0.2302  0.257  1.312 -0.753 -1.1687 -0.9937 -0.0273 -1.540 -0.658
## 3  1.5587 -0.247 -0.265 -0.939 -0.6347  1.0268 -0.0333 -0.693  0.855
## 4  0.0705 -0.348  0.543 -1.053 -0.0288  0.7511 -1.5161  0.119  1.153
## 5  0.1293 -0.952 -0.414 -0.437  0.6707 -1.5092  0.7904 -1.365  0.276
## 6  1.7151 -0.045 -0.476  0.331 -1.6505 -0.0951 -0.2107  0.590  0.144

We can try a naïve model, regressing \(Y\) on the treatment alone, and obtain the wrong answer.

sum_lm <- summary(lm(Y ~ X, data=dat))
sum_lm$coef
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.0854     0.0922   0.926    0.357
## X            -0.0468     0.0952  -0.491    0.625
coef <- sum_lm$coef

Notice that the estimate \(\hat\beta = -0.05\) is not within 2 s.e.s (0.19) of \(\beta = 2\).

Then we can try using the R package hdm, which implements double selection.

library(hdm) ## library for implementation
lasso_out <- rlassoEffect(y = dat[, "Y", drop=FALSE],  # outcome
                          d = dat[, "X", drop=FALSE],  # treatment of interest
                          x = Z,                       # high-dimensional controls
                          method = "double selection")

sum_out <- summary(lasso_out)
sum_out
## [1] "Estimates and significance testing of the effect of target variables"
##   Estimate. Std. Error t value Pr(>|t|)
## X -1.04e-17   3.03e-01       0        1

With the data generated as above, the double-selection estimate \(\tilde\beta\) should lie within two standard errors of the true value \(\beta = 2\), in contrast to the naïve fit.
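
Beyond simulated data, the hdm package also provides functions such as rlassoATE that use the same lasso machinery to estimate average treatment effects. Here we apply it to the 401(k) pension data (fetched via the DoubleML package), where the treatment is eligibility for a 401(k) plan and the outcome is net total financial assets.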

d401k <- DoubleML::fetch_401k(return_type = "data.frame")
## controls: main effects and all pairwise interactions
X <- model.matrix(~ -1 + (age + inc + educ + fsize + marr + twoearn + db
                       + pira + hown)^2, data = d401k)
X <- X[, apply(X, 2, var) != 0]  # exclude all constant variables
y <- d401k$net_tfa               # outcome: net total financial assets
effects_401 <- rlassoATE(x = X, d = d401k$e401, y = y)  # treatment: 401(k) eligibility
summary(effects_401)
## Estimation and significance testing of the treatment effect 
## Type: ATE 
## Bootstrap: not applicable 
##    coeff.  se. t-value p-value
## TE    550 4094    0.13    0.89

This estimate (550) is plausible, but the standard error (4094) is very large, so the effect is not statistically distinguishable from zero.

References

Belloni, Alexandre, Victor Chernozhukov, and Christian Hansen. 2014. “Inference on Treatment Effects After Selection Among High-Dimensional Controls.” Review of Economic Studies 81 (2): 608–50.

Frisch, Ragnar, and Frederick V. Waugh. 1933. “Partial Time Regressions as Compared with Individual Trends.” Econometrica 1 (4): 387–401.

Lovell, Michael C. 1963. “Seasonal Adjustment of Economic Time Series and Multiple Regression Analysis.” Journal of the American Statistical Association 58 (304): 993–1010.