This document describes how to use the causl
package to
perform plasmode simulation; that is, producing datasets that combining
real covariates (and possibly treatments) with synthetic outcomes. This
enables us to have realistic datasets, but also to have knowledge of the
underlying causal effect, making them very useful for testing the
relative effectiveness of different inference methods.
Start by loading the causl
and dplyr
packages.
library(dplyr)
library(causl)
We will use the dataset from a competition held at the Atlantic Causal Inference Conference in 2016. To obtain this, first run the following commands:
install.packages("remotes")
remotes::install_github("vdorie/aciccomp/2016")
Then load the aciccomp2016
package.
library(aciccomp2016)
(dat <- as_tibble(input_2016)) # show 10 rows of first few variables
## # A tibble: 4,802 × 58
## x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10 x_11 x_12 x_13
## <int> <fct> <dbl> <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 29 C 1 7 60 85 0 0 1 0 0 1 0
## 2 27 C 0 0 64 178 0 0 0 0 0 2 1
## 3 27 C 0 0 60 102 0 0 0 0 0 1 0
## 4 37 C 0 0 65 174 0 0 0 0 0 1 0
## 5 24 C 20 14 63 129 0 0 0 0 0 1 0
## 6 27 C 40 15 63 135 0 0 0 0 0 1 1
## 7 26 C 20 8 69 140 0 0 0 1 0 0 0
## 8 33 C 0 0 60 110 0 0 0 1 0 0 1
## 9 28 C 3 5 61 160 0 0 0 0 0 3 1
## 10 31 C 0 0 63 114 0 1 0 0 0 0 0
## # ℹ 4,792 more rows
## # ℹ 45 more variables: x_14 <int>, x_15 <int>, x_16 <int>, x_17 <int>,
## # x_18 <int>, x_19 <int>, x_20 <int>, x_21 <fct>, x_22 <int>, x_23 <int>,
## # x_24 <fct>, x_25 <int>, x_26 <int>, x_27 <int>, x_28 <int>, x_29 <int>,
## # x_30 <int>, x_31 <int>, x_32 <int>, x_33 <int>, x_34 <int>, x_35 <int>,
## # x_36 <int>, x_37 <int>, x_38 <int>, x_39 <int>, x_40 <int>, x_41 <int>,
## # x_42 <int>, x_43 <int>, x_44 <int>, x_45 <int>, x_46 <int>, x_47 <int>, …
Let us consider a model for the causal effect of smoking on birthweight. We will start with a binary variable \(A\) to indicate whether the mother smoked during the pregnancy, and later extend this to a zero-inflated continuous variable. We start by setting up the model inputs:
forms <- list(list(),
list(A ~ x_1 + x_3 + x_4),
list(Y ~ A),
list(~ 1))
fams <- list(integer(0), 5, 1, 1)
pars <- list(A = list(beta=c(-1.5,0.03,0.02,0.05)),
Y = list(beta=c(3200, -500), phi=400^2),
cop = list(beta=-1))
Then call rfrugalParam
to simulate \(A\) and \(Y\).
set.seed(123) # to obtain consistent results
datAY <- rfrugalParam(formulas=forms, family=fams, pars=pars, dat=dat)
We can now check that this basic simulation was performed correctly. First we fit the treatment variable using the correct model.
glmA <- glm(A ~ x_1 + x_3 + x_4, family=binomial, data=datAY)
summary(glmA)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.6041 0.14387 -11.15 7.18e-29
## x_1 0.0340 0.00580 5.86 4.61e-09
## x_3 0.0250 0.00365 6.83 8.49e-12
## x_4 0.0469 0.00723 6.48 9.24e-11
Indeed, the parameters appear correct. Then we can use inverse
weighting to estimate the parameters for the outcome model. We will need
to load the survey
package to get robust standard errors
after the weighting.
library(survey)
ps <- predict(glmA, type="response")
wt <- datAY$A/ps + (1-datAY$A)/(1-ps)
glmY <- svyglm(Y ~ A, design = svydesign(~1, weights = wt, data=datAY))
summary(glmY)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3171 12.1 261.5 0.0e+00
## A -487 16.7 -29.2 5.4e-173
We can also compare with a naïve method that ignores reweighting.
glmYn <- glm(Y ~ A, data=datAY)
summary(glmYn)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3261 9.78 333.6 0
## A -692 15.08 -45.9 0
We can also use the function gen_cop_pars()
to sample
parameters for the copula, rather than having to specify a vector for
every bivariate copula. To do this, we simply provide the list of
formulas, the dataset, and a range that we would like the (partial)
correlations to ultimately fall into.
For example, if we want the correlations in the range \((-0.5,0)\), we can call
pars2 <- pars
pars2$cop <- gen_cop_pars(formulas = forms, data=dat, range=c(-0.5,0))
This method gives (partial) correlations -0.155, -0.187, and -0.28.
We can then simulate using this set of parameters.
datAY2 <- rfrugalParam(formulas=forms, family=fams, pars=pars2, dat=dat)
We can check the parameters in the same manner as above, and indeed we obtain an intercept of 3189 (se \(=\) 8.8) and a causal effect of -500.2 (13.2).
We can also use a more complicated method for generating copula
parameters. Suppose we want these to depend upon x_2
, which
is a factor with six levels. Then we specify
forms3 <- list(list(),
list(A ~ x_1 + x_3 + x_4),
list(Y ~ A),
list(~ x_2))
and simulate as:
pars3 <- pars
pars3$cop <- gen_cop_pars(formulas = forms3, data=dat, range=c(-0.5,0))
datAY3 <- rfrugalParam(formulas=forms3, family=fams, pars=pars3, dat=dat)
This time we obtain an intercept of 3171 (12.2) and a causal effect of -492.4 (17.6). In this case the copula parameter is a vector of six entries; for example for the \(Y\)-\(X_1\) effect this is -1.54, 0.03, 0, -0.02, -0.02, 0.03.
Until now we have only used Gaussian copulas, but we can easily
specify a different family. For example, suppose we wish to consider
Student t-copulas. In this case we need to choose a degrees of freedom
parameter, though this is easily managed with
gen_cop_pars()
which allows additional arguments as
...
.
pars4 <- pars3
(pars4$cop <- gen_cop_pars(forms3, data=dat, range=c(-0.5,0), par2=4))
## $Y
## $Y$x_1
## $Y$x_1$beta
## [1] -1.47732 0.00689 -0.02940 0.05119 -0.03702 -0.00343
##
## $Y$x_1$par2
## [1] 4
##
##
## $Y$x_3
## $Y$x_3$beta
## [1] -1.50961 -0.00347 -0.00272 -0.03956 -0.00152 0.02338
##
## $Y$x_3$par2
## [1] 4
##
##
## $Y$x_4
## $Y$x_4$beta
## [1] -1.47034 -0.02513 0.02569 0.02009 0.00919 -0.02600
##
## $Y$x_4$par2
## [1] 4
fams4 <- fams
fams4[[4]] <- 2
datAY4 <- rfrugalParam(formulas=forms3, family=fams4, pars=pars4, dat=dat)
Now the estimates are 3175 (11.4) and a causal effect of -491.8 (15.8).
We also introduce the function adj_vars()
, which allows
users to modify the strength of the partial correlations (or other
parameters) by scaling the parameters of the linear predictor. The
argument factor
can be modified to control the strength of
the strong
and weak
variables respectively;
the defauly values are 5 and 0.2. Returning to the example above,
suppose that we want x_4
to be much more closely related
than x_1
or x_3
. Then we can apply:
pars5 <- pars3
pars_tmp <- gen_cop_pars(forms3, data=dat, range=c(-0.25,0))
(pars5$cop <- causl:::adj_vars(pars_tmp, strong="x_4", weak=c("x_1","x_3"), factor=c(2,0.25)))
## $Y
## $Y$x_1
## $Y$x_1$beta
## [1] -0.310090 0.001969 -0.002711 -0.000179 -0.001014 -0.003484
##
##
## $Y$x_3
## $Y$x_3$beta
## [1] -3.05e-01 2.82e-03 -3.29e-04 -8.65e-05 5.67e-04 -2.26e-03
##
##
## $Y$x_4
## $Y$x_4$beta
## [1] -2.486542 0.006965 -0.005539 0.000561 0.048086 -0.016233
fams5 <- fams
fams5[[4]] <- 5 # use Frank copula
datAY5 <- rfrugalParam(formulas=forms3, family=fams5, pars=pars5, dat=dat)
This time the estimates are 3206 (8.2) and a causal effect of -507.7 (12.7).