This document describes how to use the causl package to
perform plasmode simulation; that is, producing datasets that combining
real covariates (and possibly treatments) with synthetic outcomes. This
enables us to have realistic datasets, but also to have knowledge of the
underlying causal effect, making them very useful for testing the
relative effectiveness of different inference methods.
Start by loading the causl and dplyr
packages.
library(dplyr)
library(causl)
We will use the dataset from a competition held at the Atlantic Causal Inference Conference in 2016. To obtain this, first run the following commands:
install.packages("remotes")
remotes::install_github("vdorie/aciccomp/2016")
Then load the aciccomp2016 package.
library(aciccomp2016)
(dat <- as_tibble(input_2016)) # show 10 rows of first few variables
## # A tibble: 4,802 × 58
## x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10 x_11 x_12 x_13
## <int> <fct> <dbl> <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 29 C 1 7 60 85 0 0 1 0 0 1 0
## 2 27 C 0 0 64 178 0 0 0 0 0 2 1
## 3 27 C 0 0 60 102 0 0 0 0 0 1 0
## 4 37 C 0 0 65 174 0 0 0 0 0 1 0
## 5 24 C 20 14 63 129 0 0 0 0 0 1 0
## 6 27 C 40 15 63 135 0 0 0 0 0 1 1
## 7 26 C 20 8 69 140 0 0 0 1 0 0 0
## 8 33 C 0 0 60 110 0 0 0 1 0 0 1
## 9 28 C 3 5 61 160 0 0 0 0 0 3 1
## 10 31 C 0 0 63 114 0 1 0 0 0 0 0
## # ℹ 4,792 more rows
## # ℹ 45 more variables: x_14 <int>, x_15 <int>, x_16 <int>, x_17 <int>,
## # x_18 <int>, x_19 <int>, x_20 <int>, x_21 <fct>, x_22 <int>, x_23 <int>,
## # x_24 <fct>, x_25 <int>, x_26 <int>, x_27 <int>, x_28 <int>, x_29 <int>,
## # x_30 <int>, x_31 <int>, x_32 <int>, x_33 <int>, x_34 <int>, x_35 <int>,
## # x_36 <int>, x_37 <int>, x_38 <int>, x_39 <int>, x_40 <int>, x_41 <int>,
## # x_42 <int>, x_43 <int>, x_44 <int>, x_45 <int>, x_46 <int>, x_47 <int>, …
Let us consider a model for the causal effect of smoking on birthweight. We will start with a binary variable \(A\) to indicate whether the mother smoked during the pregnancy, and later extend this to a zero-inflated continuous variable. We start by setting up the model inputs:
forms <- list(list(),
list(A ~ x_1 + x_3 + x_4),
list(Y ~ A),
list(~ 1))
fams <- list(integer(0), 5, 1, 1)
pars <- list(A = list(beta=c(-1.5,0.03,0.02,0.05)),
Y = list(beta=c(3200, -500), phi=400^2),
cop = list(beta=-1))
Then call rfrugalParam to simulate \(A\) and \(Y\).
set.seed(123) # to obtain consistent results
datAY <- rfrugalParam(formulas=forms, family=fams, pars=pars, dat=dat)
## Inversion method selected: using pair-copula parameterization
We can now check that this basic simulation was performed correctly. First we fit the treatment variable using the correct model.
glmA <- glm(A ~ x_1 + x_3 + x_4, family=binomial, data=datAY)
summary(glmA)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.4378 0.14246 -10.09 5.97e-24
## x_1 0.0292 0.00577 5.06 4.09e-07
## x_3 0.0196 0.00360 5.44 5.22e-08
## x_4 0.0423 0.00716 5.91 3.46e-09
Indeed, the parameters appear correct. Then we can use inverse
probability weighting (IPW) to estimate the parameters for the outcome
model. We will need to load the survey package to get
robust standard errors after the weighting.
library(survey)
ps <- predict(glmA, type="response")
wt <- datAY$A/ps + (1-datAY$A)/(1-ps)
glmY <- svyglm(Y ~ A, design = svydesign(~1, weights = wt, data=datAY))
summary(glmY)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3205 8.4 381.6 0
## A -520 12.3 -42.3 0
We can also compare with a naïve method that ignores reweighting.
glmYn <- glm(Y ~ A, data=datAY)
summary(glmYn)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3257 7.37 442.2 0
## A -640 11.42 -56.1 0
We can also use the function gen_cop_pars() to sample
parameters for the copula, rather than having to specify a vector for
every bivariate copula. To do this, we simply provide the list of
formulas, the dataset, and a range that we would like the (partial)
correlations to ultimately fall into.
For example, if we want the correlations in the range \((-0.5,0)\), we can call
pars2 <- pars
pars2$cop <- gen_cop_pars(formulas = forms, data=dat, range=c(-0.5,0))
This method gives (partial) correlations -0.078, -0.143, and 0.095.
We can then simulate using this set of parameters.
set.seed(124)
datAY2 <- rfrugalParam(formulas=forms, family=fams, pars=pars2, dat=dat)
## Inversion method selected: using pair-copula parameterization
We can check the parameters in the same manner as above, and indeed we obtain an intercept of 3194 (se \(=\) 7.9) and a causal effect of -496.8 (12).
We can also use a more complicated method for generating copula
parameters. Suppose we want these to depend upon x_2, which
is a factor with six levels. Then we specify
forms3 <- list(list(),
list(A ~ x_1 + x_3 + x_4),
list(Y ~ A),
list(~ x_2))
and simulate as:
set.seed(125)
pars3 <- pars
pars3$cop <- gen_cop_pars(formulas = forms3, data=dat, range=c(-0.5,0))
datAY3 <- rfrugalParam(formulas=forms3, family=fams, pars=pars3, dat=dat)
## Inversion method selected: using pair-copula parameterization
This time we obtain an intercept of 3199 (8.5) and a causal effect of -489.1 (12.4). In this case the copula parameter is a vector of six entries; for example for the \(Y\)-\(X_1\) effect this is (-1.47, -0.01, 0.04, 0, 0.01, -0.05).
Until now we have only used Gaussian copulas, but we can easily
specify a different family. For example, suppose we wish to consider
Student t-copulas. In this case we need to choose a degrees of freedom
parameter, though this is easily managed with
gen_cop_pars() which allows additional arguments as
....
set.seed(126)
pars4 <- pars3
(pars4$cop <- gen_cop_pars(forms3, data=dat, range=c(-0.5,0), par2=4))
## $Y
## $Y$x_1
## $Y$x_1$beta
## [1] -1.49492 0.00826 -0.01525 0.01971 -0.01109 -0.00152
##
## $Y$x_1$par2
## [1] 4
##
##
## $Y$x_3
## $Y$x_3$beta
## [1] -1.51246 0.01611 -0.01106 -0.00215 -0.01853 0.02195
##
## $Y$x_3$par2
## [1] 4
##
##
## $Y$x_4
## $Y$x_4$beta
## [1] -1.517160 0.000418 -0.004662 -0.021514 -0.002447 -0.010080
##
## $Y$x_4$par2
## [1] 4
fams4 <- fams
fams4[[4]] <- 2
datAY4 <- rfrugalParam(formulas=forms3, family=fams4, pars=pars4, dat=dat)
## Inversion method selected: using pair-copula parameterization
Now the estimates are 3218 (8.5) and a causal effect of -498.1 (12.3).
We also introduce the function adj_vars(), which allows
users to modify the strength of the partial correlations (or other
parameters) by scaling the parameters of the linear predictor. The
argument factor can be modified to control the strength of
the strong and weak variables respectively;
the default values are 5 and 0.2. Returning to the example above,
suppose that we want x_4 to be much more closely related
than x_1 or x_3. Then we can apply:
set.seed(127)
pars5 <- pars3
pars_tmp <- gen_cop_pars(forms3, data=dat, range=c(-0.25,0))
(pars5$cop <- causl:::adj_vars(pars_tmp, strong="x_4", weak=c("x_1","x_3"), factor=c(2,0.25)))
## $Y
## $Y$x_1
## $Y$x_1$beta
## [1] -3.16e-01 -2.12e-03 -1.29e-03 4.74e-06 2.13e-03 2.60e-03
##
##
## $Y$x_3
## $Y$x_3$beta
## [1] -0.308757 -0.000327 0.001470 0.000348 -0.000276 0.001578
##
##
## $Y$x_4
## $Y$x_4$beta
## [1] -2.500827 -0.005808 -0.002827 0.017982 -0.000621 -0.015468
fams5 <- fams
fams5[[4]] <- 5 # use Frank copula
datAY5 <- rfrugalParam(formulas=forms3, family=fams5, pars=pars5, dat=dat)
## Inversion method selected: using pair-copula parameterization
This time the estimates are 3202 (8.1) and a causal effect of -500.5 (12.4).