Introduction

This document describes how to use the causl package to perform plasmode simulation; that is, producing datasets that combining real covariates (and possibly treatments) with synthetic outcomes. This enables us to have realistic datasets, but also to have knowledge of the underlying causal effect, making them very useful for testing the relative effectiveness of different inference methods.

Start by loading the causl and dplyr packages.

library(dplyr)
library(causl)

Dataset

We will use the dataset from a competition held at the Atlantic Causal Inference Conference in 2016. To obtain this, first run the following commands:

install.packages("remotes")
remotes::install_github("vdorie/aciccomp/2016")

Then load the aciccomp2016 package.

library(aciccomp2016)
(dat <- as_tibble(input_2016)) # show 10 rows of first few variables
## # A tibble: 4,802 × 58
##      x_1 x_2     x_3   x_4   x_5   x_6   x_7   x_8   x_9  x_10  x_11  x_12  x_13
##    <int> <fct> <dbl> <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int>
##  1    29 C         1     7    60    85     0     0     1     0     0     1     0
##  2    27 C         0     0    64   178     0     0     0     0     0     2     1
##  3    27 C         0     0    60   102     0     0     0     0     0     1     0
##  4    37 C         0     0    65   174     0     0     0     0     0     1     0
##  5    24 C        20    14    63   129     0     0     0     0     0     1     0
##  6    27 C        40    15    63   135     0     0     0     0     0     1     1
##  7    26 C        20     8    69   140     0     0     0     1     0     0     0
##  8    33 C         0     0    60   110     0     0     0     1     0     0     1
##  9    28 C         3     5    61   160     0     0     0     0     0     3     1
## 10    31 C         0     0    63   114     0     1     0     0     0     0     0
## # ℹ 4,792 more rows
## # ℹ 45 more variables: x_14 <int>, x_15 <int>, x_16 <int>, x_17 <int>,
## #   x_18 <int>, x_19 <int>, x_20 <int>, x_21 <fct>, x_22 <int>, x_23 <int>,
## #   x_24 <fct>, x_25 <int>, x_26 <int>, x_27 <int>, x_28 <int>, x_29 <int>,
## #   x_30 <int>, x_31 <int>, x_32 <int>, x_33 <int>, x_34 <int>, x_35 <int>,
## #   x_36 <int>, x_37 <int>, x_38 <int>, x_39 <int>, x_40 <int>, x_41 <int>,
## #   x_42 <int>, x_43 <int>, x_44 <int>, x_45 <int>, x_46 <int>, x_47 <int>, …

Model

Let us consider a model for the causal effect of smoking on birthweight. We will start with a binary variable \(A\) to indicate whether the mother smoked during the pregnancy, and later extend this to a zero-inflated continuous variable. We start by setting up the model inputs:

forms <- list(list(),
              list(A ~ x_1 + x_3 + x_4),
              list(Y ~ A),
              list(~ 1))
fams <- list(integer(0), 5, 1, 1)
pars <- list(A = list(beta=c(-1.5,0.03,0.02,0.05)),
             Y = list(beta=c(3200, -500), phi=400^2),
             cop = list(beta=-1))

Then call rfrugalParam to simulate \(A\) and \(Y\).

set.seed(123)   # to obtain consistent results
datAY <- rfrugalParam(formulas=forms, family=fams, pars=pars, dat=dat)

We can now check that this basic simulation was performed correctly. First we fit the treatment variable using the correct model.

glmA <- glm(A ~ x_1 + x_3 + x_4, family=binomial, data=datAY)
summary(glmA)$coefficients
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -1.6041    0.14387  -11.15 7.18e-29
## x_1           0.0340    0.00580    5.86 4.61e-09
## x_3           0.0250    0.00365    6.83 8.49e-12
## x_4           0.0469    0.00723    6.48 9.24e-11

Indeed, the parameters appear correct. Then we can use inverse weighting to estimate the parameters for the outcome model. We will need to load the survey package to get robust standard errors after the weighting.

library(survey)
ps <- predict(glmA, type="response")
wt <- datAY$A/ps + (1-datAY$A)/(1-ps)
glmY <- svyglm(Y ~ A, design = svydesign(~1, weights = wt, data=datAY))
summary(glmY)$coef
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)     3171       12.1   261.5  0.0e+00
## A               -487       16.7   -29.2 5.4e-173

We can also compare with a naïve method that ignores reweighting.

glmYn <- glm(Y ~ A, data=datAY)
summary(glmYn)$coef
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)     3261       9.78   333.6        0
## A               -692      15.08   -45.9        0

Automatic parameter generation

We can also use the function gen_cop_pars() to sample parameters for the copula, rather than having to specify a vector for every bivariate copula. To do this, we simply provide the list of formulas, the dataset, and a range that we would like the (partial) correlations to ultimately fall into.

For example, if we want the correlations in the range \((-0.5,0)\), we can call

pars2 <- pars
pars2$cop <- gen_cop_pars(formulas = forms, data=dat, range=c(-0.5,0))

This method gives (partial) correlations -0.155, -0.187, and -0.28.

We can then simulate using this set of parameters.

datAY2 <- rfrugalParam(formulas=forms, family=fams, pars=pars2, dat=dat)

We can check the parameters in the same manner as above, and indeed we obtain an intercept of 3189 (se \(=\) 8.8) and a causal effect of -500.2 (13.2).

We can also use a more complicated method for generating copula parameters. Suppose we want these to depend upon x_2, which is a factor with six levels. Then we specify

forms3 <- list(list(),
              list(A ~ x_1 + x_3 + x_4),
              list(Y ~ A),
              list(~ x_2))

and simulate as:

pars3 <- pars
pars3$cop <- gen_cop_pars(formulas = forms3, data=dat, range=c(-0.5,0))
datAY3 <- rfrugalParam(formulas=forms3, family=fams, pars=pars3, dat=dat)

This time we obtain an intercept of 3171 (12.2) and a causal effect of -492.4 (17.6). In this case the copula parameter is a vector of six entries; for example for the \(Y\)-\(X_1\) effect this is -1.54, 0.03, 0, -0.02, -0.02, 0.03.

Other copula families

Until now we have only used Gaussian copulas, but we can easily specify a different family. For example, suppose we wish to consider Student t-copulas. In this case we need to choose a degrees of freedom parameter, though this is easily managed with gen_cop_pars() which allows additional arguments as ....

pars4 <- pars3
(pars4$cop <- gen_cop_pars(forms3, data=dat, range=c(-0.5,0), par2=4))
## $Y
## $Y$x_1
## $Y$x_1$beta
## [1] -1.47732  0.00689 -0.02940  0.05119 -0.03702 -0.00343
## 
## $Y$x_1$par2
## [1] 4
## 
## 
## $Y$x_3
## $Y$x_3$beta
## [1] -1.50961 -0.00347 -0.00272 -0.03956 -0.00152  0.02338
## 
## $Y$x_3$par2
## [1] 4
## 
## 
## $Y$x_4
## $Y$x_4$beta
## [1] -1.47034 -0.02513  0.02569  0.02009  0.00919 -0.02600
## 
## $Y$x_4$par2
## [1] 4
fams4 <- fams
fams4[[4]] <- 2
datAY4 <- rfrugalParam(formulas=forms3, family=fams4, pars=pars4, dat=dat)

Now the estimates are 3175 (11.4) and a causal effect of -491.8 (15.8).

Strength of relationships

We also introduce the function adj_vars(), which allows users to modify the strength of the partial correlations (or other parameters) by scaling the parameters of the linear predictor. The argument factor can be modified to control the strength of the strong and weak variables respectively; the defauly values are 5 and 0.2. Returning to the example above, suppose that we want x_4 to be much more closely related than x_1 or x_3. Then we can apply:

pars5 <- pars3
pars_tmp <- gen_cop_pars(forms3, data=dat, range=c(-0.25,0))
(pars5$cop <- causl:::adj_vars(pars_tmp, strong="x_4", weak=c("x_1","x_3"), factor=c(2,0.25)))
## $Y
## $Y$x_1
## $Y$x_1$beta
## [1] -0.310090  0.001969 -0.002711 -0.000179 -0.001014 -0.003484
## 
## 
## $Y$x_3
## $Y$x_3$beta
## [1] -3.05e-01  2.82e-03 -3.29e-04 -8.65e-05  5.67e-04 -2.26e-03
## 
## 
## $Y$x_4
## $Y$x_4$beta
## [1] -2.486542  0.006965 -0.005539  0.000561  0.048086 -0.016233
fams5 <- fams
fams5[[4]] <- 5  # use Frank copula
datAY5 <- rfrugalParam(formulas=forms3, family=fams5, pars=pars5, dat=dat)

This time the estimates are 3206 (8.2) and a causal effect of -507.7 (12.7).