************************************************************************
************************************************************************
****                                                                ****
****   chap9.do                                                     ****
****                                                                ****
****   Chapter 9 -- Missing Data, Imputation                        ****
****   Stata do file to reproduce examples 9.1-2                    ****
****   in Chapter 9 of                                              ****
****   Snijders, Tom A.B., and Bosker, Roel J.                      ****
****   Multilevel Analysis:                                         ****
****   An Introduction to Basic and Advanced Multilevel Modeling,   ****
****   second edition                                               ****
****   London etc.: Sage Publishers, 2012                           ****
****                                                                ****
****   Contributed by Jon Fahlander                                 ****
****                                                                ****
************************************************************************
************************************************************************

*Simulating the data used throughout the chapter
clear
set obs 4000
gen id = _n
gen group = group(20)
gen X1 = invnormal(uniform())
egen meanX1 = mean(X1), by(group)
gen R1 = invnormal(uniform())
drawnorm u, mean(0) sd(0.4)
bysort group: gen U = u if _n==1
replace U = U[_n-1] if U==.
table U
gen Y1 = 0.5 + 0.2*X1 + U + R1
gen E1 = invnormal(uniform())
gen X2 = X1 + Y1 + E1
hist X2
replace Y1 = . if X2>2

*The estimated model after listwise deletion
*Model 1
xtmixed Y1 X1 meanX1 || group: , var

**Model 2: multiple imputation while ignoring the multilevel structure
*Ignoring the multilevel structure lets us use Stata's "mi" suite of multiple imputation commands
egen meanX2 = mean(X2), by(group)
mi set mlong                      // tells Stata that the dataset is to be imputed, and in which format
mi register imputed Y1            // tells Stata that Y1 is the variable to be imputed
mi describe
mi impute reg Y1 X1 X2, add(5)    // imputes Y1 five times, using a single-level regression model
*Model 2
xtmixed Y1 X1 meanX1 if _mi_m==0 | _mi_m==1 || group: , var   // running the multilevel model of interest on the first completed dataset (original data plus imputation 1)

*Model 3
*Resimulate the data
clear
set obs 4000
gen id = _n
gen group = group(20)
gen X1 = invnormal(uniform())
egen meanX1 = mean(X1), by(group)
gen R1 = invnormal(uniform())
drawnorm u, mean(0) sd(0.4)
bysort group: gen U = u if _n==1
replace U = U[_n-1] if U==.
table U
gen Y1 = 0.5 + 0.2*X1 + U + R1
xtmixed Y1 X1 || group:
gen E1 = invnormal(uniform())
gen X2 = X1 + Y1 + E1
egen meanX2 = mean(X2), by(group)
replace Y1 = . if X2>2
save "/MI_simulatedData.dta", replace

**The imputation
**Creating 50 imputed dependent variables.
*The new variables are stored as Y1imp1-Y1imp50
set more off
forvalues i=1/50 {
	xtmixed Y1 X1 X2 meanX1 meanX2 || group:
	matrix A = e(b)
	local sd2 = exp(A[1,6])                 // log sd of the random intercept sits in column 6 of e(b)
	drawnorm u`i', mean(0) sd(`sd2')
	local sd1 = exp(A[1,7])                 // log sd of the residual sits in column 7 of e(b)
	drawnorm e`i', mean(0) sd(`sd1')
	predict y1hat                           // prediction from the fitted model
	gen samp = e(sample)                    // 1 = Y1 observed (used in estimation), 0 = Y1 missing
	gen Y1imp`i' = Y1                       // keep the observed values as they are
	replace Y1imp`i' = y1hat + u`i' + e`i' if samp==0   // impute the missing ones: prediction plus random draws
	drop y1hat samp e`i' u`i'
}

**Run the regression on each newly imputed dependent variable and store the fixed effects and their variances
*1) Create place-holders for the estimates: matrices with 50 columns
matrix estimates = J(5,50,.)
matrix variances = J(3,50,.)
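
*Aside (not part of the book example): the loop above pulls the two log standard deviations out
*of e(b) by column position (columns 6 and 7 for a model with four covariates plus a constant).
*A minimal sketch of an alternative that refers to them by name instead, assuming the equation
*names xtmixed normally stores (lns1_1_1 for the random intercept, lnsig_e for the residual):
xtmixed Y1 X1 X2 meanX1 meanX2 || group:
local sdU = exp(_b[lns1_1_1:_cons])     // level-2 (group) standard deviation
local sdE = exp(_b[lnsig_e:_cons])      // level-1 (residual) standard deviation
display "group sd = `sdU'   residual sd = `sdE'"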
*Have a look at the matrices we just created
matrix list estimates
matrix list variances

*2) Run 50 regressions on the 50 different versions of the dependent variable and store the results in the place-holders
set more off
forvalues i=1/50 {
	xtmixed Y1imp`i' X1 meanX1 || group:
	matrix A = e(b)'
	matrix VAR = e(V)'
	forvalues j=1/5 {
		matrix estimates[`j',`i'] = A[`j',1]
	}
	forvalues j=1/3 {
		matrix variances[`j',`i'] = VAR[`j',`j']
	}
}

*Name each row so we remember which estimates they represent
matrix rownames estimates = X1 meanX1 const logSdU logSde
matrix rownames variances = X1_est_variance meanX1_est_variance const_est_variance
matrix list estimates
matrix list variances

*The average fixed effect estimate
*(install matfunc in order to easily sum each row)
net cd http://www.fss.uu.nl/soc/iscore/stata/
net install matfunc

**Equation 9.3
*Sum all 50 fixed effect estimates and divide by 50
matsum estimates, r(sumM) d
matrix thetaBar = sumM/50
matrix list thetaBar

**The average standard-error variances of the fixed effects
**Equation 9.4
matsum variances, r(sumV) d
matrix variancesBar = sumV/50
matrix list variancesBar

*The between-imputation variance in theta
*Equation 9.5
*As a step in the calculation, create a conformable matrix with the mean estimates repeated in 50 columns
matrix repThetaBar = thetaBar
forvalues i=1/49 {
	matrix repThetaBar = repThetaBar, thetaBar
}
matrix list repThetaBar
matrix list estimates

*Subtract the mean estimates
matrix betweenImpDiff = repThetaBar - estimates
matrix list betweenImpDiff

*A place-holder for the squared differences
matrix betweenImpDiff2 = J(3,50,.)

*Square the differences with a loop across the three fixed-effect rows and all 50 columns
forvalues i = 1/3 {
	forvalues j = 1/50 {
		matrix betweenImpDiff2[`i',`j'] = betweenImpDiff[`i',`j'] * betweenImpDiff[`i',`j']
	}
}

*Squared differences
matrix list betweenImpDiff2

*Sum each row and divide by M-1 = 50-1
*Equation 9.5
matsum betweenImpDiff2, r(SUMbetweenImpVar) d
matrix EbetweenImpVar = SUMbetweenImpVar/(50-1)
matrix list EbetweenImpVar

*The standard error of the combined estimate
*Equation 9.6
matrix thetaBarSD = J(3,1,.)
matrix list thetaBarSD
matrix list variancesBar
matrix list EbetweenImpVar
forvalues i = 1/3 {
	matrix thetaBarSD[`i',1] = (variancesBar[`i',1] + (1 + 1/50)*EbetweenImpVar[`i',1])^(1/2)
}
matrix rownames thetaBarSD = X1 meanX1 const

*The combined estimates and their standard errors
matrix list thetaBar
matrix list thetaBarSD

*Matrix with row and column names
matrix results = J(3,2,.)
matrix results = thetaBar[1..3,1], thetaBarSD
matrix rownames results = X1 meanX1 const
matrix colnames results = mean sd

*Results model 3
matrix list results

**********************************************
**********************************************
**                                          **
** Multiple imputation by chained equations **
** -- the plug-in solution (p. 146)         **
**                                          **
**********************************************
**********************************************

*There is no canned routine that applies multiple imputation by chained equations
*in a way that takes the multilevel structure of the data into account. It is, however, not too
*difficult to implement such a routine from scratch. Here is an example of the
*plug-in method discussed on page 146.

*Load the data
*(found on the book homepage in a file called "mlbook2_data_preparations.zip")
clear all
use "mlbook_2nded_withmissingvalues.dta", clear
keep langPOST langPRET pupilNR_new schoolnr IQ_verb IQ_perf ses Minority
recode _all (-99=.)
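
*Optional sketch (not in the original script): before turning to the user-written mdesc and
*mvpatterns commands installed below, Stata's built-in misstable command already gives a quick
*overview of how much is missing and in which combinations.
misstable summarize langPOST langPRET IQ_verb IQ_perf ses Minority
misstable patterns  langPOST langPRET IQ_verb IQ_perf ses Minority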
sum _all

*Demean ses and IQ_verb
egen meanSes = mean(ses)
egen meanIQ = mean(IQ_verb)
gen demeanSES = ses - meanSes
gen demeanIQ = IQ_verb - meanIQ
drop IQ_verb ses
rename (demeanIQ demeanSES) (IQ_verb ses)
sum _all

*Inspect the data to identify the missing values
inspect _all
hist IQ_verb
hist IQ_perf
tab langPOST
tab langPRET
tab ses
tab pupilNR_new, missing
tab schoolnr

*Install mvpatterns and mdesc to inspect the missing-values pattern
*For some reason mvpatterns cannot be found with "findit", but the following works:
net install dm91, from(http://www.stata.com/stb/stb61)
help mvpatterns
*To install mdesc type
net install mdesc
*and if that does not work try
findit mdesc
*and then click on the suggested links to install.

*Inspect the missing values
mdesc _all
mvpatterns _all

*The mvpatterns command gives you the number of individuals that have a given
*pattern of missing variables. In this case 3464 individuals have no missing values,
*and 302 individuals have missing values in only the third of the variables with missing values.
*Only a few observations have missing values across several variables, which is good.

set more off
keep langPOST langPRET pupilNR_new IQ_verb IQ_perf ses Minority schoolnr
order schoolnr langPOST langPRET pupilNR_new IQ_verb IQ_perf ses Minority

*Initialise the process by filling in the missing observations
foreach var of varlist langPOST langPRET pupilNR_new IQ_verb IQ_perf ses Minority {
	gen `var'_m = missing(`var')            // indicator for originally missing values (also used in the check sketched at the end of this file)
	xtmixed `var' || schoolnr:
	matrix A = e(b)
	local sd2 = exp(A[1,2])                 // log sd of the random intercept
	drawnorm u, mean(0) sd(`sd2')
	local sd1 = exp(A[1,3])                 // log sd of the residual
	drawnorm e, mean(0) sd(`sd1')
	predict yhat
	predict group_res, reffects
	replace group_res = u if group_res==.
	replace `var' = yhat + group_res + e if `var'==.
	drop yhat group_res e u
}

*Create interactions and group-mean variables
egen gmeanIQverb = mean(IQ_verb) if IQ_verb!=., by(schoolnr)
egen gmeanSES = mean(ses) if ses!=., by(schoolnr)
gen IQXSES = IQ_verb*ses
gen IQMeanXsesMean = gmeanSES * gmeanIQverb

*Save it
save "/Users/Jon/Documents/Snijders/SecondEditionExamples/mlbook_2nded_withmissingvalues_FirstImputedDataset.dta", replace

*Number the variables to make it easier to do a loop
rename (langPOST langPRET pupilNR_new IQ_verb IQ_perf ses Minority ///
        langPOST_m langPRET_m pupilNR_new_m IQ_verb_m IQ_perf_m ses_m Minority_m) ///
       (var1 var2 var3 var4 var5 var6 var0 ///
        var1_m var2_m var3_m var4_m var5_m var6_m var0_m)
order schoolnr var1 var2 var3 var4 var5 var6 var0

*The plug-in solution mentioned on page 146: ignore the uncertainty in the betas by just plugging in the point estimates
*All dependent variables are treated with linear multilevel models. This could easily be improved upon.
forvalues j=1/300 {     // 300 loops are asked for here (to make nice trace plots, see below); to save time you might try 5-10 instead
	set more off
	local i = 1
	while `i'<=5 {
		local last = `i'-1
		local next = `i'+1
		quietly xtmixed var`i' var`next' - var`last' || schoolnr:
		matrix A = e(b)
		matrix list A
		local sd2 = exp(A[1,8])         // log sd of the random intercept
		drawnorm u, mean(0) sd(`sd2')
		local sd1 = exp(A[1,9])         // log sd of the residual
		drawnorm e, mean(0) sd(`sd1')
		predict yhat
		predict group_res, reffects
		replace group_res = u if group_res==.
		replace var`i' = yhat + group_res + e if var`i'_m==1
		drop yhat group_res e u
		local last = `i'-1
		order var`i', after(var`last')
		local i = `i'+1
	}
	quietly xtmixed var6 var0 - var5 || schoolnr:
	matrix A = e(b)
	matrix list A
	local sd2 = exp(A[1,8])
	drawnorm u, mean(0) sd(`sd2')
	local sd1 = exp(A[1,9])
	drawnorm e, mean(0) sd(`sd1')
	predict yhat
	predict group_res, reffects
	replace group_res = u if group_res==.
	replace var6 = yhat + group_res + e if var6_m==1
	drop yhat group_res e u
	order schoolnr var1 var2 var3 var4 var5 var6 var0

	*Calculate the interaction terms based on the new data
	drop gmeanIQverb gmeanSES IQXSES IQMeanXsesMean
	egen gmeanIQverb = mean(var4) if var4!=., by(schoolnr)
	egen gmeanSES = mean(var6) if var6!=., by(schoolnr)
	gen IQXSES = var4*var6
	gen IQMeanXsesMean = gmeanSES * gmeanIQverb

	*Regress the final model of interest
	xtmixed var1 var4 var6 IQXSES gmeanIQverb gmeanSES IQMeanXsesMean || schoolnr:

	*Save the results
	matrix A`j' = e(b)
	matrix list A`j'

	*A counter to see how far you have come
	disp "HEY THERE! I just finished loop nr " `j'
}

*Listing all the matrices
forvalues i=1/300 {
	matrix list A`i'
}

*Creating one big matrix
matrix B = A1
forvalues i=2/300 {
	matrix B = B\A`i'
}
matrix list B
svmat B

*Rename the columns in the order the coefficients appear in e(b):
*var4 (IQ_verb), var6 (ses), IQXSES, gmeanIQverb, gmeanSES, IQMeanXsesMean, the constant, and the two log standard deviations
rename (B1 B2 B3 B4 B5 B6 B7 B8 B9) ///
       (IQ_verb ses IQXSES gmeanIQverb gmeanSES IQMeanXsesMean cons logVar2 logVar1)
save "/Users/Jon/Documents/Snijders/SecondEditionExamples/mlbook_2nded_IMPUTED300.dta", replace

*Graphs to visually inspect the result
*The estimates vary but seem to have converged to an equilibrium
use "/Users/Jon/Documents/Snijders/SecondEditionExamples/mlbook_2nded_IMPUTED300.dta", clear
gen iteration = _n
twoway (line IQ_verb iteration, sort), name(IQ_verb)
twoway (line ses iteration, sort), name(ses)
twoway (line gmeanIQverb iteration, sort), name(gmeanIQverb)
twoway (line gmeanSES iteration, sort), name(gmeanSES)
twoway (line IQXSES iteration, sort), name(IQXSES)
twoway (line IQMeanXsesMean iteration, sort), name(IQMeanXsesMean)
twoway (line cons iteration, sort), name(intercept)
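
*Optional sketch (not part of the original script): the seven trace plots above are drawn as
*separate graphs; graph combine puts them on one page, which makes it easier to judge
*convergence at a glance. The graph names are the ones assigned in the twoway calls above.
graph combine IQ_verb ses gmeanIQverb gmeanSES IQXSES IQMeanXsesMean intercept, ///
	cols(2) name(traceplots, replace) title("Trace plots across the 300 chained-equation loops")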
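
*Optional check (not in the original script), referred to in the initialisation loop above:
*the *_m indicators make it easy to compare the imputed values with the observed ones for any
*variable; var4 (the demeaned IQ_verb) is used here purely as an illustration.
twoway (kdensity var4 if var4_m==0) (kdensity var4 if var4_m==1), ///
	legend(order(1 "observed" 2 "imputed")) name(check_var4, replace)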