Radon data

Example from Gelman and Hill (2007) - more info in lecture notes.

The measurement \(y\) is the log of the radon level.

Measurements can be taken in the basement or on the ground floor: \(x=0\) for basement, \(x=1\) for ground floor.

Background radon levels vary from county to county, depending on local geology. It is natural to suppose that this background determines a baseline level \(\alpha_j\) (in county \(j\)) and each house varies around that local mean. Baseline levels are not unrelated – they are assumed to vary around the Minnesota-mean level \(\alpha\).

This leads us to a hierarchical model (lectures) with a random effect on the intercept.

[There is another covariate \(u_j\) giving the (log of the) soil uranium yield in county \(j\). We ignore this.]

options(digits = 3)
rad <- read.table("http://www.stats.ox.ac.uk/~laws/LMs/data/radon2.txt", header = TRUE)
rad[c(1:5, 917:919), ]
##     radon floor uranium county
## 1     2.2     1   0.502      1
## 2     2.2     0   0.502      1
## 3     2.9     0   0.502      1
## 4     1.0     0   0.502      1
## 5     3.1     0   0.429      2
## 917   5.0     0   0.914     84
## 918   3.7     0   1.427     85
## 919   2.9     0   1.427     85
attach(rad)
y <- log (ifelse (radon==0, .1, radon))
n <- length(y)
x <- floor
u <- log(uranium)
summary(rad)
##      radon          floor          uranium          county    
##  Min.   : 0.0   Min.   :0.000   Min.   :0.414   Min.   : 1.0  
##  1st Qu.: 1.9   1st Qu.:0.000   1st Qu.:0.622   1st Qu.:21.0  
##  Median : 3.6   Median :0.000   Median :0.908   Median :44.0  
##  Mean   : 4.8   Mean   :0.166   Mean   :0.934   Mean   :43.5  
##  3rd Qu.: 6.0   3rd Qu.:0.000   3rd Qu.:1.201   3rd Qu.:70.0  
##  Max.   :48.2   Max.   :1.000   Max.   :1.696   Max.   :85.0
plot(rad)

par(mfrow = c(1, 2))
plot(radon ~ county)
plot(log(radon) ~ county)

Fit a hierarchical model with a random effect on the intercept:

library(nlme)
## Warning: package 'nlme' was built under R version 3.4.4
rad.lme1 <- lme(y ~ x, random = ~1 | county)
summary(rad.lme1)
## Linear mixed-effects model fit by REML
##  Data: NULL 
##    AIC  BIC logLik
##   2179 2199  -1086
## 
## Random effects:
##  Formula: ~1 | county
##         (Intercept) Residual
## StdDev:       0.328    0.756
## 
## Fixed effects: y ~ x 
##              Value Std.Error  DF t-value p-value
## (Intercept)  1.462    0.0516 833   28.34       0
## x           -0.693    0.0704 833   -9.84       0
##  Correlation: 
##   (Intr)
## x -0.288
## 
## Standardized Within-Group Residuals:
##      Min       Q1      Med       Q3      Max 
## -4.39892 -0.61553  0.00291  0.64048  3.42808 
## 
## Number of Observations: 919
## Number of Groups: 85

In terms of the model

\[y_{ij}=\alpha+b_{j}+\beta x_{ij} +\epsilon_{ij}, \quad \epsilon_{ij}\sim N(0,\sigma_y^2), \quad b_j\sim N(0,\sigma_{\alpha}^2)\]

our estimates are \(\hat{\sigma}_{\alpha} =\) 0.328, \(\hat{\sigma}_{y} =\) 0.756, \(\hat{\alpha} =\) 1.462, \(\hat{\beta} =\) -0.693

and our “estimates” for the \(J=85\) random effects \(\hat{b}_j\) can be extracted using random.effects() or ranef()

head(ranef(rad.lme1))
##   (Intercept)
## 1     -0.2701
## 2     -0.5340
## 3      0.0176
## 4      0.0429
## 5     -0.0154
## 6      0.0186

The fitted values \(\hat{y}_{ij} = \hat{\alpha} + \hat{b}_j + \hat{\beta}x_{ij}\) can be obtained using fitted()

head(fitted(rad.lme1))
##     1     1     1     1     2     2 
## 0.499 1.192 1.192 1.192 0.928 0.928

The default fitting method is REML (as indicated in the output above). We can fit using maximum likelihood instead:

rad.lme2 <- lme(y ~ x, random = ~1 | county, method = "ML")

We can compare the model rad.lme2 which has random effects for each county, to the simpler model rad.lm which makes no allowance for county (i.e. it assumes all counties are the same, this is the pooled model):

rad.lm <- lm(y ~ x)
anova(rad.lme2, rad.lm)
##          Model df  AIC  BIC logLik   Test L.Ratio p-value
## rad.lme2     1  4 2172 2191  -1082                       
## rad.lm       2  3 2253 2267  -1124 1 vs 2    83.4  <.0001

Here the test from anova() is a likelihood ratio test: the model rad.lm is nested within rad.lme2, we are testing \(\sigma^2_\alpha=0\). The test favours the more general model, it rejects the hypothesis \(\sigma^2_\alpha=0\). It compares the likelihood ratio value to an assumed asymptotic \(\chi^2_1\) distribution (the \(\chi^2\) has 1 degree of freedom as the single parameter \(\sigma^2_\alpha\) is held at zero under the null hypothesis). The test is approximate, it requires several assumptions: as well as being asymptotic (i.e. requiring a large sample size), another important assumption is that the parameters under the null are not on the boundary of the parameter space – but here the value \(\sigma^2_\alpha=0\) is on the boundary of the set \([0, \infty)\) of possible \(\sigma^2_\alpha\) values! The boundary effect means that such tests tend to be conservative (Faraway, 2016), i.e. the approximate p-values tend to be larger than they should be, so if the test using the \(\chi^2\) approximation is significant then we can be fairly confident that it is actually significant. Faraway (2016, Chapter 10) assesses the significance of such tests using the parametric bootstrap – so we stop this discussion at this point as the bootstrap doesn’t appear until next term in the Computational Statistics course.

Not surprisingly, in a model with fixed effects for each county (plus a fixed effect for explanatory variable \(x\)), the effect of county is also significant:

rad.lm2 <- lm(y ~ x + as.factor(county))
anova(rad.lm2)
## Analysis of Variance Table
## 
## Response: y
##                    Df Sum Sq Mean Sq F value  Pr(>F)    
## x                   1     48    48.0   83.86 < 2e-16 ***
## as.factor(county)  84    144     1.7    2.99 1.5e-15 ***
## Residuals         833    477     0.6                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As for linear models we can look at diagnostic plots that use the residuals \(e = y - \hat{y}\), or standardised versions of \(e\):

# "p" below is short for "pearson" and gives standardised residuals,
# without type = "p" we get non-standardised residuals
qqnorm(rad.lme1, ~ resid(., type = "p"))

plot(rad.lme1, resid(., type = "p") ~ fitted(.))

Also to assess the assumed normality of the random effects:

qqnorm(rad.lme1, ~ ranef(., type = "p"))

For a little more on this example you could look at http://www.stats.ox.ac.uk/~nicholls/sb1a/L14-15.R