Model checking

A plot of the residuals $$e$$ against the fitted values $$\widehat{y}$$ allows us to check multiple model assumptions with a single plot. (We plot the points $$(\widehat{y}_i,e_i)$$ for $$i=1,\dots,n$$.)

• The residuals $$e$$ and the fitted values $$\widehat{y}$$ are independent under the model, so a plot of $$e$$ against $$\widehat{y}$$ should show no correlation.

• The variance of each $$y_i$$ is assumed to be $$\sigma^2$$, so this variance should not change with the value of $$y$$ (or with any explanatory $$x$$). So the $$e$$ versus $$\widehat{y}$$ plot should not show e.g. increasing variance as $$\widehat{y}$$ increases.

• The $$e$$ versus $$\widehat{y}$$ plot can also help detect if the relationship between $$y$$ and $$x_1,\dots,x_p$$ is not the linear function ($$\beta_1x_1+\dots+\beta_px_p$$) that we have assumed it to be.

Oxford house price data

ohp.lm3 <- lm(price ~ month * type, data = ohp3)
plot(resid(ohp.lm3) ~ fitted(ohp.lm3), data = ohp3, col = colour, pch = symbol,
xlab = "Fitted values", ylab = "Residuals")

The model ohp.lm3 can maybe be simplified (e.g. two of the “non-parallel lines” looked almost identical) but the residual plot above reveals an underlying problem: the residuals show non-constant variance and correlation with $$\widehat{y}$$, the variance appears to be increasing as $$\widehat{y}$$ increases.

Ignoring correlation often leads to incorrect (over-significant) p-values. This problem with the residuals will not be fixed by simplifying the model.

Simulated data

set.seed(123)
x <- runif(20, min = 0, max = 3)
eps <- rnorm(20)
y <- x + x^2 + eps
lmod <- lm(y ~ x)
plot(y ~ x)
abline(lmod)
curve(x + x^2, add = TRUE, lty = 2)

The data come from $$y = x + x^2 + \epsilon$$ where $$\epsilon\sim N(0, 1)$$. We fit a straight-line model $$y = \beta_0 + \beta_1 x$$ and look at the residuals. The Resdiuals versus Fitted values plot below reveals the curvature in the data.

plot(rstudent(lmod) ~ fitted(lmod), xlab = "Fitted values", ylab = "(Studentised) residuals")

Heights data

library(alr4)
set.seed(321)
(samp <- sample(1:nrow(Heights), size = 50))
##  [1] 1315 1288  328  350  536  468  620  397  617 1102  827  495 1045   62
## [15]  806  275  860  549  395  869  862 1341 1260  657  777 1015 1340  581
## [29]  168  800  284  955  908 1333  889  588  962   35  899  723 1096  369
## [43]  610 1152 1117  456  189 1203  298  235
H50 <- Heights[samp, ]

So H50 contains 50 cases, chosen at random, from the original Heights data.

h.lm <- lm(dheight ~ mheight, data = H50)
plot(dheight ~ mheight, data = H50)
abline(h.lm)

(Studentised) residuals versus Fitted values, and a Q-Q plot of (Studentised) residuals:

par(mfrow = c(1, 2))
plot(rstudent(h.lm) ~ fitted(h.lm), main = "(Studentised) residuals vs Fitted Values",
xlab = "Fitted values", ylab = "(Studentised) residuals")
qqnorm(resid(h.lm), main = "Q-Q plot of (Studentised) residuals")
qqline(resid(h.lm))

What do Q-Q plots typically look like?

set.seed(123)
par(mfrow = c(2, 2))
n <- 50
x <- rnorm(n); qqnorm(x, main = "Normal data"); qqline(x)
x <- exp(rnorm(n)); qqnorm(x, main = "Lognormal data (skewed)"); qqline(x)
x <- rcauchy(n); qqnorm(x, main = "Cauchy = t(1) data (long-tailed)"); qqline(x)
x <- runif(n); qqnorm(x, main = "Uniform data (short-tailed)"); qqline(x)

More examples

Oxford house prices:
par(mfrow = c(1, 2))
plot(rstandard(ohp.lm3) ~ fitted(ohp.lm3), data = ohp3, col = colour, pch = symbol,
xlab = "Fitted values", ylab = "Standardised residuals")
plot(rstudent(ohp.lm3) ~ fitted(ohp.lm3), data = ohp3, col = colour, pch = symbol,
xlab = "Fitted values", ylab = "Studentised residuals")

Pigs:
pigs <- data.frame(Gain = c(89, 78, 114, 79, 68, 59, 85, 61, 62, 61, 83, 82),
Litter = rep(c("I", "II", "III", "IV"), 3),
Diet = rep(c("A", "B", "C"), each = 4))
pigs.lm <- lm(Gain ~ Litter + Diet, data = pigs)
par(mfrow = c(1, 2))
plot(rstandard(pigs.lm) ~ fitted(pigs.lm), data = pigs, col = as.numeric(Litter), pch = as.numeric(Diet),
xlab = "Fitted values", ylab = "Standardised residuals")
plot(rstudent(pigs.lm) ~ fitted(pigs.lm), data = pigs, col = as.numeric(Litter), pch = as.numeric(Diet),
xlab = "Fitted values", ylab = "Studentised residuals")

Leverage

par(mfrow = c(1, 2))
plot(hatvalues(ohp.lm3), col = ohp3$colour, pch = ohp3$symbol,
main = "ohp", ylab = "Leverage")
plot(hatvalues(pigs.lm), col = as.numeric(pigs$Litter), pch = as.numeric(pigs$Diet),
main = "pigs", ylab = "Leverage")

Cook’s distance

par(mfrow = c(1, 2))
plot(cooks.distance(ohp.lm3) ~ month, data = ohp3, col = colour, pch = symbol,
main = "ohp", ylab = "Cook's distance")
n <- nrow(ohp3)
p <- ohp.lm3$rank abline(h = 8/(n - 2*p)) plot(cooks.distance(pigs.lm), col = as.numeric(pigs$Litter), pch = as.numeric(pigs\$Diet),
main = "pigs", ylab = "Cook's distance")