Weighted regression

Our example uses simulated data.

# similar to example in Chapter 1 of Venables and Ripley (2002)
options(digits = 3)
set.seed(357)

x <- seq(1, 20, by = 0.5)
sqrtw <- 1/x # std dev of y will be proportional to 1/sqrtw, i.e. proportional to x
y <- x + rnorm(length(x))/sqrtw
simdata <- data.frame(x, y, sqrtw)
rm(x, y, sqrtw)
head(simdata)
##     x      y sqrtw
## 1 1.0 -0.241 1.000
## 2 1.5  0.625 0.667
## 3 2.0  2.789 0.500
## 4 2.5  6.261 0.400
## 5 3.0  5.300 0.333
## 6 3.5  4.611 0.286
plot(y ~ x, data = simdata)
abline(0, 1) # true line
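
As a quick sanity check, we can compare the empirical spread of the noise y - x in the lower and upper halves of the x range; the second group should show a much larger standard deviation.

# empirical sd of the noise y - x for x <= 10 and x > 10
with(simdata, tapply(y - x, x > 10, sd))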

We fit both an unweighted regression and a weighted regression, and compare their studentised residuals.

fm <- lm(y ~ x, data = simdata)                      # unweighted fit, even though the variances are not equal
fm1 <- lm(y ~ x, data = simdata, weights = sqrtw^2)  # weighted fit: weights proportional to 1/Var(y)
par(mfrow=c(1, 2))
plot(rstudent(fm) ~ fitted(fm), main = "unweighted regression",
     xlab = "Fitted values", ylab = "Studentised residuals")
plot(rstudent(fm1) ~ fitted(fm1), main = "weighted regression",
     xlab = "Fitted values", ylab = "Studentised residuals")

Above left: in the unweighted regression the studentised residuals clearly do not have constant variance; their spread increases with the fitted values.

Above right: the weighted regression fixes the problem.
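
A built-in alternative is the scale-location plot produced by plot() on an lm object (which = 3 selects that panel), which shows the square root of the absolute standardised residuals against the fitted values.

# scale-location plots as an alternative diagnostic for non-constant variance
par(mfrow = c(1, 2))
plot(fm,  which = 3, main = "unweighted regression")
plot(fm1, which = 3, main = "weighted regression")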

The weighted regression works in the transformed coordinates \(y' = W^{1/2}y\) and \(X' = W^{1/2}X\). Its fitted values are \(\widehat{y}' = X'\widehat{\beta} = W^{1/2}X\widehat{\beta}\), so the weighted residuals \(e' = y'-\widehat{y}' = W^{1/2}(y-X\widehat{\beta})\) have variance matrix \(\sigma^2(I-H)\) with \(H = W^{1/2} X(X^TWX)^{-1} X^TW^{1/2}\).

When we fit a weighted regression, the single estimate \(\widehat\beta\) determines the fitted values in both the weighted and the unweighted coordinate systems. R returns the fitted values \(\widehat{y} = X\widehat\beta\) and residuals \(e = y-X\widehat\beta\) in the unweighted coordinates rather than \(\widehat{y}'\) and \(e'\). However, the standardised and studentised residuals are the same in the two coordinate systems, so we plot the studentised residuals \(r'(e)\) against the fitted values \(\widehat y\): this has the same diagnostic properties (independence and unit variance) as plotting \(r'(e')\) against \(\widehat y'\).
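
To make the formulas concrete, we can recompute the studentised residuals of the weighted fit by hand from \(e'\) and \(H\) and compare them with rstudent(fm1); a minimal sketch using the objects fitted above:

# recompute studentised residuals for the weighted fit from e' and H
X  <- model.matrix(fm1)
w  <- weights(fm1)                           # prior weights, here sqrtw^2
Xw <- sqrt(w) * X                            # W^{1/2} X
H  <- Xw %*% solve(crossprod(Xw)) %*% t(Xw)  # W^{1/2} X (X^T W X)^{-1} X^T W^{1/2}
h  <- diag(H)
ew <- sqrt(w) * residuals(fm1)               # weighted residuals e'
n  <- nrow(X); p <- ncol(X)
s2 <- (sum(ew^2) - ew^2/(1 - h)) / (n - p - 1)  # leave-one-out estimates of sigma^2
all.equal(ew / sqrt(s2 * (1 - h)), rstudent(fm1), check.attributes = FALSE)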

par(mfrow=c(1, 2))
qqnorm(rstudent(fm), main = "Q-Q Plot: unweighted regression")
qqline(rstudent(fm))
qqnorm(rstudent(fm1), main = "Q-Q Plot: weighted regression")
qqline(rstudent(fm1))

The heteroscedasticity (i.e. non-constant variance) appears to show up as non-normality in the left-hand Q-Q plot.
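
As a rough numerical complement, we could run a Shapiro-Wilk test on the studentised residuals of each fit; small p-values indicate departure from normality.

# Shapiro-Wilk test of normality for the studentised residuals of each fit
c(unweighted = shapiro.test(rstudent(fm))$p.value,
  weighted   = shapiro.test(rstudent(fm1))$p.value)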

The weighted regression line (the green line below) should usually be the better estimate.

plot(y ~ x, data = simdata)
abline(0, 1) # true line, black
abline(fm, col = 2) # unweighted regression line, red
abline(fm1, col = 3) # weighted regression line, green
legend("topleft", c("true", "unweighted", "weighted"), lty = 1, col = 1:3)