# Weighted regression

Our example uses simulated data.

```r
# similar to the example in Chapter 1 of Venables and Ripley (2002)
options(digits = 3)
set.seed(357)

x <- seq(1, 20, by = 0.5)
sqrtw <- 1/x  # std dev of y will be proportional to 1/sqrtw, i.e. proportional to x
y <- x + rnorm(x)/sqrtw
simdata <- data.frame(x, y, sqrtw)
rm(x, y, sqrtw)
head(simdata)
##     x      y sqrtw
## 1 1.0 -0.241 1.000
## 2 1.5  0.625 0.667
## 3 2.0  2.789 0.500
## 4 2.5  6.261 0.400
## 5 3.0  5.300 0.333
## 6 3.5  4.611 0.286
plot(y ~ x, data = simdata)
abline(0, 1)  # true line
```

For comparison, we fit both an unweighted regression and a weighted regression, taking the weights proportional to the reciprocal of the variance of each observation.

```r
fm <- lm(y ~ x, data = simdata)  # unweighted fit, even though the variances are not equal
fm1 <- lm(y ~ x, data = simdata, weights = sqrtw^2)  # weighted fit: weights proportional to 1/Var(y)
par(mfrow = c(1, 2))
plot(rstudent(fm) ~ fitted(fm), main = "unweighted regression",
     xlab = "Fitted values", ylab = "Studentised residuals")
plot(rstudent(fm1) ~ fitted(fm1), main = "weighted regression",
     xlab = "Fitted values", ylab = "Studentised residuals")
```

Above left: in the unweighted regression the studentised residuals clearly do not have constant variance; the variance increases with the fitted value.

Above right: the weighted regression fixes the problem.

The weighted regression has fitted values $$\widehat{y}' = X'\widehat{\beta} = W^{1/2}X\widehat{\beta},$$ so the weighted residuals $$e' = y'-\widehat{y}' = W^{1/2}(y-X\widehat{\beta})$$ have variance matrix $$\sigma^2(I-H)$$ with $$H = W^{1/2} X(X^TWX)^{-1} X^TW^{1/2}.$$ When we fit a weighted regression, the same $$\widehat\beta$$ determines the fitted values in both the weighted and unweighted coordinate systems. R returns the fitted values $$\widehat{y} = X\widehat\beta$$ and residuals $$e = y-X\widehat\beta$$ in the unweighted coordinates rather than $$\widehat{y}'$$ and $$e'$$. However, the standardised and studentised residuals are the same in the two coordinate systems, so we plot the studentised residuals $$r'(e)$$ against the fitted values $$\widehat y$$: this has the same diagnostic properties (independence and unit variance under the model) as plotting $$r'(e')$$ against $$\widehat y'$$.
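As a numerical check of the claim that the residual diagnostics agree in the two coordinate systems, the following sketch recomputes the standardised residuals directly from the formulas above and compares them with R's `rstandard` (the studentised version differs only in using a leave-one-out estimate of $$\sigma$$). The model and data are redefined inside the chunk so that it runs on its own:

```r
# rebuild the simulated data and weighted fit from above
set.seed(357)
x <- seq(1, 20, by = 0.5)
w <- 1/x^2                                # the weights sqrtw^2 = 1/x^2
y <- x + rnorm(x) * x                     # same as rnorm(x)/sqrtw with sqrtw = 1/x
fm1 <- lm(y ~ x, weights = w)

X  <- model.matrix(fm1)
Xw <- sqrt(w) * X                         # W^{1/2} X
h  <- diag(Xw %*% solve(crossprod(Xw)) %*% t(Xw))  # leverages from H above
ew <- sqrt(w) * residuals(fm1)            # e' = W^{1/2}(y - X beta-hat)
s2 <- sum(ew^2) / fm1$df.residual         # sigma-hat^2
r  <- ew / sqrt(s2 * (1 - h))             # standardised residuals by hand
all.equal(unname(r), unname(rstandard(fm1)))  # TRUE
```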

```r
par(mfrow = c(1, 2))
qqnorm(rstudent(fm), main = "Q-Q Plot: unweighted regression")
qqline(rstudent(fm))
qqnorm(rstudent(fm1), main = "Q-Q Plot: weighted regression")
qqline(rstudent(fm1))
```

The heteroscedasticity (i.e. non-constant variance) shows up as apparent non-normality in the left-hand Q-Q plot.
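Why would unequal variances masquerade as non-normality? Pooling residuals drawn at different scales gives a scale mixture of normals, which has heavier tails than any single normal distribution. The following quick sketch (illustrative only, not part of the analysis above) draws a large sample from the same kind of scale mixture and checks its kurtosis against the normal value of 3:

```r
# scale mixture of normals: sd cycles through the same range as in our data
set.seed(1)
sds <- seq(1, 20, by = 0.5)
z <- rnorm(1e4 * length(sds)) * sds   # sds recycles over the normal draws
kurt <- mean(z^4) / mean(z^2)^2       # sample kurtosis
kurt                                  # noticeably larger than 3 (heavy tails)
```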

The weighted regression line (the green line below) should usually be the better estimate: it downweights the noisy observations at large x rather than letting them dominate the fit.

```r
plot(y ~ x, data = simdata)
abline(0, 1)           # true line, black
abline(fm, col = 2)    # unweighted regression line, red
abline(fm1, col = 3)   # weighted regression line, green
legend("topleft", c("true", "unweighted", "weighted"), lty = 1, col = 1:3)
```
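To back up the claim that the weighted fit is usually the better estimate, here is a small follow-up simulation (a sketch; the seed and replication count are arbitrary choices) comparing the root-mean-square error of the two slope estimates over repeated draws from the same data-generating process:

```r
# repeat the simulation many times; true slope is 1
set.seed(42)                        # assumed seed, for reproducibility only
err <- replicate(500, {
  x <- seq(1, 20, by = 0.5)
  y <- x + rnorm(length(x)) * x     # sd of y proportional to x, as above
  c(coef(lm(y ~ x))[2] - 1,                    # unweighted slope error
    coef(lm(y ~ x, weights = 1/x^2))[2] - 1)   # weighted slope error
})
rmse <- sqrt(rowMeans(err^2))
names(rmse) <- c("unweighted", "weighted")
rmse  # the weighted estimator typically has the smaller RMSE
```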