Box-Cox transformation

The putting data below record the fraction of successful putts as a function of distance in feet. Gelman and Nolan (2001) model these data; see http://www.stat.columbia.edu/~gelman/research/published/golf.pdf

putts <- data.frame(Dist = 2:20,
                    Prop = c(0.93, 0.83, 0.74, 0.59, 0.55, 0.53, 0.46, 0.32, 0.34, 0.32,
                             0.26, 0.24, 0.31, 0.17, 0.13, 0.16, 0.17, 0.14, 0.16))
head(putts)
##   Dist Prop
## 1    2 0.93
## 2    3 0.83
## 3    4 0.74
## 4    5 0.59
## 5    6 0.55
## 6    7 0.53
plot(Prop ~ Dist, data = putts)

We often transform proportion data as \(\log(p/(1-p))\), since this (the log-odds) is the canonical link function for a Bernoulli random variable (see GLMs). It is a monotone map from \(p\in (0,1)\) to \((-\infty,\infty)\). In this case the odds of failure \((1-p)/p\) are the natural object, since they increase with distance. However, taking a log turns out to be the wrong transformation to get a linear dependence on distance. Using Box-Cox we can find a transformation of the response that is linear in distance.
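
For reference, the standard Box-Cox family of power transformations of a positive response \(y\), indexed by a parameter \(\lambda\), is usually written as

\[y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\ \log y, & \lambda = 0. \end{cases}\]

So \(\lambda = 0\) corresponds to the log, \(\lambda = 1\) leaves the response unchanged up to a shift, and \(\lambda = 0.5\) is a shifted and scaled square root. The family requires \(y > 0\), which holds here because the odds of failure are strictly positive.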

putts$y <- (1 - putts$Prop)/putts$Prop
par(mfrow = c(1, 2))
plot(y ~ Dist, data = putts)
plot(log(y) ~ Dist, data = putts)

We can estimate the best value of \(\lambda\) by maximising a likelihood using the boxcox() function in the MASS package:

options(digits = 4)
library(MASS)
putts.bc <- boxcox(y ~ Dist, data = putts)

putts.bc$y[60:65]
## [1] -3.842 -3.661 -3.658 -3.820 -4.132 -4.573
putts.bc$x[60:65]
## [1] 0.3838 0.4242 0.4646 0.5051 0.5455 0.5859

From the output above, the log-likelihood peaks between the grid points 0.42 and 0.46, so the MLE of \(\lambda\) is roughly 0.45. The confidence interval covers \(\lambda = 0.5\) (a square root), which is easier to interpret.
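
Rather than reading the values off the printed grid, the maximiser and an approximate 95% interval can be extracted directly from the object returned by boxcox(). The following is a minimal sketch, not part of the original analysis. The names bc, lambda_hat, cutoff and lambda_ci are illustrative, and the finer grid of \(\lambda\) values is an assumption made here. The interval uses the usual likelihood-ratio cutoff, i.e. all \(\lambda\) whose log-likelihood lies within qchisq(0.95, 1)/2 of the maximum.

```r
library(MASS)

# Data copied from the putts data frame defined earlier in this section
putts <- data.frame(Dist = 2:20,
                    Prop = c(0.93, 0.83, 0.74, 0.59, 0.55, 0.53, 0.46, 0.32, 0.34, 0.32,
                             0.26, 0.24, 0.31, 0.17, 0.13, 0.16, 0.17, 0.14, 0.16))
putts$y <- (1 - putts$Prop) / putts$Prop   # odds of failure

# Evaluate the profile log-likelihood on a fine grid (assumed grid, step 0.01);
# plotit = FALSE returns the curve without drawing it
bc <- boxcox(y ~ Dist, data = putts,
             lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

# Grid value of lambda with the largest log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas within qchisq(0.95, 1)/2 of the maximum
cutoff <- max(bc$y) - qchisq(0.95, df = 1) / 2
lambda_ci <- range(bc$x[bc$y >= cutoff])

lambda_hat
lambda_ci
```

If lambda_ci contains 0.5, rounding the estimate to the square root is consistent with the data at this level.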

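We therefore transform the response using \(\lambda = 0.5\). Since the Box-Cox transform with \(\lambda = 0.5\) is just a shifted and scaled square root, this amounts to fitting the model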

\[\sqrt{y} = \beta_1 + \beta_2 x + \epsilon\]

where \(y = (1-p)/p\) as above and \(x\) is distance.
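
It can help to see what this model says on the original probability scale. Ignoring the error term and solving \(y = (1-p)/p\) for \(p\) gives \(p = 1/(1+y)\), so the fitted relationship between success probability and distance is

\[y = (\beta_1 + \beta_2 x)^2 \quad\Longrightarrow\quad p = \frac{1}{1 + (\beta_1 + \beta_2 x)^2}.\]

This is only a back-transformation of the mean relationship above, not an additional modelling assumption.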

options(digits = 3)
putts.lm <- lm(sqrt(y) ~ Dist, data = putts)
confint(putts.lm)
##               2.5 % 97.5 %
## (Intercept) -0.0637  0.351
## Dist         0.1061  0.140
plot(sqrt(y) ~ Dist, data = putts)
abline(putts.lm)

Note that \(\beta_1\) is not significant: the 95% confidence interval for \(\beta_1\) includes zero. Enforcing \(\beta_1=0\) is also natural on physical grounds, as the odds of failure should go to zero for very short putts. With \(\beta_1 = 0\) the model reads \(\sqrt{y} = \beta_2 x\), that is, \(y = \beta_2^2 x^2\), so we conclude that the odds of putt failure increase as the square of the distance. (The re-fitted model with \(\beta_1=0\) enforced is not shown here.)
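
For readers who want to try the refit themselves, here is a minimal sketch of how it might be done. It is not the fitted model from the original analysis. The names putts.lm0, b2 and fit are illustrative. The formula term 0 + Dist removes the intercept, and the last lines back-transform to the implied success probability \(1/(1 + \beta_2^2 x^2)\).

```r
# Data copied from the putts data frame defined earlier in this section
putts <- data.frame(Dist = 2:20,
                    Prop = c(0.93, 0.83, 0.74, 0.59, 0.55, 0.53, 0.46, 0.32, 0.34, 0.32,
                             0.26, 0.24, 0.31, 0.17, 0.13, 0.16, 0.17, 0.14, 0.16))
putts$y <- (1 - putts$Prop) / putts$Prop   # odds of failure

# Regression through the origin: "0 + Dist" enforces beta_1 = 0
putts.lm0 <- lm(sqrt(y) ~ 0 + Dist, data = putts)
b2 <- coef(putts.lm0)[["Dist"]]

# Implied success probability on the original scale: p = 1 / (1 + (b2 * x)^2)
putts$fit <- 1 / (1 + (b2 * putts$Dist)^2)

# Compare observed proportions (points) with the fitted curve (line)
plot(Prop ~ Dist, data = putts)
lines(fit ~ Dist, data = putts)
```

Calling confint(putts.lm0) then gives an interval for the single remaining coefficient \(\beta_2\).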