# Box-Cox transformation

The putting data below record the fraction of successful putts as a function of distance in feet. Gelman and Nolan (2001) model these data; see http://www.stat.columbia.edu/~gelman/research/published/golf.pdf

putts <- data.frame(Dist = 2:20,
                    Prop = c(0.93, 0.83, 0.74, 0.59, 0.55, 0.53, 0.46, 0.32, 0.34, 0.32,
                             0.26, 0.24, 0.31, 0.17, 0.13, 0.16, 0.17, 0.14, 0.16))
head(putts)
##   Dist Prop
## 1    2 0.93
## 2    3 0.83
## 3    4 0.74
## 4    5 0.59
## 5    6 0.55
## 6    7 0.53
plot(Prop ~ Dist, data = putts)

We often transform proportion data as $$\log(p/(1-p))$$, since this (the log-odds) is the canonical link function for a Bernoulli random variable (see GLMs). It is a monotone map from $$p\in (0,1)$$ to $$(-\infty,\infty)$$. In this case the odds of failure $$(1-p)/p$$ are the natural object, since they increase with distance. Taking the log turns out to be the wrong transformation to get linear dependence on distance. Using Box-Cox we find a transformation of the response that is linear in distance.
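For reference, the Box-Cox family of power transformations of a positive response $$y$$, indexed by $$\lambda$$, is usually written as

$y^{(\lambda)} = \begin{cases} (y^{\lambda} - 1)/\lambda, & \lambda \neq 0 \\ \log y, & \lambda = 0 \end{cases}$

so $$\lambda = 1$$ leaves the response unchanged up to a shift, $$\lambda = 0.5$$ corresponds to a square root, and $$\lambda = 0$$ to the log. The symbol $$y^{(\lambda)}$$ is notation introduced here for illustration.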

putts$y <- (1 - putts$Prop)/putts$Prop
par(mfrow = c(1, 2))
plot(y ~ Dist, data = putts)
plot(log(y) ~ Dist, data = putts)

We can estimate the best value of $$\lambda$$ by maximising a likelihood using the boxcox() function in the MASS package:

options(digits = 4)
library(MASS)
putts.bc <- boxcox(y ~ Dist, data = putts)
putts.bc$y[60:65]
## [1] -3.842 -3.661 -3.658 -3.820 -4.132 -4.573
putts.bc$x[60:65]
## [1] 0.3838 0.4242 0.4646 0.5051 0.5455 0.5859

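Here putts.bc$x holds the grid of $$\lambda$$ values and putts.bc$y the corresponding log-likelihood. From the output above we see that the MLE is at around $$\lambda = 0.46$$, but the confidence interval drawn on the boxcox() plot covers $$\lambda = 0.5$$, which is easier to interpret.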
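The MLE and the interval can also be extracted numerically rather than read off the plot. The following is a minimal sketch, assuming an explicit grid of 100 points on [-2, 2] and the usual likelihood-ratio cutoff of `qchisq(0.95, 1)/2` below the maximum; the names `bc`, `lambda.hat`, `cutoff` and `lambda.ci` are introduced here for illustration. Because the grid is evaluated directly rather than interpolated, the numbers may differ slightly from the output above.

```r
library(MASS)

# Putting data as given above: distance in feet and proportion holed.
putts <- data.frame(Dist = 2:20,
                    Prop = c(0.93, 0.83, 0.74, 0.59, 0.55, 0.53, 0.46, 0.32, 0.34, 0.32,
                             0.26, 0.24, 0.31, 0.17, 0.13, 0.16, 0.17, 0.14, 0.16))
putts$y <- (1 - putts$Prop) / putts$Prop  # odds of failure

# Profile log-likelihood on an explicit grid (an assumption of this sketch);
# plotit = FALSE suppresses the plot.
bc <- boxcox(y ~ Dist, data = putts,
             lambda = seq(-2, 2, length.out = 100), plotit = FALSE)

# Grid value of lambda that maximises the log-likelihood.
lambda.hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: all lambda whose log-likelihood lies within
# qchisq(0.95, 1)/2 of the maximum.
cutoff <- max(bc$y) - qchisq(0.95, df = 1) / 2
lambda.ci <- range(bc$x[bc$y > cutoff])

lambda.hat
lambda.ci
```

If `lambda.ci` contains 0.5, the square-root transformation is compatible with the data at this level.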

We therefore transform the data using $$\lambda = 0.5$$ and fit the model

$\sqrt{y} = \beta_1 + \beta_2 x + \epsilon$

where $$y = (1-p)/p$$ as above and $$x$$ is distance.

options(digits = 3)
putts.lm <- lm(sqrt(y) ~ Dist, data = putts)
confint(putts.lm)
##               2.5 % 97.5 %
## (Intercept) -0.0637  0.351
## Dist         0.1061  0.140
plot(sqrt(y) ~ Dist, data = putts)
abline(putts.lm)

Note that $$\beta_1$$ is not significantly different from zero: its confidence interval includes zero. Enforcing $$\beta_1=0$$ is also natural on physical grounds, as the odds of failure should go to zero for very short putts. With $$\beta_1=0$$ the model reads $$\sqrt{y} = \beta_2 x$$, that is $$y = \beta_2^2 x^2$$, so we conclude that the odds of putt failure increase as the square of the distance. (The re-fitted model with $$\beta_1=0$$ enforced is not shown here.)
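One way such a re-fit could be carried out is sketched below, assuming a regression through the origin specified with `0 + Dist` in the model formula. The names `putts.lm0`, `b` and `fit` are hypothetical, introduced only for this illustration. Back-transforming $$y = (1-p)/p = (\beta_2 x)^2$$ gives the fitted success probability $$p = 1/(1 + \beta_2^2 x^2)$$, which the sketch overlays on the observed proportions.

```r
# Putting data as given above, with the odds of failure y.
putts <- data.frame(Dist = 2:20,
                    Prop = c(0.93, 0.83, 0.74, 0.59, 0.55, 0.53, 0.46, 0.32, 0.34, 0.32,
                             0.26, 0.24, 0.31, 0.17, 0.13, 0.16, 0.17, 0.14, 0.16))
putts$y <- (1 - putts$Prop) / putts$Prop

# Regression through the origin: "0 +" removes the intercept, enforcing beta_1 = 0.
putts.lm0 <- lm(sqrt(y) ~ 0 + Dist, data = putts)
b <- coef(putts.lm0)[["Dist"]]  # estimated slope beta_2

# Back-transform to the probability scale: p = 1 / (1 + (beta_2 * x)^2).
putts$fit <- 1 / (1 + (b * putts$Dist)^2)

# Compare fitted and observed success proportions.
plot(Prop ~ Dist, data = putts)
lines(putts$Dist, putts$fit)
```

Calling `confint(putts.lm0)` gives an interval for the slope alone, which can be compared with the interval for `Dist` in the two-parameter fit.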