Trees data

Can we predict the volume of wood in a tree based on its height and girth? (Girth = diameter, see ?trees).

str(trees)
## 'data.frame':    31 obs. of  3 variables:
##  $ Girth : num  8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
##  $ Height: num  70 65 63 72 81 83 66 75 80 75 ...
##  $ Volume: num  10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
head(trees)
##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2
## 4  10.5     72   16.4
## 5  10.7     81   18.8
## 6  10.8     83   19.7
pairs(trees, upper.panel = NULL)

Let \(v\) denote volume, \(h\) height, and \(g\) girth. We could consider

\[v = \beta_1 + \beta_2 h + \beta_3 g + \epsilon\]

but this does not work well for small trees, it predicts a negative volume when \(h = g = 0\) (the estimate of \(\beta_1\) is negative, and \(\beta_1 = 0\) is firmly rejected). Neither does this model allow variances to be larger for larger trees – a feature that seems apparent in the plot above.

Physical considerations (a.k.a. common sense) are always important: the volume of a cylinder satisfies \(v\propto hg^2\). Instead of \(hg^2\) we will start with \(h^{1+\beta_2} g^{2+\beta_3}\). We need to multiply this by a constant of proportionality (of \(e^{\beta_1}\) say) and also need to allow for error in our model. We will assume a multiplicative error, a random variable, of \(\gamma\). Thus we arrive at

\[v = e^{\beta_1} h^{1+\beta_2} g^{2+\beta_3} \gamma\]

where \(\beta_1,\beta_2,\beta_3\) are parmeters and \(\gamma\) varies around 1 (i.e. \(E(\gamma)=1\), because if \(\gamma\) didn’t satisfy this we could adjust \(\beta_1\) to achieve it). The idea of using a multiplicative error \(\gamma\) is that the volume of larger trees tends to vary more than the volume of smaller trees.

Taking logs,

\[y = \beta_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\]

where \(y=\log\left(\frac{v}{hg^2}\right)\), \(x_2=\log h\), \(x_3=\log g\) and \(\epsilon=\log\gamma\).

options(digits = 3)
trees.lm1 <- lm(log(Volume/(Height*Girth^2)) ~ log(Height) + log(Girth), data = trees)
summary(trees.lm1)
## 
## Call:
## lm(formula = log(Volume/(Height * Girth^2)) ~ log(Height) + log(Girth), 
##     data = trees)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.16856 -0.04849  0.00243  0.06364  0.12922 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6.6316     0.7998   -8.29  5.1e-09 ***
## log(Height)   0.1171     0.2044    0.57     0.57    
## log(Girth)   -0.0174     0.0750   -0.23     0.82    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0814 on 28 degrees of freedom
## Multiple R-squared:  0.0118, Adjusted R-squared:  -0.0587 
## F-statistic: 0.168 on 2 and 28 DF,  p-value: 0.846

Later we will look at checks of fit. For now we consider two plots.

qqnorm(resid(trees.lm1))
qqline(resid(trees.lm1))

The above Q-Q plot of the residuals of trees.lm1 fit is well-behaved, except possibly in the upper tail.

plot(resid(trees.lm1) ~ fitted(trees.lm1), main = "Residuals vs Fitted values",
     xlab = "Fitted values", ylab = "Residuals")

Above there is no sign of correlation between residuals and fitted values (though there is a possible funnel shape, increasing variance with fitted value).

A simpler model is:

trees.lm0 <- lm(log(Volume/(Height*Girth^2)) ~ 1, data = trees)

First we carry out an F-test of \(\beta_2 = \beta_3 = 0\) “by hand”:

(rss0 <- sum(resid(trees.lm0)^2))
## [1] 0.188
(rss1 <- sum(resid(trees.lm1)^2))
## [1] 0.185
k <- 2
n.minus.p <- 31 - 3
(F <- ((rss0 - rss1)/k) / (rss1/n.minus.p))
## [1] 0.168
(p <- 1 - pf(F, df1 = k, df2 = n.minus.p))
## [1] 0.846

The p-value of \(p =\) 0.846 is not significant, so the test supports the hypothesis that \(\beta_2 = \beta_3 = 0\), i.e. there is no evidence for dependence of \(y\) on \(x_2\) and \(x_3\).

Alternatively we can do the test using anova():

anova(trees.lm1)
## Analysis of Variance Table
## 
## Response: log(Volume/(Height * Girth^2))
##             Df Sum Sq Mean Sq F value Pr(>F)
## log(Height)  1 0.0019 0.00187    0.28   0.60
## log(Girth)   1 0.0004 0.00035    0.05   0.82
## Residuals   28 0.1855 0.00662

The Sum Sq column in the above table corresponds to “Reduction in RSS” in the notes. Now see lectures for how to use the above table to do the test.

Alternatively we can tell R which specific models we want to compare, and get R to form a table:

anova(trees.lm0, trees.lm1)
## Analysis of Variance Table
## 
## Model 1: log(Volume/(Height * Girth^2)) ~ 1
## Model 2: log(Volume/(Height * Girth^2)) ~ log(Height) + log(Girth)
##   Res.Df   RSS Df Sum of Sq    F Pr(>F)
## 1     30 0.188                         
## 2     28 0.185  2   0.00222 0.17   0.85

Note that the F-value of 0.168, and its p-value of 0.846, also appear in the last line of the output of summary(trees.lm1) above.