Pig diet data

Twelve piglets were fed one of three different diets (A, B and C). Four litters (I, II, III and IV), each of 3 piglets, were used with each diet being allocated at random to one piglet in each litter. (Lunn, 2007.)

The question of interest: is there evidence that the diets differ?

The litters are the blocks.

The diets are the treatments.

And because the three diets are randomly allocated within each block, the experiment is referred to as a randomised block design. The design is complete and balanced (every treatment occurs at least once, an equal number of times in fact, in every block).

pigs <- data.frame(Gain = c(89, 78, 114, 79, 68, 59, 85, 61, 62, 61, 83, 82),
                   Litter = rep(c("I", "II", "III", "IV"), 3),
                   Diet = rep(c("A", "B", "C"), each = 4))
pigs
##    Gain Litter Diet
## 1    89      I    A
## 2    78     II    A
## 3   114    III    A
## 4    79     IV    A
## 5    68      I    B
## 6    59     II    B
## 7    85    III    B
## 8    61     IV    B
## 9    62      I    C
## 10   61     II    C
## 11   83    III    C
## 12   82     IV    C
str(pigs)
## 'data.frame':    12 obs. of  3 variables:
##  $ Gain  : num  89 78 114 79 68 59 85 61 62 61 ...
##  $ Litter: Factor w/ 4 levels "I","II","III",..: 1 2 3 4 1 2 3 4 1 2 ...
##  $ Diet  : Factor w/ 3 levels "A","B","C": 1 1 1 1 2 2 2 2 3 3 ...

Note that Litter I is the baseline litter, and Diet A is the baseline diet.

plot(Gain ~ as.numeric((Diet)), data = pigs, xaxt = "n", xlab = "Diet",
     col = as.numeric(Litter), pch = as.numeric(Diet))
axis(1, at = c(1, 2, 3), label = c("A", "B", "C"))
legend("topright", c("Litter I", "Litter II", "Litter III", "Litter IV"),
       col = as.numeric(pigs$Litter), pch = as.numeric(pigs$Diet))

Suppose we index observations by \((i, j)\) where \(i\) is the litter (\(i=1,\dots,4\)) and \(j\) is the diet (\(j=1,2,3\)). Then the model with

is:

\[y_{ij} = \alpha + \gamma_i + \tau_j + \epsilon_{ij}\]

for \(i=1,\dots,4\) and \(j=1,2,3\). We are looking for an effect (a mean shift in weight gain) due to diet, and we want to allow for variation (another mean shift) due to litter.

pigs.lm <- lm(Gain ~ Litter + Diet, data = pigs)
(X <- model.matrix(Gain ~ Litter + Diet, data = pigs))
##    (Intercept) LitterII LitterIII LitterIV DietB DietC
## 1            1        0         0        0     0     0
## 2            1        1         0        0     0     0
## 3            1        0         1        0     0     0
## 4            1        0         0        1     0     0
## 5            1        0         0        0     1     0
## 6            1        1         0        0     1     0
## 7            1        0         1        0     1     0
## 8            1        0         0        1     1     0
## 9            1        0         0        0     0     1
## 10           1        1         0        0     0     1
## 11           1        0         1        0     0     1
## 12           1        0         0        1     0     1
## attr(,"assign")
## [1] 0 1 1 1 2 2
## attr(,"contrasts")
## attr(,"contrasts")$Litter
## [1] "contr.treatment"
## 
## attr(,"contrasts")$Diet
## [1] "contr.treatment"
options(digits = 3)
summary(pigs.lm)
## 
## Call:
## lm(formula = Gain ~ Litter + Diet, data = pigs)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.250 -4.937 -0.375  2.938 12.750 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    86.25       5.76   14.97  5.6e-06 ***
## LitterII       -7.00       6.65   -1.05   0.3333    
## LitterIII      21.00       6.65    3.16   0.0197 *  
## LitterIV        1.00       6.65    0.15   0.8855    
## DietB         -21.75       5.76   -3.77   0.0092 ** 
## DietC         -18.00       5.76   -3.12   0.0205 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.15 on 6 degrees of freedom
## Multiple R-squared:  0.857,  Adjusted R-squared:  0.738 
## F-statistic: 7.18 on 5 and 6 DF,  p-value: 0.0162

It is clear from the summary that Litter I and Diet A have been taken as the baseline levels for the two categorical variables.

Is Diet predictive for Gain? For this we want to test \(\tau_2 = \tau_3 = 0\).

anova(pigs.lm)
## Analysis of Variance Table
## 
## Response: Gain
##           Df Sum Sq Mean Sq F value Pr(>F)  
## Litter     3   1304     435    6.55  0.025 *
## Diet       2   1082     541    8.14  0.020 *
## Residuals  6    399      66                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So, yes, diet is predictive at 5% significance.

The above analysis is a two-way analysis of variance (two-way ANOVA, there are two factors).

The following shows the orthogonality of this balanced design:

# %*% is matrix multiplication
# t(X) %*% X  is  X^T X
solve(t(X) %*% X) # the inverse of X^T X 
##             (Intercept) LitterII LitterIII LitterIV DietB DietC
## (Intercept)       0.500   -0.333    -0.333   -0.333 -0.25 -0.25
## LitterII         -0.333    0.667     0.333    0.333  0.00  0.00
## LitterIII        -0.333    0.333     0.667    0.333  0.00  0.00
## LitterIV         -0.333    0.333     0.333    0.667  0.00  0.00
## DietB            -0.250    0.000     0.000    0.000  0.50  0.25
## DietC            -0.250    0.000     0.000    0.000  0.25  0.50

Note the zeros in the DietX-LitterY entries of this matrix. Recall that \(\text{var}(\widehat\beta) = \sigma^2 (X^T X)^{-1}\), so the zeros show that \((\widehat\gamma_2, \widehat\gamma_3, \widehat\gamma_4)\) and \((\widehat\tau_2, \widehat\tau_3)\) are independent – blocks and treatments are orthogonal.

A consequence of this orthogonality is that the following two ANOVAs are essentially the same even though the sequence of models fitted is different – the first two rows are just swapped, because of orthogonality.

anova(lm(Gain ~ Litter + Diet, data = pigs)) # the same as done above
## Analysis of Variance Table
## 
## Response: Gain
##           Df Sum Sq Mean Sq F value Pr(>F)  
## Litter     3   1304     435    6.55  0.025 *
## Diet       2   1082     541    8.14  0.020 *
## Residuals  6    399      66                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(Gain ~ Diet + Litter, data = pigs)) # Diet and Litter now in the opposite order
## Analysis of Variance Table
## 
## Response: Gain
##           Df Sum Sq Mean Sq F value Pr(>F)  
## Diet       2   1082     541    8.14  0.020 *
## Litter     3   1304     435    6.55  0.025 *
## Residuals  6    399      66                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Treatment effects are separated from block effects in this orthogonal/complete design. The treatment effects (the Sum Sq for treatments) are not changed by having the block effects in the model. But the block effects do contribute to the estimate \(s^2\) of the error variance.

Ignoring Litter information

Suppose we had no records of which piglets were from which litter. Or suppose we had the litter information but we ignored it.

anova(lm(Gain ~ Diet, data = pigs))
## Analysis of Variance Table
## 
## Response: Gain
##           Df Sum Sq Mean Sq F value Pr(>F)
## Diet       2   1082     541    2.86   0.11
## Residuals  9   1703     189

This is fitting the model:

\[y_{ij} = \alpha + \tau_j + \epsilon_{ij}\] with \(\tau_1=0\). This is a one-way analysis of variance (there is one factor, just the diet variable).

In this analysis, diet is not significant. This is the “wrong analysis” in this case – Litter has been omitted when it should have been included. The estimate of \(\sigma^2\) is much higher in this case (189, was 66 previously). This feeds into the denominator of the F value, the resulting F value is small and not significant. We have “forgotten” to allow for a blocking effect (the different Litters) and this has made a critical difference.