Swiss data

# Swiss fertility data used to illustrate outlier analysis
options(digits = 3)
head(swiss)
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6

The swiss dataset (see ?swiss for details) has n = 47 observations of fertility (from about 1888) together with 5 potentially explanatory variables: Agriculture, Examination, Education, Catholic, and Infant.Mortality.

Note that each row (i.e. each data point) is named after the Swiss province it measures.

Question: Which variables explain Fertility?

We can map percentage values in \((0, 100)\) to \(\mathbb{R}\) using the logit (log-odds) transformation \(x \leftarrow \log\big(x/(100-x)\big)\). Since the values 0 and 100 may occur, we modify this to \(x \leftarrow \log\big((1+x)/(101-x)\big)\). (We certainly don’t always have to transform variables so that they lie in \(\mathbb{R}\), but we make this transformation here; it increases sensitivity near 0 and 100.)

sw <- swiss
sw[, -1] <- log((swiss[, -1] + 1) / (101 - swiss[, -1]))
(n <- dim(sw)[1])
## [1] 47
(p <- dim(sw)[2])
## [1] 6
pairs(sw, lower.panel = NULL)
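
As a quick sanity check (a sketch; tr and inv are helper names introduced here for illustration), the modified transform is finite at both endpoints and exactly invertible:

# the modified transform handles the endpoints 0 and 100, and inverts
tr  <- function(x) log((1 + x) / (101 - x))
inv <- function(y) (101 * exp(y) - 1) / (1 + exp(y))
tr(c(0, 100))          # log(1/101) and log(101), both finite
max(abs(inv(tr(swiss$Catholic)) - swiss$Catholic))  # near machine zero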

The plan

We will:

  1. fit a normal linear model and look for outliers,

  2. remove them, refit, and look again for outliers,

  3. reduce the model to something simpler which nevertheless predicts the response well, and finally

  4. check again for outliers in the reduced model.

# (i) fit and check for outliers
sw1.lm <- lm(Fertility ~ Infant.Mortality + Examination + Education + Catholic + Agriculture, data = sw)
summary(sw1.lm)
## 
## Call:
## lm(formula = Fertility ~ Infant.Mortality + Examination + Education + 
##     Catholic + Agriculture, data = sw)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.712  -5.246  -0.259   6.853  12.950 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        76.744     11.461    6.70  4.4e-08 ***
## Infant.Mortality   23.627      6.939    3.40   0.0015 ** 
## Examination        -6.209      3.710   -1.67   0.1019    
## Education          -6.832      2.698   -2.53   0.0153 *  
## Catholic            0.823      0.618    1.33   0.1908    
## Agriculture        -1.870      1.690   -1.11   0.2748    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.4 on 41 degrees of freedom
## Multiple R-squared:  0.597,  Adjusted R-squared:  0.548 
## F-statistic: 12.2 on 5 and 41 DF,  p-value: 2.96e-07

We can find the points of high influence and display them as follows. We see that the data points for V. De Geneve and La Vallee have high influence. They must have high misfit (as measured by the standardised residual) or high leverage, or both.

# influence
# round(cooks.distance(sw1.lm), 3)
# round(8/(n - 2*p), 3)
# V. De Geneve and La Vallee are points of high influence

(i <- cooks.distance(sw1.lm) > (8/(n - 2*p)))
##   Courtelary     Delemont Franches-Mnt      Moutier   Neuveville 
##        FALSE        FALSE        FALSE        FALSE        FALSE 
##   Porrentruy        Broye        Glane      Gruyere       Sarine 
##        FALSE        FALSE        FALSE        FALSE        FALSE 
##      Veveyse        Aigle      Aubonne     Avenches     Cossonay 
##        FALSE        FALSE        FALSE        FALSE        FALSE 
##    Echallens     Grandson     Lausanne    La Vallee       Lavaux 
##        FALSE        FALSE        FALSE         TRUE        FALSE 
##       Morges       Moudon        Nyone         Orbe         Oron 
##        FALSE        FALSE        FALSE        FALSE        FALSE 
##      Payerne Paysd'enhaut        Rolle        Vevey      Yverdon 
##        FALSE        FALSE        FALSE        FALSE        FALSE 
##      Conthey    Entremont       Herens     Martigwy      Monthey 
##        FALSE        FALSE        FALSE        FALSE        FALSE 
##   St Maurice       Sierre         Sion       Boudry La Chauxdfnd 
##        FALSE        FALSE        FALSE        FALSE        FALSE 
##     Le Locle    Neuchatel   Val de Ruz ValdeTravers V. De Geneve 
##        FALSE        FALSE        FALSE        FALSE         TRUE 
##  Rive Droite  Rive Gauche 
##        FALSE        FALSE
pairs(sw, lower.panel = NULL, pch = 1 + 15*i, col = 1 + i)
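
The claim that high influence requires high misfit, high leverage, or both can be made precise: Cook’s distance satisfies \(D_i = r_i^2 h_i / \big(p(1-h_i)\big)\), where \(r_i\) is the standardised residual and \(h_i\) the leverage. A quick numerical check (a sketch; here \(p = 6\), the number of columns of sw, happens to equal the number of regression coefficients, intercept included):

# Cook's distance decomposes into misfit and leverage:
# D_i = r_i^2/p * h_i/(1 - h_i); the difference below is rounding error
r <- rstandard(sw1.lm)
h <- hatvalues(sw1.lm)
max(abs(cooks.distance(sw1.lm) - r^2 * h / (p * (1 - h))))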

The data points for V. De Geneve and La Vallee have high leverage in the sense that their leverages exceed \(2p/n\), twice the mean leverage \(p/n\).

How about misfit? V. De Geneve and Rive Gauche lie above the threshold of 2 for high misfit, while Rive Droite and La Vallee have fairly high misfit. However, Rive Gauche and Rive Droite do not have high enough leverages (0.1 and 0.178) to make them a problem. On the other hand, La Vallee has a particularly high leverage (0.379), so its milder misfit (1.7) is still a cause for concern.

# leverage
# round(hatvalues(sw1.lm), 3)
# round(2*p/n, 3)
# V. De Geneve and La Vallee are points of high leverage

# misfit
# round(rstandard(sw1.lm), 3)
# V. De Geneve and Rive Gauche have higher misfit than Rive Droite and La Vallee
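
One way to gather the quantities quoted above into a single table (a sketch; the province names are the row names of the data):

# leverage, misfit and influence for the four provinces discussed above
prov <- c("La Vallee", "V. De Geneve", "Rive Droite", "Rive Gauche")
round(cbind(leverage = hatvalues(sw1.lm),
            misfit   = rstandard(sw1.lm),
            cooks    = cooks.distance(sw1.lm))[prov, ], 3)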

# standard diagnostics
par(mfrow = c(2, 2))

qqnorm(rstudent(sw1.lm), main = NULL, pch = 1 + 15*i, col = 1 + i)
qqline(rstudent(sw1.lm))

plot(fitted.values(sw1.lm), rstudent(sw1.lm), pch = 1 + 15*i, col = 1 + i)
text(fitted.values(sw1.lm), rstudent(sw1.lm), abbreviate(row.names(sw)), adj = -0.2)

plot(hatvalues(sw1.lm), ylim = c(0, 0.6), pch = 1 + 15*i, col = 1 + i)
text(hatvalues(sw1.lm), row.names(sw), srt = 90, adj = -0.1)
abline(2*p/n, 0, lty = 2)

plot(cooks.distance(sw1.lm), ylim = c(-0.2, 0.8), pch = 1 + 15*i, col = 1 + i)
text(cooks.distance(sw1.lm), row.names(sw), srt = 90, adj = 1.1)
abline(8/(n - 2*p), 0, lty = 2)

In the first two diagnostic plots above:

  • the Q-Q plot is acceptable, though the two outliers sit in the extremes,

  • the scatter in the plot of studentised residuals against fitted values is somewhat clustered, with some points “outlying”.

This clustering hints at a model violation somewhere. The pairs() plot further above, however, seems to show reasonable linearity in at least some of the explanatory variables.
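
To complement the visual Q-Q assessment, one could add a formal normality test of the studentised residuals (a sketch, not part of the original analysis; shapiro.test() is in base R's stats package):

# formal normality check of the studentised residuals; a small p-value
# would indicate departure from normality
shapiro.test(rstudent(sw1.lm))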

If, after considering background knowledge of how the data were gathered and the physical plausibility of the candidate outlier values (knowledge we do not have in this case), we decide that the high-influence points are indeed outliers, we may remove them and refit.

# (ii) remove, refit, check again for outliers
swr <- sw[-which(i), ]
nr <- dim(swr)[1]

swr1.lm <- lm(Fertility ~ Infant.Mortality + Examination + Education + Catholic + Agriculture, data = swr)
summary(swr1.lm)
## 
## Call:
## lm(formula = Fertility ~ Infant.Mortality + Examination + Education + 
##     Catholic + Agriculture, data = swr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.753  -4.791   0.365   5.688  12.129 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         82.50      12.38    6.67  6.2e-08 ***
## Infant.Mortality    26.96       8.26    3.27   0.0023 ** 
## Examination         -6.79       3.52   -1.93   0.0611 .  
## Education           -5.96       2.55   -2.34   0.0247 *  
## Catholic             1.03       0.60    1.71   0.0951 .  
## Agriculture         -2.83       1.77   -1.61   0.1165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.87 on 39 degrees of freedom
## Multiple R-squared:  0.572,  Adjusted R-squared:  0.517 
## F-statistic: 10.4 on 5 and 39 DF,  p-value: 2.14e-06
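
Before interpreting the refit, a quick side-by-side of the two coefficient vectors makes the change explicit (a sketch; both fitted objects are already available):

# compare coefficient estimates before and after removing the two points
round(cbind(full = coef(sw1.lm), refit = coef(swr1.lm)), 2)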

Considering that we removed just two of the 47 data points, the parameter estimates and significance levels have changed a fair bit. For example, Examination and Catholic are now marginally significant. Before we continue, let's repeat the diagnostics: outliers can mask other outliers, so we may find new points of high influence in the reduced data set. The new diagnostics are plotted below.

any(j <- cooks.distance(swr1.lm) > (8/(nr - 2*p)))
## [1] FALSE
par(mfrow = c(2, 2))

qqnorm(rstudent(swr1.lm), main = NULL, pch = 1 + 15*j, col = 1 + j)
qqline(rstudent(swr1.lm))

plot(fitted.values(swr1.lm), rstudent(swr1.lm), pch = 1 + 15*j, col = 1 + j)
text(fitted.values(swr1.lm), rstudent(swr1.lm), abbreviate(row.names(swr)), adj = -0.2)

plot(hatvalues(swr1.lm), ylim = c(0, 0.6), pch = 1 + 15*j, col = 1 + j)
text(hatvalues(swr1.lm), row.names(swr), srt = 90, adj = -0.1)
abline(2*p/nr, 0, lty = 2)

plot(cooks.distance(swr1.lm), ylim = c(-0.2, 0.8), pch = 1 + 15*j, col = 1 + j)
text(cooks.distance(swr1.lm), row.names(swr), srt = 90, adj = 1.1)
abline(8/(nr - 2*p), 0, lty = 2)
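
No new points of high influence are flagged, so we can proceed to step 3 of the plan. As a preview, here is a minimal sketch of one standard reduction approach, AIC-based backward selection via step(); swr2.lm is an illustrative name, and the reduction could equally be done by hand with t- or F-tests:

# one possible automatic reduction: backward selection by AIC
# (a sketch only; manual reduction via F-tests is an alternative)
swr2.lm <- step(swr1.lm, trace = 0)
summary(swr2.lm)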