```
# Swiss fertility data used to illustrate outlier analysis
options(digits = 3)
head(swiss)
```

```
## Fertility Agriculture Examination Education Catholic
## Courtelary 80.2 17.0 15 12 9.96
## Delemont 83.1 45.1 6 9 84.84
## Franches-Mnt 92.5 39.7 5 5 93.40
## Moutier 85.8 36.5 12 7 33.77
## Neuveville 76.9 43.5 17 15 5.16
## Porrentruy 76.1 35.3 9 7 90.57
## Infant.Mortality
## Courtelary 22.2
## Delemont 22.2
## Franches-Mnt 20.2
## Moutier 20.3
## Neuveville 20.6
## Porrentruy 26.6
```

The `swiss` dataset (see `?swiss` for details) has n = 47 observations of fertility (from about 1888) with 5 potentially explanatory variables:

- Fertility, a standardized fertility measure
- Agriculture, % of males involved in agriculture as occupation
- Examination, % of draftees receiving the highest mark on the army examination
- Education, % of draftees with education beyond primary school
- Catholic, % Catholic
- Infant.Mortality, % of live births who live less than 1 year.

Note that each row (i.e. each data point) is named after the Swiss province it measures.

Question: Which variables explain Fertility?

We can map percentage values in (0, 100) to \(\mathbb{R}\) using the logistic transformation \(x \leftarrow \log(x/(100-x))\). Since 0 and 100 may appear in the data, we modify this to \(x \leftarrow \log((1+x)/(101-x))\). (We certainly don’t always have to transform so that variables lie in \(\mathbb{R}\), but we make this transformation here. It increases sensitivity near 0 and 100.)
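A quick check of why the shifted version is needed: the plain logit is infinite at the boundary values 0 and 100, while the modified transform stays finite over the whole closed interval.

```
log(0 / (100 - 0))            # plain logit at x = 0: -Inf
log((0 + 1) / (101 - 0))      # shifted version at x = 0: finite
log((100 + 1) / (101 - 100))  # shifted version at x = 100: finite
```

By symmetry, the shifted transform maps 0 and 100 to \(\mp\log(101)\).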

```
sw <- swiss
sw[, -1] <- log((swiss[, -1] + 1) / (101 - swiss[, -1]))
(n <- dim(sw)[1])
```

`## [1] 47`

`(p <- dim(sw)[2])`

`## [1] 6`

`pairs(sw, lower.panel = NULL)`

We will:

1. fit a normal linear model and look for outliers,
2. remove them, refit, and look again for outliers,
3. reduce the model to something simple which nevertheless predicts the response, and finally
4. check again for outliers in the reduced model.

```
# (i) fit and check for outliers
sw1.lm <- lm(Fertility ~ Infant.Mortality + Examination + Education + Catholic + Agriculture, data = sw)
summary(sw1.lm)
```

```
##
## Call:
## lm(formula = Fertility ~ Infant.Mortality + Examination + Education +
## Catholic + Agriculture, data = sw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.712 -5.246 -0.259 6.853 12.950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 76.744 11.461 6.70 4.4e-08 ***
## Infant.Mortality 23.627 6.939 3.40 0.0015 **
## Examination -6.209 3.710 -1.67 0.1019
## Education -6.832 2.698 -2.53 0.0153 *
## Catholic 0.823 0.618 1.33 0.1908
## Agriculture -1.870 1.690 -1.11 0.2748
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.4 on 41 degrees of freedom
## Multiple R-squared: 0.597, Adjusted R-squared: 0.548
## F-statistic: 12.2 on 5 and 41 DF, p-value: 2.96e-07
```

We can find the points of high influence and display them as below. We see that the data points for `V. De Geneve` and `La Vallee` have high influence. They must have high misfit (as measured by standardised residual) or high leverage, or both.

```
# influence
# round(cooks.distance(sw1.lm), 3)
# round(8/(n - 2*p), 3)
# V. De Geneve and La Vallee are points of high influence
(i <- cooks.distance(sw1.lm) > (8/(n - 2*p)))
```

```
## Courtelary Delemont Franches-Mnt Moutier Neuveville
## FALSE FALSE FALSE FALSE FALSE
## Porrentruy Broye Glane Gruyere Sarine
## FALSE FALSE FALSE FALSE FALSE
## Veveyse Aigle Aubonne Avenches Cossonay
## FALSE FALSE FALSE FALSE FALSE
## Echallens Grandson Lausanne La Vallee Lavaux
## FALSE FALSE FALSE TRUE FALSE
## Morges Moudon Nyone Orbe Oron
## FALSE FALSE FALSE FALSE FALSE
## Payerne Paysd'enhaut Rolle Vevey Yverdon
## FALSE FALSE FALSE FALSE FALSE
## Conthey Entremont Herens Martigwy Monthey
## FALSE FALSE FALSE FALSE FALSE
## St Maurice Sierre Sion Boudry La Chauxdfnd
## FALSE FALSE FALSE FALSE FALSE
## Le Locle Neuchatel Val de Ruz ValdeTravers V. De Geneve
## FALSE FALSE FALSE FALSE TRUE
## Rive Droite Rive Gauche
## FALSE FALSE
```

`pairs(sw, lower.panel = NULL, pch = 1 + 15*i, col = 1 + i)`

The data points for `V. De Geneve` and `La Vallee` have high leverage in the sense that their leverages exceed twice the mean leverage \(p/n\).

What about misfit? V. De Geneve and Rive Gauche are above the threshold for high misfit (standardised residual exceeding 2 in absolute value). Rive Droite and La Vallee have moderately high misfit. However, Rive Gauche and Rive Droite do not have high enough leverages (0.1 and 0.178) to make them a problem. On the other hand, La Vallee has a particularly high leverage (0.379), so its milder misfit (1.7) is still a cause for concern.
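The three diagnostics quoted above can be gathered into one table for comparison. This is a sketch that repeats the transform and fit so the chunk runs on its own; the names `fit`, `diags`, and `flagged` are illustrative, not from the text.

```
# repeat the transform and fit so this chunk is self-contained
sw <- swiss
sw[, -1] <- log((swiss[, -1] + 1) / (101 - swiss[, -1]))
n <- nrow(sw); p <- ncol(sw)
fit <- lm(Fertility ~ ., data = sw)  # same model, written compactly
# one table of the three diagnostics, with their thresholds in comments
diags <- data.frame(
  leverage = hatvalues(fit),       # high if > 2p/n
  misfit   = rstandard(fit),       # high if |misfit| > 2
  cooks    = cooks.distance(fit)   # high if > 8/(n - 2p)
)
flagged <- diags$leverage > 2 * p / n |
  abs(diags$misfit) > 2 |
  diags$cooks > 8 / (n - 2 * p)
round(diags[flagged, ], 3)
```

This prints only the provinces that trip at least one of the three rules, so the leverage, misfit, and influence claims above can be checked per province at a glance.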

```
# leverage
# round(hatvalues(sw1.lm), 3)
# round(2*p/n, 3)
# V. De Geneve and La Vallee are points of high leverage
# misfit
# round(rstandard(sw1.lm), 3)
# V. De Geneve and Rive Gauche have higher misfit than Rive Droite and La Vallee
# standard diagnostics
par(mfrow = c(2, 2))
qqnorm(rstudent(sw1.lm), main = NULL, pch = 1 + 15*i, col = 1 + i)
qqline(rstudent(sw1.lm))
plot(fitted.values(sw1.lm), rstudent(sw1.lm), pch = 1 + 15*i, col = 1 + i)
text(fitted.values(sw1.lm), rstudent(sw1.lm), abbreviate(row.names(sw)), adj = -0.2)
plot(hatvalues(sw1.lm), ylim = c(0, 0.6), pch = 1 + 15*i, col = 1 + i)
text(hatvalues(sw1.lm), row.names(sw), srt = 90, adj = -0.1)
abline(2*p/n, 0, lty = 2)
plot(cooks.distance(sw1.lm), ylim = c(-0.2, 0.8), pch = 1 + 15*i, col = 1 + i)
text(cooks.distance(sw1.lm), row.names(sw), srt = 90, adj = 1.1)
abline(8/(n - 2*p), 0, lty = 2)
```

In the first two diagnostic plots above:

- the Q-Q plot is acceptable, though note that the two outliers sit at the extremes,
- the scatter in the plot of studentised residuals against fitted values is somewhat clustered, with some points “outlying”.

This clustering suggests a model violation somewhere. The `pairs()` plot further above seems to show reasonable linearity in at least some explanatory variables.

If we decide that the high-influence points are indeed outliers (after considering background knowledge of the data gathering and the physical plausibility of the candidate outlier values, neither of which we have in this case), we may remove them and refit.

```
# (ii) remove, refit, check again for outliers
swr <- sw[-which(i), ]
nr <- dim(swr)[1]
swr1.lm <- lm(Fertility ~ Infant.Mortality + Examination + Education + Catholic + Agriculture, data = swr)
summary(swr1.lm)
```

```
##
## Call:
## lm(formula = Fertility ~ Infant.Mortality + Examination + Education +
## Catholic + Agriculture, data = swr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.753 -4.791 0.365 5.688 12.129
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.50 12.38 6.67 6.2e-08 ***
## Infant.Mortality 26.96 8.26 3.27 0.0023 **
## Examination -6.79 3.52 -1.93 0.0611 .
## Education -5.96 2.55 -2.34 0.0247 *
## Catholic 1.03 0.60 1.71 0.0951 .
## Agriculture -2.83 1.77 -1.61 0.1165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.87 on 39 degrees of freedom
## Multiple R-squared: 0.572, Adjusted R-squared: 0.517
## F-statistic: 10.4 on 5 and 39 DF, p-value: 2.14e-06
```

Considering that we removed just two of the 47 data points, the parameters and significance levels have changed a fair bit. For example, Examination and Catholic are now marginally significant. Before we continue, let’s repeat the diagnostics: outliers can mask other outliers, so we may find new points of high influence in the reduced data set. The new diagnostics are plotted below.
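The change in the estimates is easier to see side by side. The following sketch rebuilds both fits so that it runs on its own; the column labels `full` and `reduced` are illustrative.

```
# rebuild both fits so this chunk is self-contained
sw <- swiss
sw[, -1] <- log((swiss[, -1] + 1) / (101 - swiss[, -1]))
sw1.lm <- lm(Fertility ~ ., data = sw)
i <- cooks.distance(sw1.lm) > 8 / (nrow(sw) - 2 * ncol(sw))
swr1.lm <- lm(Fertility ~ ., data = sw[!i, ])
# coefficients before and after removing the two high-influence provinces
round(cbind(full = coef(sw1.lm), reduced = coef(swr1.lm)), 2)
```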

`any(j <- cooks.distance(swr1.lm) > (8/(nr - 2*p)))`

`## [1] FALSE`

```
par(mfrow = c(2, 2))
qqnorm(rstudent(swr1.lm), main = NULL, pch = 1 + 15*j, col = 1 + j)
qqline(rstudent(swr1.lm))
plot(fitted.values(swr1.lm), rstudent(swr1.lm), pch = 1 + 15*j, col = 1 + j)
text(fitted.values(swr1.lm), rstudent(swr1.lm), abbreviate(row.names(swr)), adj = -0.2)
plot(hatvalues(swr1.lm), ylim = c(0, 0.6), pch = 1 + 15*j, col = 1 + j)
text(hatvalues(swr1.lm), row.names(swr), srt = 90, adj = -0.1)
abline(2*p/nr, 0, lty = 2)
plot(cooks.distance(swr1.lm), ylim = c(-0.2, 0.8), pch = 1 + 15*j, col = 1 + j)
text(cooks.distance(swr1.lm), row.names(swr), srt = 90, adj = 1.1)
abline(8/(nr - 2*p), 0, lty = 2)
```
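Step (iii) of the plan, the model reduction, is not carried out above. One reasonable sketch, assuming backward elimination by AIC via `step()` (the text does not commit to a particular method), with the illustrative name `swr2.lm`:

```
# rebuild the reduced-data fit so this chunk is self-contained
sw <- swiss
sw[, -1] <- log((swiss[, -1] + 1) / (101 - swiss[, -1]))
full <- lm(Fertility ~ ., data = sw)
i <- cooks.distance(full) > 8 / (nrow(sw) - 2 * ncol(sw))
swr1.lm <- lm(Fertility ~ ., data = sw[!i, ])
# (iii) backward elimination by AIC; trace = 0 suppresses the per-step log
swr2.lm <- step(swr1.lm, trace = 0)
summary(swr2.lm)
# step (iv) would repeat the Cook's distance check on swr2.lm
```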