Third Year BS1: Applied Statistics 23 April 2007
Stop Press - Revision
There will be revision sessions in the Mathematical Institute, Lecture Room 1 on Tuesday 15 May (Week 4) and Tuesday 22 May (Week 5)at 10.30 am.
At the first of these I shall go through the BS1: Applied Statistics paper for 2005, and at the second the BS1: Applied Statistics paper for 2006.
You are advised to try the relevant paper for yourself, preferably timed, before each session.
Great statisticians: Andrei Nikolaevich Kolmogorov (1903 - 1987) and Nikolai Vasel'yevich Smirnov (1900 - 1966)

This web page provides all of the material for the BS1: Applied Statistics lecture course given by Daniel Lunn in Michaelmas Term 2006. Files will be given in both pdf and postscript versions.
Lecture Notes
Practicals
Problem Sheets
Useful results
A list of useful distributions and some of their moments is available.
Postscript files
The files listed below are postscript versions.
Postscript Lecture Notes
Postscript practicals
Postscript Problem Sheets
A list of useful distributions and some of their moments is available in postscript format.
Data files
The following data files are used in the course: they are given here as text files.
accident contains a record of factory accidents used in Practical 2.
bats contains bat-to-prey detection distances, the angle at which an insect was detected and whether it is to the left or right.
biology used in the Rcode examples.
bodyfat contains the data for Question 4 on Problem Sheet 1. It comprises body fat percentage, age and sex for 18 adults aged between 23 and 61 years.
byss is a data set from a 1973 study of 5419 workers undertaken by a large cotton textile company. The variables are number of cases of byssinosis, total, smoking status (1 = smoker), sex (1 = male), race (1 = white), employment record (1 = < 10 years, 2 = 10 - 19 years, 3 = >19 years), dustiness of workplace (1 = most dusty, 2 = less dusty, 3 = least dusty).
clot gives the clotting times (in seconds) of 9 different percentage concentrations with prothrombin-free plasma (u). Clotting was induced with two different lots of thromboplastin.
coins is the data set used for the specimen practical. The data are measurements of the silver content (% Ag) of a number of Byzantine coins discovered in Cyprus. Nine of the coins came from the first coinage of the reign of King Manuel I, Comnemus (1143-1180); there were seven from the second coinage minted several years later and four from the third coinage, minted later still; another seven were from a fourth coinage.
crime is a data set on crime in the USA.The variables are Y: Crime rate, X1: Number of males aged 14-24 per 1000 of total state population, X2: Southern states (X2=1), others (X2=0), X3: Mean number of years schooling ×10 of the population aged 25 or more, X4: Police expenditure (in $) per person by state and local government in 1960, X5: Police expenditure (in $) per person by state and local government in 1959, X6: Labour force participation per 1000 civilian urban males in age group 14-24, X7: Number of males per 1000 females, X8: State population in hundred thousands, X9: Number of non-whites per 1000, X10: Unemployment rate of urban males per 1000 in the age group 14-24, X11: Unemployment rate of urban males per 1000 in the age group 35-59, X12: Median value of family income or transferable goods and assets (unit $10), X13: Number of families per 1000 earning below one half of the median income.
depress is a data set on seriously emotionally disturbed (SED) and learning disabled (LD) adolescents. The variables are the number with high depression in a group, the total number in the group, sex (1 = female), SED or LD (1 = SED), age 12 - 14, age 15 - 16, age 17 - 18.
eskulls has measurements of maximum head breadth made on skulls of ancient Etruscans and modern Italians to ascertain whether Etruscans were native Italians
insurance is a data set on insurance availability in Chicago census neighbourhoods in 1977-78. Variables are : 1. Percentage of minority races, 2. Number of fires/1000 housing units, 3. Number of thefts/1000 population, 4. Percentage of housing units built before 1940, 5. Net new homeowner policies/100 housing units, 6. Median family income ($1000's).
island contains data on plant species in the Galapagos islands. The variables are name of island, number of species, number of native species, area, elevation, distance from nearest island, distance from Santa Cruz, area of adjacent island.
melanoma shows the incidence of non-melanoma skin cancer among women, statified by age, in two metropolitan areas of the USA. The variables are age, number of cases, population size, region.
mice contains measurements of the BSA nitrogen bound of three groups of diabetic mice.
mices is a three-column re-arrangement of the mice.txt dataset into a single column with two columns of indicators defining the groups.
oesoph comes from a case-control study investigating risk factors for oesophagal cancer. The variables are number of cases, total, grams of tobacco per day (0 = 0 - 9, 1 = 10 - 19, 2 = 20+), grams of alcohol per day (0 = 0 - 39, 1 = 40 - 79, 2 = 80+).
oesopha is the same data set as oesoph.txt given above, but with indicator variables.
otters is a large data set containing frequencies with which captive North American otters groom themselves. The variables are Group (A to H), breeding or non-breeding season, observation time (in minutes), identity of groomer, identity of recipient, grooming frequency.
piglets is a dataset comprising gains in weight of 12 young piglets of the same breed, with three pigs from each of four litters allocated at random to three feeds A, B and C.
rats contains counts of fibres in rat skeletal muscle. A group of fibres is called a fascicle and any fascicle contains fibres of two different types, these being Type I and Type II. Type I fibres are further subdivided into reticulated, punctate, both reticulated and punctate.
rubber measures abrasion loss for tyre rubber with different values of hardness and tensile strength.
sport contains data collected by government officials in the Republic of Ireland as part of an investigation into the nation's general health and is used for Practical 3. The government wished to introduce measures to promote the taking of regular exercise and therefore decided it was necessary to assess the factors which affect continued participation in sport.
sprints contains times recorded by winners of men's Olympic sprint finals from 1900 to 2000 (no Olympic Games in 1916, 1940 & 1944), along with the altitude of the Olympic venue..
sprintspeed contains transformation of the times of sprints.txt into average speed for the race, along with a re-arrangement into the columns Speed, Year, Distance, Altitude.
testos contains testosterone levels of eight chronic schizophrenic patients during three treatment periods, patient identity and treatment being designated by indicator variables.
werner contains data on the dependence of blood cholesterol levels in women on age, height, weight, the birth control pill, and levels in the blood of albumin, calcium and uric acid. It is used in Practical 1.