MS1 Statistical Data Mining Hilary 2007

Draft Synopsis

Announcement

New:
Revision Class, MSc, Trinity 2007, SPR1 lecture room, Week 6 Friday 3-4
Prepare questions from last year's exams (Part C and MSc).

Old:
12 lectures, 4 computer lab sessions (replacing the last 4 lectures). 
The lab sessions are not assessed, and there are 4 exam questions, not 3.

Lecture supplements (2007)

Lecture 1 pdf bibliography

Lecture 1-2 Principal components and Biplots lect107pca.R.

Lecture 3 ICA lect307ica.R.

Lecture 4 MDS lect407mds.R, villages.R data (see R-code for note on data source)

Lecture 6 K-means lect607kmn.R and hierarchical agglomerative clustering lect607hcl.R

Lecture 7 Vector Quantization and image compression lect707vqz.R

Lecture 8 LDA lect807lda.R

Lecture 10 classification trees lect1007ctr.R and data spam.R and spamt.R from the UCI Machine Learning repository.

Problem sheets 1-7

Problem Sheet 1: ms1ps107.pdf

Problem Sheet 2: ms1ps207.pdf

Problem Sheet 3: ms1ps307.pdf

Problem Sheet 4: ms1ps407.pdf

Problem Sheet 5: ms1ps507.pdf

Problem Sheet 6: ms1ps607.pdf

Problem Sheet 7: ms1ps707.pdf with data from the UCI Machine Learning repository:
for Q1, agaricus-lepiota.data and agaricus-lepiota.names; for Q2, wine.data and wine.names.

UG practical classes 2007

(nnets) R-script and pdf for the 1st lab (Week 7, Wed 9-10 SPR2.205)

(nnets, V-fold cross validation) R-script and pdf for the 2nd lab (Week 8, Mon 2-3 SPR2.205)

(V-fold cross validation and classification trees) R-script and pdf for the 3rd lab (Week 8, Wed 9-10 SPR2.205)

(MDS and agglomerative clustering) Data villages.R, R-sample answers and pdf for the 4th lab (Week 8, Wed 10-11 SPR2.205)

MSc Applied Statistics practical class

(2006) Data villages.R and the practical question sheet and sample R-solutions and Splus-solutions (thanks Kenny)

(2007) 1 x 4 hours, 2pm-6pm, Friday week 8

Part 1: R-script and pdf

Part 2: R-script and pdf

Part 3: R-script and pdf

MSc Assessed Practical Trinity week 5 2007

You may find these sample answers and the corresponding R-file of interest (particularly if you didn't do these practical questions).

MSc Applied and Computational Mathematics Hilary 2006

(2006) Take home test, data spam.R and spamt.R.

The data for this sheet came from the UCI Machine Learning repository.

Reading material: online

Bioinformatics SDM lecture notes (Vos, Evers)

Pattern Recognition and Expert Systems (Northrop)

Multivariate (including principal components, biplots and scaling) (Ripley)

Independent components (Hyvärinen, Oja)

Reading material: books

1) Primary sources:
- T. Hastie, R. Tibshirani, J. H. Friedman ``The Elements of Statistical Learning'',
  Springer Series in Statistics, 2001
- W.N. Venables and B.D. Ripley, ``Modern applied statistics with S'', 4th ed.,
  Springer Series in Statistics, 2002
- D. Hand, H. Mannila and P. Smyth, ``Principles of Data Mining'', MIT Press, 2001

2) Useful for reference, examples, problems:
- G. A. F. Seber, ``Multivariate Observations'', John Wiley, 1984
- J. D. Jobson, ``Applied Multivariate Data Analysis, vol. II: Categorical and
  Multivariate Methods'', Springer Verlag, 1992

nicholls@stats.ox.ac.uk