Prelims Data Analysis TT 2017

Course lecturer : Professor Jonathan Marchini

The aim of the course is to introduce students to the theory and practice of unsupervised learning.

Unsupervised learning can be described as finding structure in datasets, and has applications in many areas such as finance, retail, medical imaging, sports performance analysis, genetics, medicine, studies of the environment and social networks.

Unsupervised learning methods are important parts of Computational Statistics, Machine Learning, Artificial Intelligence and Big Data.

Motivating example

Raw dataset : 300 x 8686 matrix of gene expression measurements from Pollen et al. (2014) Nature Biotechnology 32, 1053-1058

When viewing the raw data, it is very difficult to see any clear structure or similarity between the samples.

3D Projection and clustering : Principal Components Analysis (PCA) has been applied to the dataset in order to uncover structure, projecting the samples onto three dimensions. A clustering method (k-means) has then been applied to group the observations into distinct clusters.
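As a rough illustration of these two steps, here is a minimal sketch in Python using only the standard library. It is not the code used to produce the analysis above, and the course exercises themselves use R or Matlab. The function names pca_scores and kmeans, the toy 6 x 3 dataset, and the deterministic choice of starting centres are all assumptions made for this example. PCA is computed here by power iteration with deflation on the sample covariance matrix, and k-means by Lloyd's algorithm.

```python
import math

def pca_scores(X, q, n_iter=500):
    """Project the rows of X onto the first q principal components.

    Hypothetical helper: uses power iteration with deflation, which is
    adequate for a small illustration but not for large datasets.
    """
    n, p = len(X), len(X[0])
    means = [sum(col) / n for col in zip(*X)]
    Xc = [[x - m for x, m in zip(row, means)] for row in X]  # centre columns
    # Sample covariance matrix (p x p).
    S = [[sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(p)]
         for i in range(p)]
    components = []
    for _comp in range(q):
        v = [1.0 / math.sqrt(p)] * p  # fixed starting vector (assumption)
        for _step in range(n_iter):
            w = [sum(S[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Eigenvalue estimate v' S v, then remove this direction from S.
        lam = sum(v[i] * S[i][j] * v[j] for i in range(p) for j in range(p))
        components.append(v)
        S = [[S[i][j] - lam * v[i] * v[j] for j in range(p)] for i in range(p)]
    return [[sum(x * c for x, c in zip(row, comp)) for comp in components]
            for row in Xc]

def kmeans(Z, k, n_iter=100):
    """Lloyd's algorithm with evenly spaced rows as starting centres.

    The deterministic start is a simplification for reproducibility;
    practical implementations usually use several random starts.
    """
    n = len(Z)
    centres = [list(Z[i * n // k]) for i in range(k)]
    labels = None
    for _step in range(n_iter):
        # Assignment step: nearest centre in squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(z, centres[c])))
            for z in Z
        ]
        if new_labels == labels:
            break  # assignments unchanged, so the algorithm has converged
        labels = new_labels
        # Update step: each centre becomes the mean of its members.
        for c in range(k):
            members = [z for z, lab in zip(Z, labels) if lab == c]
            if members:
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy data (invented for illustration): two well separated groups of 3 points.
X = [
    [0.0, 0.2, 0.1],
    [0.3, 0.0, 0.0],
    [0.1, 0.4, 0.2],
    [5.0, 5.1, 1.0],
    [5.2, 4.9, 1.1],
    [4.9, 5.3, 0.9],
]

scores = pca_scores(X, q=2)   # 6 x 2 matrix of principal component scores
labels = kmeans(scores, k=2)  # cluster label for each observation
print(labels)
```

On this toy dataset the first three observations fall in one cluster and the last three in the other. The real analysis differs only in scale: the projection keeps three components rather than two, and the input is the 300 x 8686 expression matrix.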

Students will learn the theory and practical skills to reproduce this analysis.

Course notes

Here is a link to the course notes: course_notes_17-5-17.pdf

These may be updated slightly during the course.

The course synopsis is here


Students should take notes in each lecture. I will also use slides as visual aids to illustrate various concepts and results.


Exercise sheets

There will be 3 exercise sheets for this part of the course.

Exercise sheet

Specimen exam questions

The Paper III specimen papers now include questions on the material in these 6 lectures. These can be found here

Optional exercises in R or Matlab

Each sheet will contain a mix of written questions and optional questions to be done using either R or Matlab.

It is up to each college tutor to decide whether students should attempt these questions, but it is strongly recommended, as these questions will help with understanding of the theory.

Modern statistics is pervasive in the era of "Big Data". The majority of Maths graduates will go on to careers that involve some use of data, so a firm practical grounding in statistical analysis is highly valuable. An aim of this course is to get students started on being able to independently carry out statistical data analysis.

As many students will not have worked with R, here is a short tutorial document that introduces R, shows how to install it, and covers some basics to get started.


Future Courses

This course leads on to several more advanced courses in future years that students should consider if they wish to learn more about Statistical Data Analysis, Machine Learning, Big Data and Artificial Intelligence.

Part A
Part B
Part C
Simulation and Statistical Programming
Foundations of Statistical Inference
Statistical Data Mining and Machine Learning
Advanced Simulation Methods


The following book gives a good overview of the methods covered in this course

G. James, D. Witten, T. Hastie, R. Tibshirani, An Introduction to Statistical Learning (with Applications in R) (Springer, 2013)

This book is freely available online here

Chapter 10 covers unsupervised learning.