Prelims Data Analysis TT 2017

Course lecturer : Professor Jonathan Marchini   marchini@stats.ox.ac.uk

The aim of the course is to introduce students to the theory and practice of unsupervised learning.

Unsupervised learning can be described as finding structure in datasets, and has applications in many areas such as finance, retail, medical imaging, sports performance analysis, genetics, medicine, studies of the environment and social networks.

Unsupervised learning methods are important parts of Computational Statistics, Machine Learning, Artificial Intelligence and Big Data.

Motivating example

Raw dataset: a 300 x 8686 matrix of gene expression measurements from Pollen et al. (2014) Nature Biotechnology 32, 1053-1058.

Viewing the raw data, it is very difficult to see any clear structure or similarity between the samples.



3D projection and clustering: Principal Components Analysis (PCA) has been applied to the dataset in order to uncover structure. A clustering method (k-means) has then been applied to group the observations into distinct clusters.

Students will learn the theory and practical skills to reproduce this analysis.
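As a rough illustration of the two steps described above (project with PCA, then cluster with k-means), here is a minimal sketch in Python using only NumPy. It is not the analysis code behind the figures, and the optional exercises themselves use R or Matlab. Everything in it is an assumption made for illustration: the small synthetic dataset (300 x 50 rather than the real 300 x 8686 gene expression matrix), the function names pca_project and kmeans, the choice of 3 clusters, and the deterministic farthest-point initialisation, which is used here only so the result is reproducible.

```python
import numpy as np


def pca_project(X, n_components=3):
    """Project the rows of X onto the leading principal components."""
    Xc = X - X.mean(axis=0)  # centre each column (variable)
    # SVD of the centred data: the rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


def kmeans(Z, k, n_iter=100):
    """Lloyd's algorithm with a farthest-point initialisation (an assumption)."""
    # Initialise: start from the first point, then repeatedly add the point
    # that is farthest from all centres chosen so far.
    centres = [Z[0]]
    for _ in range(1, k):
        d = np.min([((Z - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(Z[d.argmax()])
    centres = np.array(centres)

    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centre
        dist = ((Z[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Update step: each centre moves to the mean of its points
        new_centres = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres


# Synthetic stand-in for the real data: 3 groups of 100 samples, 50 variables.
rng = np.random.default_rng(0)
n_per_group, n_features = 100, 50
truth = np.repeat(np.arange(3), n_per_group)
means = np.zeros((3, n_features))
for g in range(3):
    means[g, 10 * g:10 * (g + 1)] = 6.0  # each group is shifted in its own block
X = means[truth] + rng.normal(size=(3 * n_per_group, n_features))

Z = pca_project(X, n_components=3)  # 300 x 3 matrix of PC scores
labels, centres = kmeans(Z, k=3)    # one cluster label per sample
```

Plotting the three columns of Z against each other, coloured by labels, gives a picture of the same kind as the 3D projection described above. With real data the number of clusters is not known in advance and the groups are rarely this well separated.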


Course notes

The course notes are here: course_notes_17-5-17.pdf

Note: these may be updated slightly during the course.

The course synopsis is here https://courses.maths.ox.ac.uk/node/23

Lectures

Students should take notes in each lecture, but I will use slides as visual aids to illustrate various concepts and results.

slides.pdf

Exercise sheets

There will be 3 exercise sheets for this part of the course.


Exercise sheet 1: sheet6.pdf
Exercise sheet 2: sheet7.pdf
Exercise sheet 3: sheet8.pdf

Specimen exam questions

The Paper III specimen papers now include questions on the material in these 6 lectures. These can be found here

https://www1.maths.ox.ac.uk/members/students/undergraduate-courses/examinations-assessments/past-papers/prelims-2013-2016

Optional exercises in R or Matlab

Each sheet will contain a mix of written questions and optional questions to be done using either R or Matlab.

It is up to each college tutor to decide whether students should attempt these questions, but it is strongly recommended, as these questions will help with understanding of the theory.

Modern statistics is pervasive in the era of "Big Data". The majority of Maths graduates will go on to careers that involve some use of data, so a firm practical grounding in statistical analysis is highly valuable. An aim of this course is to get students started on being able to independently carry out statistical data analysis.

As many students will not have worked with R, here is a short tutorial document that introduces R and shows students how to install it and get started with some basics.

R_intro.pdf

Future Courses

This course leads on to several more advanced courses in future years that students should consider if they wish to learn more about Statistical Data Analysis, Machine Learning, Big Data and Artificial Intelligence.

Part A
Part B
Part C
Probability
Statistics
Simulation and Statistical Programming
Foundations of Statistical Inference
Statistical Data Mining and Machine Learning
Advanced Simulation Methods

Book

The following book gives a good overview of the methods covered in this course:

G. James, D. Witten, T. Hastie, R. Tibshirani An Introduction to Statistical Learning (with Applications in R) (Springer 2013)

This book is freely available online here http://www-bcf.usc.edu/~gareth/ISL/

Chapter 10 covers unsupervised learning.