Multivariate Analysis, HT2007

A 3-hour module for the M.Sc. in Applied Statistics in Hilary Term weeks 1 to 3. It follows the sections of Venables & Ripley (2002) specified under `relevant books'.

Note that prior to 2005-6 there was an 8-hour course in Multivariate Analysis, but much of the material has been moved to Statistical Data Mining or Further Statistical Methods. The purpose of this module is to give an overview and to relate concepts found in those two courses.

Synopsis

What `multivariate analysis' is (and is not).

Graphical methods. Brush and Spin, Projection pursuit.

Principal component and factor analysis.
[Factor analysis is covered in Further Statistical Methods, and PCA in the optional Statistical Data Mining. This lecture will go over PCA from several viewpoints and explain why it is frequently confused with factor analysis.]
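
As a small illustration of two of those viewpoints, the sketch below computes the first principal component of a toy dataset in two ways: as the leading eigenvector of the sample covariance matrix, and as the direction that maximizes the variance of the projected data. It is not taken from the lecture material. The data values and all function names are made up for illustration, Python is used only to keep the sketch self-contained, and the brute-force angle search works only in two dimensions.

```python
import math

# Toy 2-D data, invented for illustration (not one of the course datasets).
data = [(2.0, 1.9), (0.5, 0.7), (3.1, 2.8), (1.2, 1.0), (4.0, 4.3), (2.6, 2.2)]

def centre(rows):
    """Subtract the column means from every row."""
    n = len(rows)
    means = [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]
    return [[r[j] - means[j] for j in range(len(means))] for r in rows]

def covariance(rows):
    """Sample covariance matrix (divisor n - 1) of already-centred rows."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
            for i in range(p)]

def leading_eigen(mat, iters=500):
    """Viewpoint 1: leading eigenpair of the covariance matrix (power iteration)."""
    p = len(mat)
    v = [1.0] * p
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v' S v for the unit vector v gives the eigenvalue.
    value = sum(v[i] * mat[i][j] * v[j] for i in range(p) for j in range(p))
    return value, v

def projected_variance(rows, direction):
    """Sample variance of centred rows projected onto a unit vector."""
    scores = [sum(a * b for a, b in zip(r, direction)) for r in rows]
    return sum(s * s for s in scores) / (len(rows) - 1)

def best_direction_by_search(rows, steps=3600):
    """Viewpoint 2: scan unit vectors (cos t, sin t), t in [0, pi), for max variance."""
    candidates = [(math.cos(math.pi * k / steps), math.sin(math.pi * k / steps))
                  for k in range(steps)]
    best = max(candidates, key=lambda d: projected_variance(rows, d))
    return projected_variance(rows, best), best

X = centre(data)
S = covariance(X)
eig_value, eig_vector = leading_eigen(S)
search_value, search_vector = best_direction_by_search(X)

print("leading eigenvalue of covariance matrix:", round(eig_value, 4))
print("largest projected variance from search: ", round(search_value, 4))
```

The two printed numbers should agree up to the resolution of the angle grid, and the two directions should coincide up to sign. Factor analysis, by contrast, fits a model with separate error variances for each variable, which is one reason the two methods should not be treated as interchangeable.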

Discrete methods, including correspondence analysis.

Lecture material

`Finding Needles in Haystacks: Tools for Finding Structure in Large Datasets'
slides for visualization.

`Principal Component Analysis and Factor Analysis'

Background material

All PDF documents.

`SVD, PCA and Metric Scaling' A very mathematical account of the underlying theory.

`Discrete Multivariate Analysis' Correspondence analysis.

Datasets

`Visualization --- Crop Viruses' Data on 61 viruses, data frame virus.

`University League Tables' Datasets Times, ft and tfl.

All these datasets are contained in the file mult.RData. This is an R save file, and you can use load on it, or drag-and-drop it onto an R console window.

Practical

There is an assessed practical on Friday of week 3, and details will be made available at 1pm that day. The practical will consist of a single problem based on this dataset. There will be a pre-practical demonstration of how to use the software on the class of problems, in the lecture room from 12.15 to 12.45 that day. Scripts LTables.R and viruses.R.

Software for use on your own machine

We will be making use of GGobi. The status of that website varies from day to day, but when it is accessible there is a lot of information on it, including a draft book (by Cook & Swayne, see below), which is a lot more usable than the manual.

You can download GGobi from here. Note that you also need GTK for Windows, and can do the download from inside R. The version of GTK on their link is much more than you need: this one suffices, and here is a local copy of the GGobi for Windows installer.

Some of the demos are worth viewing if you have QuickTime installed (the lab machines currently do not), in particular those for tours and brushing part 2.

Here are some notes on how to manipulate the GGvis plugin inside GGobi.

GGobi can be driven from R via package rggobi, which you can install like any other R package. You will also want package DescribeDisplay to print out plots. If you install these from the menus you will get all the dependencies: if doing this manually you need the packages

   DescribeDisplay ggplot RColorBrewer reshape rggobi RGtk2

Note that RGtk2 has lots of small HTML help files and so takes a long time to install.

If you want to use (two variants of) Chernoff faces from R you need package TeachingDemos and its dependency tkrplot.

3D rotations, including of surfaces, are covered by package rgl.

Warning

This software is not as stable as R or the packages you have been using hitherto. Be careful to save your work (image, editor scripts, history) frequently. It seems particularly vulnerable when you shut GGobi windows or close GGobi itself.

Examples

Some examples of driving GGobi from the R package rggobi.

Relevant books

Bartholomew, D. J., Steele, F., Moustaki, I. and Galbraith, J. I. (2002) The Analysis and Interpretation of Multivariate Data for Social Scientists. Chapman & Hall / CRC.

Cook, D. and Swayne, D. F. (2007?) Interactive and Dynamic Graphics for Data Analysis: With Examples Using R and GGobi. Online draft of nearly complete book.

Gower, J. C. and Hand, D. J. (1996) Biplots. Chapman & Hall.

Krzanowski, W. J. (1988) Principles of Multivariate Analysis. OUP.

Ripley, B. D. (1996) Pattern Recognition and Neural Networks. CUP. (Sections 9.1 and 9.2.)

Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Springer. (Sections 11.1, 11.3 and 11.4.)

Visualization

Cleveland, W. (1993) Visualizing Data. Hobart Press.

Wilkinson, L. (1999, 2005) The Grammar of Graphics. Springer.

Unwin, A., Theus, M. and Hofmann, H. (2006) Graphics of Large Datasets: Visualizing a Million. Springer.


Last edited on 18 January 2007 by Prof Brian Ripley (ripley@stats.ox.ac.uk)