Chapter 1 Introduction

The modern world is replete with sources of massively multivariate data, sometimes called ‘big data’. In many cases, the number of variables being measured (\(p\)) exceeds the number of samples available (\(n\)), and in almost all cases the number of possible ways of classifying individuals is greater than \(n\).

Example 1.1

  • There are around 25,000 human genes, which means there are vastly more possible human genomes than humans who have ever existed. Even if a gene is present, whether or not it is expressed depends upon other genes as well as environmental factors. Good genetic data sets might contain a few hundred thousand individuals; the best ones perhaps a million.
    How do we study what effect these genes have on diseases, or on each other’s expression?

  • A doctor has to diagnose one (or more) of hundreds of different possible diseases in a patient presenting a handful out of thousands of possible symptoms, together with a few pieces of information about his medical history. She can perhaps order some tests to provide evidence in favour of one condition or another. How should she decide whether the evidence supports a particular condition?

  • Photographs are typically made up of millions of pixels, each of which can take one of \(256^3 \approx 17\) million colours. How do we train a computer to recognize the object in an image?

The nature of these data sets leads to two related challenges: the first is statistical, and the second computational. Both are features of the so-called curse of dimensionality. The statistical problem is easy to see: suppose I ask 1,000 people 10 questions, each with two possible answers. This gives \(2^{10} = 1024\) possible response patterns, more patterns than people, so it is impossible to observe them all, and in practice most will not be observed even once. How can we sensibly estimate the probability of those missing response patterns in future?
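As a rough illustration of how quickly patterns outrun observations, here is a minimal simulation (not part of the example above): it assumes, purely for illustration, that each question is answered 'yes' independently with probability 0.2, and counts how many of the 1,024 patterns ever appear among 1,000 respondents.

```python
import random
from collections import Counter

random.seed(0)
n, p = 1000, 10                      # 1,000 respondents, 10 binary questions

# Illustrative assumption: answers are independent, 'yes' with probability 0.2.
patterns = Counter(
    tuple(int(random.random() < 0.2) for _ in range(p)) for _ in range(n)
)

total = 2 ** p                       # 1,024 possible response patterns
seen = len(patterns)
print(f"{seen} of {total} patterns observed at least once")
print(f"{total - seen} patterns never observed")
print(f"{sum(1 for c in patterns.values() if c == 1)} patterns observed exactly once")
```

Under this kind of skew most of the possible patterns never appear in the sample, yet a sensible model must still assign probabilities to them.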

The computational problem is related. Suppose now that I know the distribution of outcomes, so I have \(P(X_V = x_V)\) for every \(x_V \in {\cal X}_V\). How can I compute the marginal probability of a particular variable? Well: \[ P(X_i = x_i) = \sum_{x_{V \setminus \{i\}}} P(X_V = x_V). \] But notice that, if \(p=|V|\) is large, say 1,000 binary variables, then this sum runs over \(2^{999} > 10^{300}\) terms! Even for a very fast computer this is completely infeasible, and of course we would not be able to store all \(2^{1000}\) probabilities in the first place.
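To make the blow-up concrete, here is a brute-force version of the sum above for a small number of binary variables (a toy sketch: the joint table is just normalised random numbers). Each additional variable doubles both the table and the work, which is why \(p = 1{,}000\) is hopeless.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
p = 15                                  # small enough to enumerate: 2**15 = 32,768 joint states

# A full joint table P(X_V = x_V) over p binary variables (random, for illustration).
joint = rng.random(2 ** p)
joint /= joint.sum()

def marginal(i, value):
    """P(X_i = value) by summing the joint over all the other variables: 2**p terms."""
    return sum(
        joint[idx]
        for idx, x in enumerate(itertools.product([0, 1], repeat=p))
        if x[i] == value
    )

print(marginal(3, 1))
# Storing the table needs 2**p numbers and one marginal touches all of them;
# at p = 1,000 that would be far more entries than atoms in the observable universe.
```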

Each of these examples—although theoretically massive—has a lot of underlying structure that makes the problem potentially tractable. Particular medical symptoms are closely tied to particular diseases, with probabilities that we understand. Adjacent pixels in photographs are often almost the same; if every pixel were completely different we would never discern an image.

Graphical models provide a convenient way of modelling this structure, and make it computationally feasible to perform calculations over the resulting networks.
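As a preview of the kind of saving that structure buys, suppose (purely for illustration; this is not one of the examples above) that 1,000 binary variables form a Markov chain, so the joint factorizes as \(P(x_1)\prod_{t=2}^{p} P(x_t \mid x_{t-1})\). Then a marginal that naively requires a sum over \(2^{999}\) terms can be computed by passing small sums along the chain:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 1000                                   # 1,000 binary variables X_1, ..., X_p in a chain

# Illustrative assumption: P(x_1, ..., x_p) = P(x_1) * prod_t P(x_t | x_{t-1}).
init = np.array([0.6, 0.4])                # P(X_1)
trans = rng.random((p - 1, 2, 2))
trans /= trans.sum(axis=2, keepdims=True)  # trans[t] is the 2x2 table P(next | current) for one edge

# Marginal of X_p: push the distribution along the chain, one 2x2
# matrix-vector product per edge, so roughly p small sums in total.
dist = init
for t in range(p - 1):
    dist = dist @ trans[t]

print(dist)                                # [P(X_p = 0), P(X_p = 1)]
```

The point is only that exploiting the factorization turns an exponential sum into roughly a thousand small ones; the rest of the book is about when and how such structure can be represented and used.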