Pattern recognition has a long and respectable history within engineering, especially for military applications, but the cost of the hardware, both to acquire the data (signals and images) and to compute the answers, made it for many years a rather specialist subject. Hardware advances have made the concerns of pattern recognition much more widely applicable. In essence it covers the following problem:

"Given some examples of complex signals and the correct decisions for them, make decisions automatically for a stream of future examples."

There are many examples from everyday life:

- Name the species of a flowering plant.
- Grade bacon rashers from a visual image.
- Classify an X-ray image of a tumour as cancerous or benign.
- Decide to buy or sell a stock option.
- Give or refuse credit to a shopper.
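The problem statement above, learning from labelled examples and then deciding automatically for new ones, can be sketched with the simplest of rules: copy the decision of the closest known example. This is only a hypothetical illustration with made-up two-dimensional data, not an account of the methods developed later in the book:

```python
import numpy as np

# Labelled training examples: feature vectors with known decisions.
# (Hypothetical 2-D data, for illustration only.)
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]])
y_train = np.array(["benign", "benign", "cancerous", "cancerous"])

def nearest_neighbour(x):
    """Decide the class of a new example by copying the decision
    of the closest training example (Euclidean distance)."""
    distances = np.linalg.norm(X_train - np.asarray(x), axis=1)
    return y_train[np.argmin(distances)]

# A stream of future examples is then classified automatically.
print(nearest_neighbour([1.1, 0.9]))  # near the first group -> "benign"
print(nearest_neighbour([5.1, 4.9]))  # near the second group -> "cancerous"
```

Even this crude rule captures the essential structure of the problem: past decisions plus a notion of similarity yield automatic future decisions.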

*Neural networks* have arisen from analogies with models of the
way that humans might approach pattern recognition tasks, although
they have developed a long way from the biological roots. Great
claims have been made for these procedures, and although few of
these claims have withstood careful scrutiny, neural network methods
have had great impact on pattern recognition practice. A
theoretical understanding of how they work is still under
construction, and is attempted here by viewing neural networks
within a statistical framework, together with methods developed
in the field of *machine learning*.

One of the aims of this book is to be a reference resource, so almost all the results used are proved (and the remainder are given references to complete proofs). The proofs are often original, short and I believe show insight into why the methods work. Another unusual feature of this book is that the methods are illustrated on examples, and those examples are either real ones or realistic abstractions. Unlike the proofs, the examples are not optional!

The formal prerequisites to follow this book are rather few,
especially if no attempt is made to follow the proofs. A background in
linear algebra is needed, including eigendecompositions. (The singular
value decomposition is used, but explained.) A knowledge of calculus
and its use in finding extrema (such as local minima) is needed, as
well as the simplest notions of asymptotics (Taylor series expansions
and *O(n)* notation). Graph theory is used in Chapter 8, but
developed from scratch. Only a first course in probability and
statistics is assumed, *but* considerable experience in
manipulations will be needed to follow the derivations without writing
out the intermediate steps. The glossary should help readers with
non-technical backgrounds.
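As a small illustration of the level of linear algebra assumed: the singular value decomposition factors any real matrix into orthogonal factors and non-negative singular values. This is a minimal numpy sketch with an arbitrary example matrix, not the book's own exposition:

```python
import numpy as np

# Any real matrix A factors as A = U @ diag(s) @ Vt, where U and Vt
# have orthonormal columns/rows and s holds the singular values.
A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The factorisation reconstructs A up to floating-point error.
print(np.allclose(A, U @ np.diag(s) @ Vt))  # True
```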

A graduate-course knowledge of statistical concepts will be needed to appreciate fully the theoretical developments and proofs. The sections on examples need a much less mathematical background; indeed a good overview of the state of the subject can be obtained by skimming the theoretical sections and concentrating on the examples. The theory and the insights it gives are important in understanding the relative merits of the methods, and it is often very much harder to show that an idea is unsound than to explain the idea.

Several chapters have been used in graduate courses for statisticians and for engineers, computer scientists and physicists. A core of material would be Sections 2.1-2.3, 2.6, 2.7, 3.1, 3.5, 3.6, 4.1, 4.2, 5.1-5.4, 6.1-6.4, 7.1-7.3 and 9.1-9.4, supplemented by material of particular interest to the audience. For example, statisticians should cover 2.4, 2.5, 3.3, 3.4, 5.5 and 5.6, and are likely to be interested in Chapter 8; a fuller view of neural networks in pattern recognition will be gained by adding 3.2, 4.3, 5.5-5.7, 7.6 and 8.4 to the core.