Complements to
`Pattern Recognition and Neural Networks'
by B.D. Ripley
Cambridge University Press, 1996, ISBN 0-521-46086-7
Valentin et al. (1994) survey recent developments in face recognition.
Arbib (1995) provides many short sketches of topics over a very wide range of neural networks, both artificial and biological.
A variant on this idea which has been suggested many times is to add
`noise' to the training set, randomly perturbing either the feature
vectors or the classes (or both). Further along this
line, we could model the joint distribution of the feature vectors and classes
and create new training sets from this distribution. Bagging can be
seen as the rather extreme form of this procedure in which the model
is the empirical distribution. (Krogh & Vedelsby, 1995, use
cross-validation rather than re-sampling, and consider designing
training sets weighted towards areas where the existing classifiers
are prone to disagree.)
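As a concrete illustration of bagging as re-sampling from the empirical distribution, here is a minimal sketch in Python; the use of scikit-learn's DecisionTreeClassifier as the base classifier, the 25 bootstrap replicates and the majority vote are choices of this sketch rather than anything prescribed above.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, B=25, random_state=0):
    # Train B trees, each on a bootstrap sample of (X, y), i.e. a sample
    # of size n drawn with replacement from the empirical distribution.
    rng = np.random.default_rng(random_state)
    n = len(y)
    ensemble = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)
        ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return ensemble

def predict_majority(ensemble, X):
    # Unweighted majority vote over the members (classes coded 0, 1, 2, ...).
    votes = np.array([m.predict(X) for m in ensemble])
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])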
Boosting [Schapire, 1990; Freund, 1990; Drucker et al., 1993; Drucker et al., 1994; Freund, 1995; Freund & Schapire, 1995; Freund & Schapire, 1996a] is altogether more ambitious. The idea is to design a series of training sets and use a combination of classifiers trained on these sets. (Majority voting and linear combinations have both been used.) The training sets are chosen sequentially, with the weights for each example being modified based on the success of the classifier trained on the previous set. (The weight of the examples which were classified incorrectly is increased relative to those which were classified correctly.) For classifiers that do not accept weights on the examples, resampling with probabilities proportional to the weights can be used.
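The following sketch shows the flavour of the weighting scheme, in the style of the discrete AdaBoost algorithm of Freund & Schapire. The particular weight updates, the use of weighted re-sampling for a base learner that does not accept case weights, and the choice of tree stumps as base classifiers are assumptions of the sketch, not details taken from the papers cited above.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, M=20, random_state=0):
    # Discrete AdaBoost-style boosting with weighted re-sampling.
    # y is coded as +1/-1; returns the base classifiers and their coefficients.
    rng = np.random.default_rng(random_state)
    n = len(y)
    w = np.full(n, 1.0 / n)                  # weights on the examples
    members, alphas = [], []
    for _ in range(M):
        # Re-sample with probabilities proportional to the weights, for a
        # base learner that cannot use case weights directly.
        idx = rng.choice(n, size=n, p=w)
        stump = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
        pred = stump.predict(X)
        err = np.sum(w[pred != y])           # weighted training error
        if err >= 0.5 or err == 0:
            break                            # base learner too weak, or perfect
        alpha = 0.5 * np.log((1 - err) / err)
        # Increase the weight of misclassified examples relative to the rest.
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        members.append(stump)
        alphas.append(alpha)
    return members, np.array(alphas)

def predict_boosted(members, alphas, X):
    # Sign of the linear combination of the members' votes.
    F = sum(a * m.predict(X) for a, m in zip(alphas, members))
    return np.sign(F)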
There have been a number of practical tests of boosting, including Drucker (1996); Drucker & Cortes (1996); Freund & Schapire (1996b); Breiman (1996c) and Quinlan (1996a). Freund & Schapire, Breiman and Quinlan each compare bagging and boosting for classification trees (although only Quinlan used weighting rather than weighted re-sampling for boosting), and all find that boosting usually out-performs bagging but can fail so badly as to make the boosted classifier worse than the original.
The original motivation for boosting was to produce a combined classifier that could fit the training set perfectly by concentrating on the regions of the feature space which the current classifier found hard. As both Breiman and Quinlan point out, this is a very different aim from producing a classifier with good generalization in problems with overlapping classes, a problem which requires boosting beyond the point needed to fit the training set perfectly. It seems likely that further study of boosting will lead to algorithms designed to reduce the generalization error.
Cesa-Bianchi et al. (1993); Cesa-Bianchi et al. (1996) consider how to learn to do as well as the best of a class of classifiers over a sequence of examples, and derive bounds on the excess errors made whilst learning which is best. This is a different goal from the usual one in combining classifiers, which is to do better than even the best classifier by exploiting the strengths of all. The algorithm used is boosting-like, maintaining weights for each expert rather than each example.
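For contrast with the per-example weights of boosting, here is a minimal sketch of an exponentially-weighted voting scheme of the kind studied in this literature, keeping one weight per expert and shrinking it whenever that expert errs. The learning rate and the exact update are illustrative assumptions, not the algorithm of Cesa-Bianchi et al.

import numpy as np

def weighted_majority(expert_preds, truth, beta=0.5):
    # expert_preds: array of shape (T, K) of 0/1 predictions from K experts
    # over T rounds; truth: length-T array of 0/1 labels.
    # Returns the combined per-round predictions and the final expert weights.
    T, K = expert_preds.shape
    w = np.ones(K)
    combined = np.empty(T, dtype=int)
    for t in range(T):
        # Weighted vote over the experts' predictions for this round.
        vote_for_1 = w[expert_preds[t] == 1].sum()
        combined[t] = int(vote_for_1 >= 0.5 * w.sum())
        # Multiplicatively penalise every expert that was wrong.
        wrong = expert_preds[t] != truth[t]
        w[wrong] *= beta
    return combined, w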
In (2.50) the factor 1/8 in the exponent can be removed. Parrondo & Van der Broeck (1993) give
and Vapnik (1995, pp. 66, 85) quotes the constant 4 as in (2.50), although proofs are deferred to the forthcoming Vapnik (1996).
The improvements of Parrondo & Van der Broeck rely on
This follows from the binomial inequality
for each choice of sign, and so, conditionally on the first sample
Inequality (1) says that for a binomial distribution the median lies within one of the mean; it seems well-known to combinatorists, none of whom was able to give me a reference.
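For reference, one precise statement of this fact (standard, and quoted here for convenience rather than taken from the sources above) is: if $X \sim \mathrm{Bin}(n,p)$, then any median $m$ of $X$ satisfies
$$ \lfloor np \rfloor \;\le\; m \;\le\; \lceil np \rceil, \qquad\text{so that}\qquad |m - np| \le 1. $$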
The bound for proposition 2.5 may be improved to
by the same ideas. (The statement in
Parrondo & Van der Broeck, 1993, does not follow from their
proof and appears to be wrong.)
Prior to Minsky & Papert, Muroga et al. (1961) had shown that the
weights in a linearly-separating perceptron with n binary inputs
could be chosen to be integers less than , and Muroga (1965) showed that there were problems where
integer weights of
are required. Hampson & Volper (1986) showed
that in some sense an average problem needs integer weights of
. More recently Håstad found an
example which needs integer weights at least as large as
. These results are reviewed and proved by
Parberry (1994).
Other learning rules are less restricted. We saw that Mansfield's
method makes at most mistakes before finding a linearly
separating solution, a bound reduced to
by
Maass & Turán (1994) by using a more recent method in convex optimization.
(They also show that at least
mistakes
are necessary for any method on some problems with p binary inputs.)
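The flavour of such mistake bounds can be seen from the classical perceptron rule, sketched below for comparison; this is neither Mansfield's method nor the convex-optimization approach of Maass & Turán, it simply counts the updates made before a pass through the data with no errors.

import numpy as np

def perceptron_mistakes(X, y, max_passes=100):
    # Classical perceptron on (+1/-1)-coded labels.  Returns the weights,
    # the intercept, and the number of mistakes (updates) made before a
    # pass with no errors, or after max_passes passes over the data.
    n, p = X.shape
    w = np.zeros(p)
    b = 0.0
    mistakes = 0
    for _ in range(max_passes):
        clean_pass = True
        for i in range(n):
            if y[i] * (X[i] @ w + b) <= 0:   # a mistake: update the weights
                w += y[i] * X[i]
                b += y[i]
                mistakes += 1
                clean_pass = False
        if clean_pass:
            break
    return w, b, mistakes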
The advantage of choosing the optimal hyperplane is to reduce the
VC-dimension of the space of solutions (which is proportional to
a bound on ). If the two classes are linearly
separable then [Vapnik, 1995, Theorem 5.2] the expected error
rate on future examples is bounded by the expected number of
support vectors divided by n-1. Thus finding a small number of support
vectors might indicate good generalization properties.
Of course, linear separation in the original feature space is quite rare, but as for generalized linear discrimination (page 121) we can expand the feature space by using polynomials or even radial-basis function networks and sigmoidal functions. These can give rise to very large feature spaces, but generalization may remain acceptable if the number of support vectors remains small, which was the case in the experiments reported by Vapnik (1995, section 5.7).
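As a small runnable illustration of these two paragraphs (the synthetic data, the polynomial kernel, the large value of C used to approximate the hard-margin case, and scikit-learn's SVC are all assumptions of this sketch), the code below fits a maximal-margin classifier in an expanded feature space and reports the number of support vectors together with the quantity (number of support vectors)/(n-1) appearing in the bound above.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # classes separable only after expansion

# A polynomial kernel expands the feature space implicitly; a large C
# approximates the hard-margin (optimal separating hyperplane) case.
clf = SVC(kernel='poly', degree=2, coef0=1.0, C=1e3).fit(X, y)

n_sv = clf.support_vectors_.shape[0]
n = len(y)
print('support vectors:', n_sv)
print('bound (number of support vectors)/(n - 1):', n_sv / (n - 1))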
By jointly minimizing the sum of the degree of error (page 116) and
these ideas can be extended to non-separable two-class
problems [Cortes & Vapnik, 1995].
For sigmoidal neural networks,
Karpinski & Macintyre (1995a); Karpinski & Macintyre (1995b) showed that the
VC-dimension is
, and Koiran & Sontag (1996) showed
that
. Bartlett & Williamson (1996) bound the
VC-dimension for a single-hidden-layer network by
if the inputs are restricted to
.
Friedman (1994) and
Hastie & Tibshirani (1996a); Hastie & Tibshirani (1996b) look for a local
metric. Friedman works in the spirit of
Short & Fukunaga (1980); Short & Fukunaga (1981), using a local measure of
relevance and choosing a hyper-rectangle (corresponding to an
distance with a diagonal weighting matrix) by recursive
partitioning on this relevance measure. The proposal of
Hastie & Tibshirani is much simpler; they use Euclidean
distance on the linear discriminants for points within a small
neighbourhood (defined by this metric and so chosen recursively).
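A single step of the simpler proposal might be sketched as follows (the neighbourhood sizes, the use of scikit-learn's LinearDiscriminantAnalysis, and the omission of the recursive re-choice of the neighbourhood are simplifications and assumptions of this sketch): estimate the linear discriminants from a Euclidean neighbourhood of the query point, and then apply k-nearest neighbours with distances measured in the discriminant coordinates.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def local_discriminant_knn(X_train, y_train, x0, n_local=50, k=5):
    # Classify a single query point x0 by k-NN measured in discriminant
    # coordinates estimated from a Euclidean neighbourhood of x0.
    # (Assumes integer-coded classes and that the local sample contains
    # at least two classes.)
    d = np.linalg.norm(X_train - x0, axis=1)
    local = np.argsort(d)[:n_local]
    lda = LinearDiscriminantAnalysis().fit(X_train[local], y_train[local])
    Z = lda.transform(X_train[local])
    z0 = lda.transform(x0[None, :])
    nn = np.argsort(np.linalg.norm(Z - z0, axis=1))[:k]
    return np.bincount(y_train[local][nn]).argmax()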
This has led to discussion of algorithms to find `efficient' and `optimal' multi-way splits of continuous attributes [Fayyad & Irani, 1993; Fulton et al., 1995; Elomaa & Rousu, 1996]. The idea is to partition a continuous attribute into K > 2 disjoint intervals, and have K daughter nodes, in such a way as to maximize the reduction in impurity. (Elomaa & Rousu, 1996, point out errors and inefficiencies in the earlier algorithms.)
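To make the optimization concrete, here is a brute-force sketch (deliberately not one of the `efficient' algorithms cited above) that examines every way of cutting a sorted continuous attribute into K intervals and returns the set of thresholds minimizing the weighted entropy of the daughter nodes.

import numpy as np
from itertools import combinations

def entropy(labels):
    # Entropy (in bits) of a vector of integer-coded class labels.
    counts = np.bincount(labels)
    p = counts[counts > 0] / len(labels)
    return -(p * np.log2(p)).sum()

def best_k_way_split(x, y, K=3):
    # Brute-force search over all K-interval splits of the continuous
    # attribute x, minimizing the weighted entropy of the daughters.
    order = np.argsort(x)
    x, y = x[order], y[order]
    # Candidate cut positions: between consecutive distinct values of x.
    cuts = [i for i in range(1, len(x)) if x[i] != x[i - 1]]
    best_imp, best_thresholds = np.inf, None
    for boundary in combinations(cuts, K - 1):
        parts = np.split(y, list(boundary))
        imp = sum(len(part) / len(y) * entropy(part) for part in parts)
        if imp < best_imp:
            best_imp = imp
            best_thresholds = [(x[b - 1] + x[b]) / 2 for b in boundary]
    return best_thresholds, best_imp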
Quinlan (1996b) disputes whether this is the actual cause of the
increased performance, and proposes amendments to C4.5 which in
his experiments regain its advantage over discretization. There are
several small changes, but the major effect is to introduce a bias
against continuous attributes, since the wide choice of threshold t
otherwise gives a selection bias in their favour. (If there are
N values of X in the training set, the information gain of
is reduced by
.)
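A sketch of such a penalised gain is given below. The exact form of the reduction is not reproduced above; the penalty used here, log2(N - 1)/n for an attribute with N distinct values among n training cases, is my reading of Quinlan (1996b) and should be treated as an assumption of this sketch.

import numpy as np

def penalised_gain(x, y, t):
    # Information gain of the test x <= t on integer-coded classes y,
    # reduced by an assumed MDL-style penalty log2(N - 1)/n, where N is
    # the number of distinct values of x in the training set.
    def entropy(labels):
        p = np.bincount(labels) / len(labels)
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    n = len(y)
    parts = [y[x <= t], y[x > t]]
    cond = sum(len(part) / n * entropy(part) for part in parts if len(part) > 0)
    gain = entropy(y) - cond
    N = len(np.unique(x))                 # assumes N >= 2 distinct values
    return gain - np.log2(N - 1) / n      # penalty term: an assumption here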