`Pattern Recognition and Neural Networks'
by B.D. Ripley
Cambridge University Press, 1996, ISBN 0-521-46086-7
Valentin et al. (1994) survey recent developments in face recognition.
Arbib (1995) provides many short sketches of topics over a very wide range of neural networks, both artificial and biological.
A variant on this idea which has been suggested many times is to add `noise' to the training set, randomly perturbing either the feature vectors or the classes (or both). Taking this further, we could model the joint distribution of the feature vectors and classes and create new training sets from this distribution. Bagging can be seen as the extreme form of this procedure in which the model is the empirical distribution. (Krogh \& Vedelsby, 1995, use cross-validation rather than re-sampling, and consider designing training sets weighted towards areas where the existing classifiers are prone to disagree.)
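For concreteness, the two resampling schemes just mentioned can be sketched in a few lines. The function names and the Gaussian noise model are illustrative assumptions, not taken from the text:

```python
import random

def noisy_training_sets(X, y, n_sets, sigma=0.1, seed=0):
    """Perturbation variant: add Gaussian noise to each feature
    vector, leaving the class labels unchanged."""
    rng = random.Random(seed)
    sets = []
    for _ in range(n_sets):
        Xp = [[x + rng.gauss(0.0, sigma) for x in row] for row in X]
        sets.append((Xp, list(y)))
    return sets

def bootstrap_training_sets(X, y, n_sets, seed=0):
    """Bagging's extreme case: resample (x, y) pairs with
    replacement from the empirical distribution of the training set."""
    rng = random.Random(seed)
    n = len(X)
    sets = []
    for _ in range(n_sets):
        idx = [rng.randrange(n) for _ in range(n)]
        sets.append(([X[i] for i in idx], [y[i] for i in idx]))
    return sets
```

Perturbing the classes instead of (or as well as) the features would follow the same pattern, flipping each label with some small probability.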
Boosting [Schapire, 1990; Freund, 1990; Drucker et al., 1993; Drucker et al., 1994; Freund, 1995; Freund \& Schapire, 1995; Freund \& Schapire, 1996a] is altogether more ambitious. The idea is to design a series of training sets and use a combination of classifiers trained on these sets. (Majority voting and linear combinations have both been used.) The training sets are chosen sequentially, with the weights for each example being modified based on the success of the classifier trained on the previous set. (The weight of the examples which were classified incorrectly is increased relative to those which were classified correctly.) For classifiers that do not accept weights on the examples, resampling with probabilities proportional to the weights can be used.
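The reweighting step just described can be sketched as follows. This follows the multiplicative AdaBoost.M1-style update, but the function name and normalization details should be read as an illustration rather than any one author's exact algorithm:

```python
def boost_update(weights, misclassified):
    """One boosting-style reweighting step: examples the current
    classifier got wrong have their weight increased relative to
    the rest (equivalently, correct examples are shrunk), then
    the weights are renormalized to sum to one."""
    err = sum(w for w, m in zip(weights, misclassified) if m)
    assert 0.0 < err < 0.5, "the weak learner must beat chance"
    beta = err / (1.0 - err)  # < 1, so correctly classified examples shrink
    new = [w * (beta if not m else 1.0)
           for w, m in zip(weights, misclassified)]
    total = sum(new)
    return [w / total for w in new]
```

For classifiers that cannot use example weights, one would instead resample the training set with probabilities proportional to the returned weights, as the text notes.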
There have been a number of practical tests of boosting, including Drucker (1996); Drucker \& Cortes (1996); Freund \& Schapire (1996b); Breiman (1996c) and Quinlan (1996a). Freund \& Schapire, Breiman and Quinlan each compare bagging and boosting for classification trees (although only Quinlan used weighting rather than weighted re-sampling for boosting), and all find that boosting usually out-performs bagging but can fail so badly as to make the boosted classifier worse than the original.
The original motivation for boosting was to produce a combined classifier that could fit the training set perfectly by concentrating on the regions of the feature space which the current classifier found hard. As both Breiman and Quinlan point out, this is a very different aim from producing a classifier with good generalization in problems with overlapping classes, which requires boosting beyond the point needed to fit the training set perfectly. It seems likely that greater study of boosting will lead to algorithms designed to improve generalization.
Cesa-Bianchi et al. (1993); Cesa-Bianchi et al. (1996) consider how to learn to do as well as the best of a class of classifiers over a sequence of examples, and derive bounds for the excess errors made whilst learning which is best. This is a different goal from the usual one in combining classifiers, which is to do better than even the best classifier by exploiting the strengths of all. The algorithm used is boosting-like, maintaining weights for each expert rather than each example.
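The flavour of such expert-weighting schemes can be sketched with a weighted-majority rule: each expert carries a weight, the combined prediction is a weighted vote, and any expert that errs has its weight multiplied by a factor beta. The factor and function name here are illustrative choices, not the exact algorithm of Cesa-Bianchi et al.:

```python
def weighted_majority(predictions, outcomes, beta=0.5):
    """Weighted-majority sketch.  predictions[t][i] is expert i's
    0/1 prediction at time t; outcomes[t] is the true 0/1 label.
    Returns the final expert weights and the number of mistakes
    made by the combined (weighted-vote) predictor."""
    n_experts = len(predictions[0])
    w = [1.0] * n_experts
    mistakes = 0
    for preds, y in zip(predictions, outcomes):
        vote_for_one = sum(wi for wi, p in zip(w, preds) if p == 1)
        guess = 1 if vote_for_one >= sum(w) / 2 else 0
        if guess != y:
            mistakes += 1
        # penalize every expert that was wrong on this example
        w = [wi * (beta if p != y else 1.0) for wi, p in zip(w, preds)]
    return w, mistakes
```

The weights of persistently poor experts decay geometrically, so the combined predictor's mistakes can be bounded in terms of those of the best single expert.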
In (2.50) the factor 1/8 in the exponent can be removed. Parrondo \& Van den Broeck (1993) give
and Vapnik (1995, pp. 66, 85) quotes the constant 4 as in (2.50), although proofs are deferred to the forthcoming Vapnik (1996).
The improvements of Parrondo \& Van den Broeck rely on
This follows from the binomial inequality
for each choice of sign, and so, conditionally on the first sample
Inequality (1) says that for a binomial distribution the median lies within one of the mean; it seems well known to combinatorialists, none of whom was able to give me a reference.
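The claim is easy to spot-check numerically. A brute-force sketch, computing the (lower) median of the binomial distribution from its exact probabilities:

```python
from math import comb

def binom_median(n, p):
    """Smallest m with P(X <= m) >= 1/2 for X ~ Binomial(n, p)."""
    cdf = 0.0
    for k in range(n + 1):
        cdf += comb(n, k) * p**k * (1 - p)**(n - k)
        if cdf >= 0.5:
            return k

# spot-check: the median lies within one of the mean np
for n in range(1, 40):
    for p in (0.1, 0.3, 0.5, 0.7, 0.9):
        assert abs(binom_median(n, p) - n * p) <= 1
```

This is of course a numerical check over a finite range, not a proof; the general fact is that the binomial median always lies between the floor and ceiling of np.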
The bound for Proposition 2.5 may be improved by the same ideas. (The statement in Parrondo \& Van den Broeck, 1993, does not follow from their proof and appears to be wrong.)
Prior to Minsky \& Papert, Muroga et al. (1961) had shown that the weights in a linearly-separating perceptron with n binary inputs could be chosen to be integers less than , and Muroga (1965) showed that there are problems where integer weights of are required. Hampson \& Volper (1986) showed that in some sense an average problem needs integer weights of . More recently Håstad found an example which needs integer weights at least as large as . These results are reviewed and proved by Parberry (1994).
Other learning rules are less restricted. We saw that Mansfield's method makes at most mistakes before finding a linearly separating solution, a bound reduced by Maass \& Turán (1994), who use a more recent method in convex optimization. (They also show that every method makes at least mistakes on some problems with p binary inputs.)
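For concreteness, the mistake count of the classical perceptron rule can be tallied directly. This is a sketch of the basic rule only (unit learning rate, bias folded in as a constant feature), not Mansfield's method or Maass \& Turán's convex-optimization approach:

```python
def perceptron_mistakes(X, y, max_epochs=1000):
    """Run the classical perceptron rule (labels y in {-1, +1})
    until the data are separated or max_epochs is reached, and
    count the mistakes (= weight updates) made along the way."""
    w = [0.0] * (len(X[0]) + 1)          # last entry is the bias
    mistakes = 0
    for _ in range(max_epochs):
        clean_pass = True
        for x, t in zip(X, y):
            xa = list(x) + [1.0]          # append constant input for bias
            s = sum(wi * xi for wi, xi in zip(w, xa))
            if t * s <= 0:                # misclassified (or on the boundary)
                w = [wi + t * xi for wi, xi in zip(w, xa)]
                mistakes += 1
                clean_pass = False
        if clean_pass:                    # a full pass with no errors
            return w, mistakes
    return w, mistakes
```

On linearly separable data the classical convergence theorem guarantees the mistake count is finite; the bounds discussed above concern how large it can be forced to be.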
The advantage of choosing the optimal hyperplane is that it reduces the VC-dimension of the space of solutions (which is proportional to a bound on ). If the two classes are linearly separable then [Vapnik, 1995, Theorem 5.2] the expected error rate on future examples is bounded by the expected number of support vectors divided by n-1. Thus finding a small number of support vectors may indicate good generalization properties.
Of course, linear separation in the original feature space is quite rare, but as for generalized linear discrimination (page 121) we can expand the feature space by using polynomials or even radial-basis function networks and sigmoidal functions. These can give rise to very large feature spaces, but generalization may remain acceptable if the number of support vectors remains small, which was the case in the experiments reported by Vapnik (1995, Section 5.7).
These ideas can be extended to non-separable two-class problems by jointly minimizing the sum of the degrees of error (page 116) and the squared norm of the weight vector [Cortes \& Vapnik, 1995].
For sigmoidal neural networks, Karpinski \& Macintyre (1995a); Karpinski \& Macintyre (1995b) showed that the VC-dimension is , and Koiran \& Sontag (1996) showed that . Bartlett \& Williamson (1996) bound the VC-dimension for a single-hidden-layer network by if the inputs are restricted to .
Friedman (1994) and Hastie \& Tibshirani (1996a); Hastie \& Tibshirani (1996b) look for a local metric. Friedman works in the spirit of Short \& Fukunaga (1980); Short \& Fukunaga (1981), using a local measure of relevance and choosing a hyper-rectangle (corresponding to a distance with a diagonal weighting matrix) by recursive partitioning on this relevance measure. The proposal of Hastie \& Tibshirani is much simpler; they use Euclidean distance on the linear discriminants for points within a small neighbourhood (defined by this metric and so chosen recursively).
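The flavour of such weighted local metrics can be sketched with a 1-nearest-neighbour rule under a diagonally weighted distance. Here the weight vector is fixed and purely illustrative, whereas the methods above choose it adaptively around each query point:

```python
def weighted_nn(query, X, y, w):
    """1-nearest-neighbour classification under a diagonally
    weighted squared distance: d(a, b) = sum_j w[j] * (a[j]-b[j])**2.
    Setting w[j] small effectively ignores feature j locally."""
    def dist(a, b):
        return sum(wj * (aj - bj) ** 2 for wj, aj, bj in zip(w, a, b))
    best = min(range(len(X)), key=lambda i: dist(query, X[i]))
    return y[best]
```

With all weights equal this is ordinary Euclidean nearest neighbour; a locally adapted w stretches the neighbourhood along directions judged irrelevant near the query.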
This has led to discussion of algorithms to find `efficient' and `optimal' multi-way splits of continuous attributes [Fayyad \& Irani, 1993; Fulton et al., 1995; Elomaa \& Rousu, 1996]. The idea is to partition a continuous attribute into K > 2 disjoint intervals, and have K daughter nodes, in such a way as to maximize the reduction in impurity. (Elomaa \& Rousu, 1996, point out errors and inefficiencies in the earlier algorithms.)
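The quantity these algorithms maximize can be sketched directly, here with entropy as the impurity measure (function names are illustrative; the cited papers are about finding the optimal cut points efficiently, not about evaluating a given split):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of the class distribution of a node."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def split_gain(values, labels, cuts):
    """Impurity reduction from partitioning a continuous attribute
    into K = len(cuts) + 1 intervals at the given sorted cut points:
    parent entropy minus the size-weighted entropy of the daughters."""
    parts = [[] for _ in range(len(cuts) + 1)]
    for v, lab in zip(values, labels):
        k = sum(v > c for c in cuts)     # index of the interval v falls in
        parts[k].append(lab)
    n = len(labels)
    children = sum(len(p) / n * entropy(p) for p in parts if p)
    return entropy(labels) - children
```

An `optimal' K-way split is then a choice of cut points maximizing this gain, and the cited algorithms aim to search the (large) space of cut-point sets efficiently.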
Quinlan (1996b) disputes whether this is the actual cause of the increased performance, and proposes amendments to C4.5 which in his experiments regain its advantage over discretization. There are several small changes, but the major effect is to introduce a penalty for continuous attributes, since the wide choice of threshold t otherwise gives a selection bias in their favour. (If there are N values of X in the training set, the information gain of a split on X is reduced by a penalty that grows with N.)