Complements to

*`Pattern Recognition and Neural Networks'*

by B.D. Ripley

Cambridge University Press, 1996, ISBN 0-521-46086-7

These complements provide further details, and references which appeared (or came to my attention) after the book was completed in June 1995. Minor corrections can be found in the Errata list.

**Page 4**- The book by Przytula \& Prasanna (1993) discusses in detail
the parallel implementation of neural networks.
**Page 16**- Langley (1996) provides a book-length
introduction to one viewpoint on machine learning.
Langley \& Simon (1995) and Bratko \& Muggleton (1995) discuss
applications of machine learning with claimed real-world benefits.
Valentin

*et al.*(1994) survey recent developments in face recognition.Arbib (1995) provides many short sketches of topics over a very wide range of neural networks, both artificial and biological.

**Page 65-66**-
There has been a lot of interest in combining classifiers produced by the
*same*method on*different*training sets.*Bagging*[Breiman,1994; Breiman,1996a] is an abbreviation for `bootstrap aggregating'; the proposal is to take an unweighted average of the predictions of, say 100, classifiers trained on training sets formed by resampling with replacement from the original training set. Breiman (1996b) motivates this for*unstable*methods such as classification trees in which a small change in the training set can lead to a large change in the classifier. (`Plug-in' neural network fitting with multiple local minima may also be unstable.)A variant on this idea which has been suggested many times is to add `noise' to the training set, randomly perturbing either the feature vectors or the classes (or both). Further along this line, we could model the joint distribution of and create new training sets from this distribution. Bagging can be seen as the rather extreme form of this procedure in which the model is the empirical distribution. (Krogh \& Vedelsby,1995, use cross-validation rather than re-sampling, and consider designing training sets weighted towards areas where the existing classifiers are prone to disagree.)

*Boosting*[Schapire,1990; Freund,1990; Drucker*et al.*,1993; Drucker*et al.*,1994; Freund,1995; Freund \& Schapire,1995; Freund \& Schapire,1996a] is altogether more ambitious. The idea is to*design*a series of training sets and use a combination of classifiers trained on these sets. (Majority voting and linear combinations have both been used.) The training sets are chosen sequentially, with the weights for each example being modified based on the success of the classifier trained on the previous set. (The weight of the examples which were classified incorrectly is increased relative to those which were classified correctly.) For classifiers that do not accept weights on the examples, resampling with probabilities proportional to the weights can be used.There have been a number of practical tests of boosting, including Drucker (1996);Drucker \& Cortes (1996);Freund \& Schapire (1996b);Breiman (1996c) and Quinlan (1996a). Each of Freund \& Schapire, Breiman and Quinlan compare bagging and boosting for classification trees (although only Quinlan used weighting rather than weighted re-sampling for boosting), and all find that boosting usually out-performs bagging but can fail so badly as to make the boosted classifier worse that the original.

The original motivation for boosting was to produce a combined classifier that could fit the training set perfectly by concentrating on the regions of the feature space which the current classifier found hard. As both Breiman and Quinlan point out, this is a very different aim from producing a classifier with good generalization in problems with overlapping classes, a problem which requires boosting beyond the point needed to fit the training set perfectly. It seems likely that greater study of boosting will lead to algorithms designed to boost generalization error.

Cesa-Bianchi

*et al.*(1993);Cesa-Bianchi*et al.*(1996) consider how to learn how to do as well as the best of a class of classifiers over a sequence of examples, and derive bounds for the excess errors made whilst learning which is best. This is a different goal from the usual one in combining classifiers, which is to do better than even the best classifier by exploiting the strengths of all. The algorithm used is boosting-like, maintaining weights for each expert rather than each example. **Page 80**-
Chernoff's bound follows from the results of that paper, but in this
precise form is due to Angluin \& Valiant (1979).
**Page 82**-
The constants in these results can be refined slightly by
using probability inequalities which are specific to binomial
distributions. (One difficulty is that the references, like
Anthony \& Shawe-Taylor,1993, omit the proofs of these
inequalities, which I find far subtler than the ideas which
*are*proved. The improvements were omitted because of the lack of accessible proofs to which to refer the reader.)In (2.50) the factor 1/8 in the exponent can be removed. Parrondo \& Van der Broeck (1993) give

and Vapnik(1995, pp. 66, 85) quotes the constant 4 as in (2.50), although proofs are deferred to the forthcoming Vapnik (1996).

The improvements of Parrondo \& Van der Broeck rely on

- [(i)]
- Replacing (2.55) by
This follows from the binomial inequality

for each choice of sign, and so, conditionally on the first sample

Inequality (1) says that for a binomial distribution the median lies within one of the mean; it seems well-known to combinatorists, none of whom was able to give me a reference.

- replacing the use of Hoeffding's inequality at the top of page 87
by an inequality of Vapnik (1982):

The bound for proposition 2.5 may be improved to by the same ideas. (The statement in Parrondo \& Van der Broeck,1993, does not follow from their proof and appears to be wrong.)

- Replacing (2.55) by

**Page 116**- Analytical evidence that optimizing the number of errors
made by a perceptron is hard is provided by Höffgen
*et al.*(1995) who showed that the problem of determining if there is a solution with at most misclassifications is NP-hard. (For*k*= 0 it is reducible to a linear programming program as shown on this page, so solvable in polynomial time.) **Page 118**- Minsky & Papert showed that the weights needed in a
perceptron could be large for some realistic problems. As for
binary inputs the perceptron learning algorithm only changes
one coordinate of by unity at each step, large weights
entail slow convergence.
Prior to Minsky & Papert, Muroga

*et al.*(1961) had shown that the weights in a linearly-separating perceptron with*n*binary inputs could be chosen to be integers less than , and Muroga (1965) showed that there were problems where integer weights of are required. Hampson \& Volper (1986) showed that in some sense an average problem needs integer weights of . More recently Håstad found an example which needs integer weights at least as large as . These results are reviewed and proved by Parberry (1994).Other learning rules are less restricted. We saw that Mansfield's method makes at most mistakes before finding a linearly separating solution, a bound reduced to by Maass \& Turán (1994) by using a more recent method in convex optimization. (They also show that every method makes at least mistakes are necessary on some problems with

*p*binary inputs.). **Page 120**-
**Support-vector machines**[Cortes \& Vapnik,1995; Vapnik,1995]. If two classes are linearly separable, there will be a continuum of weight vectors which give rise to separating hyperplanes. Amongst these we can choose a hyperplane with maximal distance to the nearest example, achieved by minimizing whilst insisting that . Finding this hyperplane is a quadratic programming problem, and the usual Kuhn-Tucker optimality conditions show that there will be a subset of examples (known as*support vectors*) for which and that the optimal is a linear combination of these .The advantage is choosing the optimal hyperplane is to reduce the VC-dimension of the space of solutions (which is proportional to a bound on ). If the two classes

*are*linearly separable then [Vapnik1995, Theorem 5.2] the expected error rate on future examples is bounded by the expected number of support vectors divided by*n*-1. Thus finding a small number of support vectors might indicate good generalization properties.Of course, linear separation in the original feature space is quite rare, but as for

*generalized linear discrimination*(page 121) we can expand the feature space by using polynomials or even radial-basis function networks and sigmoidal functions. These can give rise to very large feature spaces, but generalization may remain acceptable if the number of support vectors remains small, which was the case in the experiments reported by Vapnik(1995, section 5.7).By jointly minimizing the sum of the degree of error (page 116) and these ideas can be extended to non-separable two-class problems [Cortes \& Vapnik,1995].

**Page 157**- According to Werbos (1995), the weight-decay penalty
was also proposed by Werbos (1987).
**Page 165**- We saw that a local minimum which fits well is not important
in the Bayesian integration if it corresponds to a sharp peak of the
posterior density of the weights. Hochreiter \& Schmidhuber (1995);Hochreiter \& Schmidhuber (1996) add a penalty to the optimization to
encourage exploring broad peaks. However, they still use local
optimization, and their penalty is better regarded as an elaborate form
of weight decay.
**Page 172**- Marchand
*et al.*(1990) had another early construction algorithm. **Page 179**- This VC-dimension result for threshold-unit neural networks
was anticipated by Cover (1968).
**Page 180**- The description of the and
results fails to make clear what is allowed to vary; these results apply
to networks where both the number of input units and the number of
hidden units are allowed to increase. In that case Sakurai (1993) has
results for networks with one hidden layer and
real inputs (whereas Maass considered binary inputs).
For sigmoidal neural networks, Karpinski \& Macintyre (1995a);Karpinski \& Macintyre (1995b) showed that the VC-dimension is , and Koiran \& Sontag (1996) showed that . Bartlett \& Williamson (1996) bound the VC-dimension for a single-hidden-layer network by if the inputs are restricted to .

**Page 197-198**- Lowe (1995) proposes a distance-weighted
nearest-neighbour method with a Gaussian kernel ( )
where is proportional to the average distance to the, say,
first 5 neighbours, and is weighted by weights for each
feature. (Thus a quadratic metric with diagonal
*A*is used.) Both the constant of proportionality and the metric are chosen by (leave-one-out) cross-validation. Like that of Fukunaga \& Flick (1984) this is a global method, but Lowe mentions that the feature weights could be chosen locally, as was done by Atkeson (1991) in a control problem. The restriction to a diagonal metric places the pre-processing of the features at a premium.Friedman (1994) and Hastie \& Tibshirani (1996a);Hastie \& Tibshirani (1996b) look for a local metric. Friedman works in the spirit of Short \& Fukunaga (1980);Short \& Fukunaga (1981), using a local measure of relevance and choosing a hyper-rectangle (corresponding to an distance with a diagonal weighting matrix) by recursive partitioning on this relevance measure. The proposal of Hastie \& Tibshirani is much simpler; they use Euclidean distance on the linear discriminants for points within a small neighbourhood (defined by this metric and so chosen recursively).

**Page 200-201**- Further references on `memory-based reasoning' are
Waltz (1990);Aha
*et al.*(1991);Cost \& Salzberg (1993) and Rachlin*et al.*(1994).

**Page 236-7**- Debate has continued in the machine-learning community on how
to choose amongst attributes at a split. Buntine \& Niblett (1992)
reviewed approaches at that time. More recently, some authors have
spotted [Catlett,1991; Auer
*et al.*,1995; Dougherty*et al.*,1995] that discretizing a continuous attribute can improve the performance of a classification tree induced by Quinlan's*C4.5*and rarely makes the performance worse. This might arise since for a continuous attribute*X*the only splits allowed are of the form , whereas for a nominal attribute*any*split of the values into two groups is allowed, and this can result in splitting a discretized continuous attribute into a series of intervals and their complement. Indeed, Holte (1993) and Auer*et al.*(1995) have shown that very shallow trees (one level or two levels respectively) can be rather effective on many common problems if splits of this sort are allowed.This has led to discussion of algorithms to find `efficient' and `optimal' multi-way splits of continuous attributes [Fayyad \& Irani,1993; Fulton

*et al.*,1995; Elomaa \& Rousu,1996]. The idea is to partition a continuous attribute into*K*;*SPMgt*; 2 disjoint intervals, and have*K*daughter nodes, in such a way as to maximize the reduction in impurity. (Elomaa \& Rousu,1996 point out errors and inefficiencies in the earlier algorithms.)Quinlan (1996b) disputes whether this is the actual cause of the increased performance, and proposes amendments to

*C4.5*which in his experiments regain its advantage over discretization. There are several small changes, but the major effect is to introduce a bias against continuous attributes, as the wide choice of threshold*t*gives a selection bias in favour of continuous attributes. (If there are*N*values of*X*in the training set, the information gain of is reduced by .) **Page 240**- There has been renewed interest in taking linear
combinations of variables at nodes, sometimes known as
*perceptron*,*neural*,*multivariate*or*oblique*trees: Utgoff (1989);Heath*et al.*(1993);Murthy*et al.*(1993);Murthy*et al.*(1994);Brodley \& Utgoff (1995). **Page 241**- Craven \& Shavlik (1996) consider using trees to `explain'
the classifier modelled by a neural network, that is approximates
a neural network by a classification tree, trained by querying the
neural network as an oracle. (This is related to Angluin's ideas on page 7.)
**Page 242**- There has been further work on averaging multiple decision trees.
One idea is to replace the pruning of a tree by a weighted average
over all prunings. Although there are very many such prunings, averaging
a prediction over prunings amounts to a weighted average over potential
terminal nodes (usually all nodes) and so is feasible; see
Willems
*et al.*(1993);Willems*et al.*(1995);Oliver \& Hand (1995);Helmbold \& Schapire (1995);Helmbold \& Schapire (1996). Other combination ideas are given by Ali \& Pazzani (1995);Ho (1995) and Shlien (1990).

**Page 245**- Applications of Bayesian belief networks are considered in
in the March 1995 issue of
*Communications of the ACM*by [Heckerman \& Wellman,1995; Burnell \& Horvitz,1995; Heckerman*et al.*,1995; Fung \& Del Favarro,1995]. **Page 245, 258-262**- Almond (1995) is a thesis-length treatment
of graphical computations for Dempster-Shafer belief functions,
which have probability calculations as a special case. There are
many possible join trees, and Almond considers other constructions
which may produce trees better-suited to the computational
complexity of belief-function computations. Also, as belief
functions do not have a division operator, the calculations have to
be organised to exclude the denominators from the product in the
numerator.
[Despite its claimed date, Almond's book was published several months
into 1996, with 1993 papers cited as `to appear' in its references.]
**Page 258-262**- Shafer (1996) will give an in-depth account of
join trees and their rôle in probabilistic expert systems.
[It is due in August 1996.]

**Page 304**- The experiments of Wang \& Oja (1993) comparing a
five-layer auto-associator with principal component analysis
found that the latter gave better generalization.

**Page 354**- Benveniste
*et al.*(1990) and Ljung*et al.*(1992) are books on stochastic approximation.

**Aha***et al.*(1991)-
Aha, D. W., Kibler, D. & Albert, M. K. (1991) Instance-based learning
algorithms.
*Machine Learning***6**(1), 37-66. **Ali & Pazzani (1995)**-
Ali, K. M. & Pazzani, M. J. (1995) On the link between error correlation and
error reduction in decision tree ensembles.
Technical Report 95-38, Department of Information and Computer
Science, University of California at Irvine.
**Almond (1995)**-
Almond, R. G. (1995)
*Graphical Belief Modeling*. London: Chapman & Hall. ISBN 0-412-06661-0. [Despite the date, this book was actually published in May 1996]. **Angluin & Valiant (1979)**-
Angluin, D. & Valiant, L. G. (1979) Fast probabilistic algorithms for
Hamiltonian circuits and matchings.
*Journal of Computer and System Sciences***18**, 155-193. **Anthony & Shawe-Taylor (1993)**-
Anthony, M. & Shawe-Taylor, J. (1993) A result of Vapnik with applications.
*Discrete Applied Mathematics***47**, 207-217. [Erratum (1994)**52**, 211 (the proof of theorem 2.1 is corrected)]. **Arbib (1995)**-
Arbib, M. A. (ed.) (1995)
*The Handbook of Brain Theory and Neural Networks*. Cambridge, MA: MIT Press. ISBN 0-262-01148-4. **Atkeson (1991)**-
Atkeson, C. G. (1991) Using locally weighted regression for robot learning.
In
*Proceedings of the IEEE Conference on Robotics and Automation (Sacremento, CA, 1991)*, pp. 958-963. IEEE Press. **Auer***et al.*(1995)-
Auer, P., Holte, R. C. & Maass, W. (1995) Theory and application of agnostic
PAC-learning with small decision trees.
In
*Proceedings of the Twelfth International Conference on Machine Learning*, pp. 21-29. San Francisco: Morgan Kaufmann. Also NeuroCOLT Technical Report NC-TR-96-034 (Feb 1996). **Bartlett & Williamson (1996)**-
Bartlett, P. L. & Williamson, R. C. (1996) The VC dimension and
pseudodimension of two-layer neural networks with discrete inputs.
*Neural Computation***8**(3), 625-628. **Benveniste***et al.*(1990)-
Benveniste, A., Métivier, M. & Priouret, P. (1990)
*Adaptive Algorithms and Stochastic Approximations*. Berlin: Springer. **Bratko & Muggleton (1995)**-
Bratko, I. & Muggleton, S. (1995) Applications of inductive logic programming.
*Communications of the Association for Computing Machinery***38**(11), 65-70. **Breiman (1994)**-
Breiman, L. (1994) Bagging predictors.
Technical Report 421, Department of Statistics, University of
California at Berkeley.
**Breiman (1996a)**-
Breiman, L. (1996a) Bagging predictors.
*Machine Learning*. **Breiman (1996b)**-
Breiman, L. (1996b) The heuristics of instability in model selection.
*Annals of Statistics*. **Breiman (1996c)**-
Breiman, L. (1996c) Bias, variance and arcing classifiers.
Technical report, Department of Statistics, University of California
at Berkeley.
**Brodley & Utgoff (1995)**-
Brodley, C. E. & Utgoff, P. E. (1995) Multivariate decision trees.
*Machine Learning***19**, 45-77. **Buntine & Niblett (1992)**-
Buntine, W. & Niblett, T. (1992) A further comparison of splitting rules for
decision-tree induction.
*Machine Learning***8**, 75-86. **Burnell & Horvitz (1995)**-
Burnell, L. & Horvitz, E. (1995) Structure and chance: Melding logic and
probability for software debugging.
*Communications of the ACM***38**(3), 31-41, 57. **Catlett (1991)**-
Catlett, J. (1991) On changing continuous attributes into ordered discrete
attributes.
In
*Proceedings of the European Working Session on Learning - EWSL-91*, ed. Y. Kodratoff, pp. 164-178. Berlin: Springer. **Cesa-Bianchi***et al.*(1993)-
Cesa-Bianchi, M., Freund, Y., Helmbold, D. P., Haussler, D., Schapire, R. E. &
Warmuth, M. K. (1993) How to use expert advice.
In
*Proceedings of the Twenth-Fifth ACM Symposium on the Theory of Computing (San Diego, CA, 1993)*, pp. 382-391. New York: ACM Press. **Cesa-Bianchi***et al.*(1996)-
Cesa-Bianchi, M., Freund, Y., Helmbold, D. P., Haussler, D., Schapire, R. E. &
Warmuth, M. K. (1996) How to use expert advice.
*Journal of the ACM*. **Cortes & Vapnik (1995)**-
Cortes, C. & Vapnik, V. (1995) Support-vector networks.
*Machine Learning***20**, 273-297. **Cost & Salzberg (1993)**-
Cost, S. & Salzberg, S. (1993) A weighted nearest neighbor algorithm for
learning with symbolic features.
*Machine Learning***10**, 57-78. **Cover (1968)**-
Cover, T. M. (1968) Capacity problems for linear machines.
In
*Pattern Recognition*, ed. L. Kanal, pp. 283-289. Thompson. **Craven & Shavlik (1996)**-
Craven, M. W. & Shavlik, J. W. (1996) Extracting tree-structured
representations of trained networks.
In Touretzky
*et al.*(1996), pp. 24-30. **Dasarathy (1991)**-
Dasarathy, B. V. (ed.) (1991)
*Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques*. Los Alamitos, CA: IEEE Computer Society Press. **Dougherty***et al.*(1995)-
Dougherty, J., Kohavi, R. & Sahami, M. (1995) Supervised and unsupervised
discretization of continuous features.
In
*Proceedings of the Twelfth International Conference on Machine Learning*, eds A. Prieditis & S. Russell, pp. 194-202. San Francisco: Morgan Kaufmann. **Drucker (1996)**-
Drucker, H. (1996) Fast decision tree ensembles for optical character
recognition.
In
*Proceedings of the Fifth Annual Symposium on Document Analysis and Information Retrival*. **Drucker & Cortes (1996)**-
Drucker, H. & Cortes, C. (1996) Boosting decision trees.
In Touretzky
*et al.*(1996), pp. 479-486. **Drucker***et al.*(1993)-
Drucker, H., Schapire, R. & Simard, P. (1993) Boosting performance in neural
networks.
*International Journal of Pattern Recognition and Artificial Intelligence***7**(4), 705-719. **Drucker***et al.*(1994)-
Drucker, H., Cortes, C., Jaeckel, L. D., LeCun, Y. & Vapnik, V. (1994)
Boosting and other ensemble methods.
*Neural Computation***6**(6), 1289-1301. **Elomaa & Rousu (1996)**-
Elomaa, T. & Rousu, J. (1996) Finding optimal multi-splits for numerical
attributes in decision tree learning.
NeuroCOLT Technical Report Series NC-TR-96-041, Department of
Computer Science, University of Helsinki.
**Fayyad & Irani (1993)**-
Fayyad, U. M. & Irani, K. B. (1993) Multi-interval discretization of
continuous-valued attributes in decision tree generation.
In
*Proceedings of the the Thirteenth International Joint Conference on Artificial Intelligence (Chambery, France, 1993)*, pp. 1022-1027. San Francisco: Morgan Kaufmann. **Freund (1990)**-
Freund, Y. (1990) Boosting a weak learning algorithm by majority.
In
*Proceedings of the Third Workshop on Computational Learning Theory*, pp. 202-216. Morgan Kaufmann. **Freund (1995)**-
Freund, Y. (1995) Boosting a weak learning algorithm by majority.
*Information and Computation***121**(2), 256-285. **Freund & Schapire (1995)**-
Freund, Y. & Schapire, R. E. (1995) A decision-theoretic generalization of
on-line learning and an application to boosting.
In
*Proceedings of the Second European Conference on Computational Learning Theory*, pp. 23-37. Springer. **Freund & Schapire (1996a)**-
Freund, Y. & Schapire, R. E. (1996a) Game theory, on-line prediction and
boosting.
In
*Proceedings of the Ninth Annual Conference on Computational Learning Theory*. **Freund & Schapire (1996b)**-
Freund, Y. & Schapire, R. E. (1996b) Experiments with a new boosting
algorithm.
In
*Proceedings of the Thirteenth International Conference on Machine Learning*. **Friedman (1994)**-
Friedman, J. H. (1994) Flexible metric nearest neighbor classification.
Technical report, Department of Statistics, Stanford University.
**Fukunaga & Flick (1984)**-
Fukunaga, K. & Flick, T. E. (1984) An optimal global nearest neighbor metric.
*IEEE Transactions on Pattern Analysis and Machine Intelligence***6**, 314-318. [Reprinted in Dasarathy (1991)]. **Fulton***et al.*(1995)-
Fulton, T., Kasif, S. & Salzberg, S. (1995) Efficient algorithms for finding
multi-way splits for decision trees.
In
*Proceedings of the Twelfth International Conference on Machine Learning*, eds A. Prieditis & S. Russell, pp. 244-251. San Francisco: Morgan Kaufmann. **Fung & Del Favarro (1995)**-
Fung, R. & Del Favarro, B. (1995) Applying Bayesian networks to
information retrival.
*Communications of the ACM***38**(3), 42-48, 57. **Hampson & Volper (1986)**-
Hampson, S. E. & Volper, D. J. (1986) Linear function neurons: structure and
training.
*Biological Cybernetics*. **Hastie & Tibshirani (1996a)**-
Hastie, T. & Tibshirani, R. (1996a) Discriminant adaptive nearest neighbor
classification and regression.
In Touretzky
*et al.*(1996), pp. 409-415. **Hastie & Tibshirani (1996b)**-
Hastie, T. & Tibshirani, R. (1996b) Discriminant adaptive nearest neighbor
classification.
*IEEE Transactions on Pattern Analysis and Machine Intelligence***18**, 607-618. **Heath***et al.*(1993)-
Heath, D., Kasif, S. & Salzberg, S. (1993) Learning oblique decision trees.
In
*Proceedings of the the Thirteenth International Joint Conference on Artificial Intelligence (Chambery, France, 1993)*, pp. 1002-1007. San Francisco: Morgan Kaufmann. **Heckerman & Wellman (1995)**-
Heckerman, D. & Wellman, M. P. (1995) Bayesian networks.
*Communications of the ACM***38**(3), 26-30. **Heckerman***et al.*(1995)-
Heckerman, D., Breese, J. S. & Rommelse, K. (1995) Decision-theoretic
troubleshooting.
*Communications of the ACM***38**(3), 49-57. **Helmbold & Schapire (1995)**-
Helmbold, D. P. & Schapire, R. E. (1995) Predicting nearly as well as the best
pruning of a decision tree.
In
*Proceedings of the Eight Annual Conference on Computational Learning Theory*, pp. 61-68. New York: ACM Press. **Helmbold & Schapire (1996)**-
Helmbold, D. P. & Schapire, R. E. (1996) Predicting nearly as well as the best
pruning of a decision tree.
*Machine Learning*. **Ho (1995)**-
Ho, T. K. (1995) Random decision forests.
In
*Proceedings of the Third International Conference on Document Analysis and Recognition*, pp. 278-282. IEEE Computer Society Press. **Hochreiter & Schmidhuber (1995)**-
Hochreiter, S. & Schmidhuber, J. (1995) Simplifying neural nets by discovering
flat minima.
In Tesauro
*et al.*(1995), pp. 529-536. ISBN 0-262-20104-6. **Hochreiter & Schmidhuber (1996)**-
Hochreiter, S. & Schmidhuber, J. (1996) Flat minima.
*Neural Computation*. **Holte (1993)**-
Holte, R. C. (1993) Very simple classification rules perform well on most
commonly used datasets.
*Machine Learning***11**, 63-91. **Höffgen***et al.*(1995)-
Höffgen, K.-U., Simon, H.-U. & Van Horn, K. S. (1995) Robust
trainability of single neurons.
*Journal of Computer and System Sciences***50**(1), 114-125. **Karpinski & Macintyre (1995a)**-
Karpinski, M. & Macintyre, A. (1995a) Bounding VC-dimension of neural
networks: Progress and prospects.
In
*Proceedings of the Second European Conference on Computational Learning Theory (Barcelona, Spain)*, ed. P. Vitanyi, number 904 in Lecture Notes in Artificial Intelligence, pp. 337-341. Berlin: Springer. **Karpinski & Macintyre (1995b)**-
Karpinski, M. & Macintyre, A. (1995b) Polynomial bounds for VC dimension
of sigmoidal neural networks.
In
*Proceedings of the Twenty-Seventh Annual ACM Symposium on Theory of Computing (Las Vegas)*, pp. 200-208. ACM Press. **Koiran & Sontag (1996)**-
Koiran, P. & Sontag, E. (1996) Neural networks with quadratic VC dimension.
In Touretzky
*et al.*(1996), pp. 197-203. **Krogh & Vedelsby (1995)**-
Krogh, A. & Vedelsby, J. (1995) Neural network ensembles, cross validation,
and active learning.
In Tesauro
*et al.*(1995), pp. 231-238. ISBN 0-262-20104-6. **Langley (1996)**-
Langley, P. (1996)
*Elements of Machine Learning*. San Francisco: Morgan Kaufmann. **Langley & Simon (1995)**-
Langley, P. & Simon, H. A. (1995) Applications of machine learning and rule
induction.
*Communications of the ACM***38**(11), 54-64. **Ljung***et al.*(1992)-
Ljung, L., Pflug, H. & Walk, H. (1992)
*Stochastic Approximation and Optimization of Random Systems*. Berlin: Birkhäuser. **Lowe (1995)**-
Lowe, D. G. (1995) Similarity metric learning for a variable-kernel classifier.
*Neural Computation***7**(1), 72-85. **Maass & Turán (1994)**-
Maass, W. & Turán, G. (1994) How fast can a threshold gate learn?
In
*Computational Learning Theory and Natural Learning Systems: Constraints and Prospects*, eds S. J. Hanson, G. A. Drastal & R. L. Rivest, volume I, pp. 381-414. MIT Press. **Marchand***et al.*(1990)-
Marchand, M., Golea, M. & Rujan, P. (1990) A convergence theorm for sequential
learning in two-layer perceptrons.
*Europhysics Letters***11**, 487-492. **Muroga (1965)**-
Muroga, S. (1965) Lower bounds of the number of threshold functions and a
maximum weight.
*IEEE Transactions on Electronic Computers***14**, 136-148. **Muroga***et al.*(1961)-
Muroga, S., Toda, I. & Takasu, S. (1961) Theory of majority decision elements.
*Journal of the Franklin Institute***271**, 376-418. **Murthy***et al.*(1993)-
Murthy, S. K., Kasif, S., Salzberg, S. & Beigel, R. (1993) OC1: randomized
induction of oblique decision tress.
In
*Proceedings of the Eleventh National Conference on Artificial Intelligence (Washington, DC, 1993)*, pp. 322-327. AAAI Press. ISBN 0-262-51071-5. **Murthy***et al.*(1994)-
Murthy, S. K., Kasif, S. & Salzberg, S. (1994) A system for the induction of
oblique decision trees.
*Journal of Artificial Intelligence Research***2**, 1-33. **Oliver & Hand (1995)**-
Oliver, J. J. & Hand, D. J. (1995) On pruning and averaging decision trees.
In
*Machine Learning: Proceedings of the Twelfth International Conference*, pp. 430-437. Morgan Kaufmann. **Parberry (1994)**-
Parberry, I. (1994)
*Circuit Complexity and Neural Networks*. Cambridge, MA: MIT Press. ISBN 0-262-16148-6. **Parrondo & Van der Broeck (1993)**-
Parrondo, J. M. R. & Van der Broeck, C. (1993) Vapnik-Chervonenkis
bounds for generalization.
*J. Phys. A***26**, 2211-2223. **Przytula & Prasanna (1993)**-
Przytula, K. W. & Prasanna, V. K. (1993)
*Parallel Digital Implementation of Neural Networks*. Englewood Cliffs, NJ: Prentice Hall. **Quinlan (1996a)**-
Quinlan, J. R. (1996a) Bagging, boosting, and C4.5.
In
*Proceedings of the Fourteenth National Conference on Artificial Intelligence*. Menlo Park, CA: AAAI Press. **Quinlan (1996b)**-
Quinlan, J. R. (1996b) Improved use of continuous attributes in C4.5.
*Journal of Artificial Intelligence Research***4**, 77-90. **Rachlin***et al.*(1994)-
Rachlin, J., Kasif, S., Salzberg, S. & Aha, D. (1994) Towards a better
understanding of memory-based and Bayesian classifiers.
In
*Proceedings of the Eleventh International Conference on Machine Learning (New Brunswick, NJ)*, pp. 242-250. **Sakurai (1993)**-
Sakurai, A. (1993) Tighter bounds of the VC-dimension of three-layer
networks.
In
*Proceedings of the 1993 World Congress on Neural Networks*, volume 3, pp. 540-543. Hillsdale, NJ: Erlbaum. **Schapire (1990)**-
Schapire, R. E. (1990) The strength of weak learnability.
*Machine Learning***5**(2), 197-227. **Shafer (1996)**-
Shafer, G. (1996)
*Probabilistic Expert Systems*. Number 67 in CBMS-NSF Regional Conference Series in Applied Mathematics. Philadelphia, PA: SIAM. ISBN 0-89871-373-0. **Shlien (1990)**-
Shlien, S. (1990) Multiple binary decision tree classifiers.
*Pattern Recognition***23**(7), 757-763. **Short & Fukunaga (1980)**-
Short, R. D. & Fukunaga, K. (1980) A new nearest neighbor distance measure.
In
*Proceedings of the Fifth IEEE International Conference on Pattern Recognition (Miami Beach, 1980)*, pp. 81-86. Los Alamitos, CA: IEEE Computer Society Press. **Short & Fukunaga (1981)**-
Short, R. D. & Fukunaga, K. (1981) The optimal distance measure for nearest
neighbor classification.
*IEEE Transactions on Information Theory***27**, 622-627. [Reprinted in Dasarathy (1991)]. **Tesauro***et al.*(1995)-
Tesauro, G., Touretzky, D. S. & Leen, T. K. (eds) (1995)
*Advances in Neural Information Processing Systems 7. Proceedings of the 1994 Conference*. Cambridge, MA: MIT Press. ISBN 0-262-20104-6. **Touretzky***et al.*(1996)-
Touretzky, D. S., Moser, M. C. & Hasselmo, M. E. (eds) (1996)
*Advances in Neural Information Processing Systems 8. Proceedings of the 1995 Conference*. Cambridge, MA: MIT Press. **Utgoff (1989)**-
Utgoff, P. E. (1989) Perceptron trees: a case study in hybrid concept
representations.
*Connection Science***1**(4), 377-391. **Valentin***et al.*(1994)-
Valentin, D., Abdi, H., O'Toole, A. J. & Cottrell, G. (1994) Connectionist
models of face processing: A survey.
*Pattern Recognition***27**, 1208-1230. **Vapnik (1995)**-
Vapnik, V. N. (1995)
*The Nature of Statistical Learning Theory*. New York: Springer. **Vapnik (1996)**-
Vapnik, V. N. (1996)
*Statistical Learning Theory*. New York: Wiley. **Waltz (1990)**-
Waltz, D. (1990) Memory-based reasoning.
In
*Natural and Artificial Parallel Computation*, eds M. Arbib & J. Robinson, pp. 251-276. Cambridge, MA: MIT Press. **Wang & Oja (1993)**-
Wang, L. & Oja, E. (1993) Image compression by MLP and PCA neural
networks.
In
*Proceedings of the Eight Scandinavian Conference on Image Analysis (Tromsø, Norway)*, pp. 1317-1324. **Werbos (1987)**-
Werbos, P. (1987) Learning how the world works.
In
*Proceedings of the IEEE Conference on Systems, Man and Cybernetics*, pp. 320-310. New York: IEEE Press. **Werbos (1995)**-
Werbos, P. J. (1995) Backpropagation: Basics and new developments.
pp. 134-139 of Arbib (1995).
**Willems***et al.*(1993)-
Willems, F. M. J., Shtarkov, Y. M. & Tjalkens, T. J. (1993) Context tree
weighting: A sequential universal source coding procedure for FSMX
sources.
In
*Proceedings of the 1993 IEEE International Symposium on Information Theory*, p. 59. IEEE Press. **Willems***et al.*(1995)-
Willems, F. M. J., Shtarkov, Y. M. & Tjalkens, T. J. (1995) The context-tree
weighting method: Basic properties.
*IEEE Transactions on Information Theory*pp. 653-664.

- ...
- The notation is defined on page 178.
- ...
- with an explicit bound for a
single-output network which has a leading term of where
*M*is the number of sigmoidal units

Thu Aug 15 22:57:29 BST 1996