LVQ_PAK

The Learning Vector Quantization Program Package
Version 2.1 (October 9, 1992)

Prepared by the LVQ Programming Team of the
Helsinki University of Technology
Laboratory of Computer and Information Science
Rakentajanaukio 2 C, SF-02150 Espoo
FINLAND

Copyright (C) 1991-1992 by
Teuvo Kohonen
Jari Kangas
Jorma Laaksonen
Kari Torkkola

Contents

1 Introduction
  1.1 Contents of this package
  1.2 Precautions
  1.3 The LVQ-algorithms
    1.3.1 The LVQ1
    1.3.2 The LVQ2.1
    1.3.3 The LVQ3
    1.3.4 Differences between the basic LVQ1, LVQ2.1 and LVQ3
    1.3.5 The optimized-learning-rate LVQ1 (OLVQ1)
2 General considerations
  2.1 Initialization of the codebook vectors
  2.2 Learning
  2.3 Stopping rule
3 Installation of the program package
  3.1 Getting the program code
  3.2 Installation in UNIX
  3.3 Installation in DOS
  3.4 Hardware requirements
4 File formats
  4.1 Data file formats
  4.2 Codebook file formats
5 Application of this package
  5.1 The interface program lvq_run
  5.2 Using the programs directly
  5.3 Program parameters
  5.4 Using the command lines (an example)
    5.4.1 First stage: Codebook initialization
    5.4.2 Second stage: Codebook training
    5.4.3 Third stage: Evaluation of the recognition accuracy
    5.4.4 Fourth stage: Codebook visualization
6 Description of the programs of this package
  6.1 Initialization programs
  6.2 Training programs
  6.3 Recognition accuracy program
  6.4 Classification program
  6.5 Monitoring programs
  6.6 Auxiliary subprograms
7 Comments and experiences of the use of this package
  7.1 Changes in the package
References

1 Introduction

1.1 Contents of this package

This package contains all the programs necessary for the correct application of certain LVQ (Learning Vector Quantization) algorithms in an arbitrary statistical classification or pattern recognition task, as well as a program for monitoring the codebook vectors at any time during the learning process [Kohonen et al. 1992]. Three options for the algorithms have been selected for this package: the LVQ1 (as described in [Kohonen 1990b]), the LVQ2.1 (as specified, e.g., in [Kohonen 1990a] and [Kohonen 1990c]), and the LVQ3 (as described in [Kohonen 1990b]).
NOTE: This program package is copyrighted in the sense that it may be used for scientific purposes. The package as a whole, or parts thereof, cannot be included or used in any commercial application without written permission granted by its producers. No programs contained in this package may be copied for commercial distribution.

This program package is distributed in the hope that it will be useful, but without any warranty. No author or distributor accepts responsibility to anyone for the consequences of using it, or for whether it serves any particular purpose or works at all, unless he says so in writing.

1.2 Precautions

Few traditional 'neural network' algorithms have been meant to operate directly on raw data, such as the pixels of an image or samples of speech waveforms picked up from the time domain. Most pattern recognition tasks are preceded by a preprocessing transformation that extracts invariant features from the raw data, such as spectral components of acoustical signals, or elements of co-occurrence matrices of pixels. Selecting a proper preprocessing transformation for a particular task usually requires careful consideration, and no general rules can be given here. Be cautioned: if this LVQ package is used for benchmarking against other methods, a proper preprocessing should always be applied.

In performing statistical experiments, a separate data set for training and another separate data set for testing must be used. If the number of required learning steps is larger than the number of available training samples, the samples must be used re-iteratively in training, either in a cyclical or in a randomly sampled order. The first method is implemented in this package.

1.3 The LVQ-algorithms

1.3.1 The LVQ1

Assume that a number of 'codebook vectors' m_i (free parameter vectors) are placed into the input space to approximate various domains of the input vector x by their quantized values. Usually several codebook vectors are assigned to each class of x values, and x is then decided to belong to the same class to which the nearest m_i belongs. Let

  c = arg min_i {||x - m_i||}                                        (1)

define the nearest m_i to x, denoted by m_c.

Values for the m_i that approximately minimize the misclassification errors in the above nearest-neighbor classification can be found as asymptotic values in the following learning process. Let x(t) be a sample of input and let the m_i(t) represent sequences of the m_i in the discrete-time domain. Starting with properly defined initial values (cf. Sec. 2.1), the following equations define the basic LVQ1 process:

  m_c(t + 1) = m_c(t) + alpha(t)[x(t) - m_c(t)]
    if x and m_c belong to the same class,
  m_c(t + 1) = m_c(t) - alpha(t)[x(t) - m_c(t)]                      (2)
    if x and m_c belong to different classes,
  m_i(t + 1) = m_i(t)  for i != c.

Here 0 < alpha(t) < 1, and alpha(t) may be constant or decrease monotonically with time. In the above basic LVQ1 it is recommended that alpha should initially be smaller than 0.1; a linear decrease in time is used in this package. (In the optimized LVQ1, cf. Sec. 1.3.5, different values of alpha are used.)
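For concreteness, the following fragment sketches one step of the basic LVQ1 update of Eqs. (1) and (2) in C, the language of the package. All names here (struct entry, nearest, lvq1_step) are illustrative assumptions and are not the actual routines of the lvq_pak sources.

    /* A minimal sketch of one basic LVQ1 training step, Eq. (2). */

    #include <float.h>

    struct entry {
        float *points;   /* the dim vector components */
        int   label;     /* class identifier */
    };

    /* Index of the codebook vector nearest to x, Eq. (1). */
    static int nearest(struct entry *codes, int noc, const float *x, int dim)
    {
        int i, j, c = 0;
        float d, best = FLT_MAX;
        for (i = 0; i < noc; i++) {
            d = 0.0f;
            for (j = 0; j < dim; j++) {
                float diff = x[j] - codes[i].points[j];
                d += diff * diff;
            }
            if (d < best) { best = d; c = i; }
        }
        return c;
    }

    /* One LVQ1 step: move the winner towards x if the classes agree,
       away from x if they disagree; all other vectors stay put. */
    static void lvq1_step(struct entry *codes, int noc, int dim,
                          const struct entry *x, float alpha)
    {
        int j, c = nearest(codes, noc, x->points, dim);
        float sign = (codes[c].label == x->label) ? alpha : -alpha;
        for (j = 0; j < dim; j++)
            codes[c].points[j] += sign * (x->points[j] - codes[c].points[j]);
    }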
1.3.2 The LVQ2.1

The classification decision in this algorithm is identical with that of the LVQ1. In learning, however, two codebook vectors, m_i and m_j, that are the nearest neighbors to x are now updated simultaneously. One of them must belong to the correct class and the other to a wrong class, respectively. Moreover, x must fall into a zone of values called the 'window', which is defined around the midplane of m_i and m_j.

Assume that d_i and d_j are the Euclidean distances of x from m_i and m_j, respectively; then x is defined to fall into a 'window' of relative width w if

  min(d_i/d_j, d_j/d_i) > s, where s = (1 - w)/(1 + w).              (3)

A relative 'window' width w of 0.2 to 0.3 is recommendable.

Algorithm:

  m_i(t + 1) = m_i(t) - alpha(t)[x(t) - m_i(t)],
  m_j(t + 1) = m_j(t) + alpha(t)[x(t) - m_j(t)],                     (4)

where m_i and m_j are the two closest codebook vectors to x, whereby x and m_j belong to the same class, while x and m_i belong to different classes, respectively. Furthermore, x must fall into the 'window'.

1.3.3 The LVQ3

The LVQ2 algorithm was based on the idea of differentially shifting the decision borders towards the Bayes limits, while no attention was paid to what might happen to the locations of the m_i in the long run if this process were continued. Therefore it seems necessary to include corrections that ensure that the m_i continue approximating the class distributions, at least roughly. Combining these ideas, we obtain an improved algorithm that may be called LVQ3:

  m_i(t + 1) = m_i(t) - alpha(t)[x(t) - m_i(t)],
  m_j(t + 1) = m_j(t) + alpha(t)[x(t) - m_j(t)],

where m_i and m_j are the two closest codebook vectors to x, whereby x and m_j belong to the same class, while x and m_i belong to different classes, respectively; furthermore, x must fall into the 'window';

  m_k(t + 1) = m_k(t) + epsilon * alpha(t)[x(t) - m_k(t)],           (5)

for k in {i, j}, if x, m_i and m_j belong to the same class.

In a series of experiments, applicable values of epsilon between 0.1 and 0.5 were found. The optimal value of epsilon seems to depend on the size of the window, being smaller for narrower windows. This algorithm seems to be self-stabilizing, i.e., the optimal placement of the m_i does not change in continual learning.
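The 'window' rule of Eq. (3) is shared by the LVQ2.1 and LVQ3 updates. A minimal sketch of the test in C follows; the helper name is hypothetical and does not come from the package sources.

    /* Window test of Eq. (3): x lies in a window of relative width w
       around the midplane of the two nearest codebook vectors when
       min(di/dj, dj/di) > (1 - w)/(1 + w).  di and dj are the Euclidean
       distances from x to the two vectors, assumed positive.
       Hypothetical helper, not a routine of the lvq_pak sources. */
    static int in_window(float di, float dj, float w)
    {
        float s = (1.0f - w) / (1.0f + w);
        float r = (di < dj) ? di / dj : dj / di;   /* min of the ratios */
        return r > s;
    }

For instance, w = 0.3 gives s = 0.7/1.3 = 0.54 (approximately), so the update is applied only when the two distances differ by less than roughly a factor of 1.9.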
1.3.4 Differences between the basic LVQ1, LVQ2.1 and LVQ3

The three options for the LVQ-algorithms chosen for this package, namely the LVQ1, the LVQ2.1 and the LVQ3, yield almost similar accuracies, although a different philosophy underlies each. The LVQ1 and the LVQ3 define a more robust process, whereby the codebook vectors assume stationary values even after extended learning periods. For the LVQ1 the learning rate can approximately be optimized for quick convergence (as shown in Sec. 1.3.5). In the LVQ2.1, the relative distances of the codebook vectors from the class borders are optimized, whereas there is no guarantee that the codebook vectors are placed optimally to describe the forms of the class borders. Therefore the LVQ2.1 should only be used in a differential fashion, with a small value of learning rate and a relatively low number of training steps.

1.3.5 The optimized-learning-rate LVQ1 (OLVQ1)

The basic LVQ1 algorithm is now modified in such a way that an individual learning rate alpha_i(t) is assigned to each m_i. We then get the following discrete-time learning process. Let c be defined by Eq. (1). Then

  m_c(t + 1) = m_c(t) + alpha_c(t)[x(t) - m_c(t)]
    if x is classified correctly,
  m_c(t + 1) = m_c(t) - alpha_c(t)[x(t) - m_c(t)]                    (6)
    if the classification of x is incorrect,
  m_i(t + 1) = m_i(t)  for i != c.

Next we address the problem of whether the alpha_i(t) can be determined optimally for the fastest possible convergence of (6). If we express (6) in the form

  m_c(t + 1) = [1 - s(t)alpha_c(t)]m_c(t) + s(t)alpha_c(t)x(t),      (7)

where s(t) = +1 if the classification is correct and s(t) = -1 if the classification is wrong, we first directly see that m_c(t) is statistically independent of x(t). It may also be obvious that the statistical accuracy of the learned codebook vector values is optimal if the effects of the corrections made at different times, when referring to the end of the learning period, are of equal weight. Notice that m_c(t + 1) contains a "trace" from x(t) through the last term in (7), and "traces" from the earlier x(t'), t' = 1, 2, ..., t-1, through m_c(t). The (absolute) magnitude of the last "trace" from x(t) is scaled down by the factor alpha_c(t), and, for instance, the "trace" from x(t - 1) is scaled down by [1 - s(t)alpha_c(t)]alpha_c(t - 1). Now we first stipulate that these two scalings must be identical:

  alpha_c(t) = [1 - s(t)alpha_c(t)]alpha_c(t - 1).                   (8)

If this condition is then made to hold for all t, it can be shown by induction that the "traces" collected up to time t from all the earlier x will be scaled down by an equal amount at the end, and thus the "optimal" values of alpha_i(t) are determined by the recursion

  alpha_c(t) = alpha_c(t - 1)/(1 + s(t)alpha_c(t - 1)).              (9)

Any user of the LVQ_PAK can easily become convinced that (9) really provides fast convergence. A precaution must be taken, however: since alpha_c(t) can also increase, it is especially important that it does not rise above the value 1; the learning program olvq1 in this package is even more restrictive, and never allows any alpha_i to rise above its initial value. With this provision, the initial values of the alpha_i can be selected rather high, say 0.3, whereby learning is significantly speeded up, especially in the beginning, and the m_i quickly find their approximate asymptotic values. It must be warned, too, that (9) is not applicable to the LVQ2, since thereby the alpha_i, on the average, would not decrease, and the process would not converge.
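As a sketch, the recursion (9) together with the safeguard described above could look like the following in C. The function name is illustrative, not the actual olvq1 source.

    /* Update of the individual learning rate of the winning codebook
       vector according to Eq. (9).  s is +1 for a correct and -1 for
       an incorrect classification.  As in the olvq1 program, the rate
       is also clipped so that it never exceeds its initial value.
       Illustrative sketch, not the actual package source. */
    static float update_alpha(float alpha, int s, float alpha_initial)
    {
        alpha = alpha / (1.0f + (float)s * alpha);
        if (alpha > alpha_initial)
            alpha = alpha_initial;
        return alpha;
    }

Note that with s = +1 the rate decreases monotonically, while with s = -1 it increases, which is why the clipping is needed.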
2 General considerations

In the LVQ algorithms, vector quantization is not used to approximate the density functions of the class samples (described, e.g., in [Makhoul et al. 1985]), but to define the class borders directly according to the nearest-neighbor rule. The accuracy achievable in any classification task to which the LVQ algorithms are applied, and the time needed for learning, depend on the following factors:

- an approximately optimal number of codebook vectors assigned to each class, and their initial values,
- the detailed algorithm, a proper learning rate applied during the steps, and a proper criterion for the stopping of learning.

2.1 Initialization of the codebook vectors

In many practical applications such as speech recognition, even when the a priori probabilities for the samples falling in different classes are very different, a very good strategy already is to start with the same number of codebook vectors in each class. An upper limit to the total number of codebook vectors is set by the restricted recognition time and computing power available.

Since the class borders are represented piecewise linearly by segments of midplanes between codebook vectors of neighboring classes (borders of the so-called Voronoi tessellations), it may seem an even better strategy for optimal approximation of the borders that the average distances between adjacent codebook vectors (which depend on their numbers per class) should be the same on both sides of the borders. Then, at least if the class distributions were symmetric, this would mean that the average shortest distances of the codebook vectors (or alternatively, the medians of the shortest distances) should be the same in every class.

Because the final placement of the codebook vectors is not known until the end of the learning process, their distances and thus their optimal numbers cannot be determined before that. This kind of assignment of the codebook vectors to the various classes can therefore only be made iteratively, for which there is a provision (the program named balance) in this package.

Once the tentative numbers of the codebook vectors for each class have been fixed, for their initial values one can use the first samples of the real training data picked up from the respective classes. Since the codebook vectors should always remain inside the respective class domains, for the above initial values too one can only accept samples that are not misclassified. In other words, a sample is first tentatively classified against all the other samples in the training set, for instance by the k-nearest-neighbor (KNN) method, and accepted for a possible initial value only if this tentative classification is the same as the class identifier of the sample (a minimal sketch of such a screening test is given at the end of this section). In the learning algorithm itself, however, no samples must be excluded; they are applied independently of whether they fall on the correct side of the class border or not.

In the program balance, the medians of the shortest distances between the initial codebook vectors of each class are first computed. If the distances turn out to be very different for the different classes, new codebook vectors may be added to, or old ones deleted from, the deviating classes, and a tentative training cycle based on the optimized-learning-rate LVQ1 algorithm (cf. Sec. 1.3.5) is run once. This procedure can be iterated a few times. (The exact numbers of codebook vectors are not critical; the shortest distances may differ by a factor of, say, 2, but not significantly more.)

For a good piecewise linear approximation of the borders, the medians of the shortest distances between the codebook vectors should also be somewhat smaller than the standard deviations (= square roots of the variances) of the input samples in all the respective classes. These figures are displayed by the program mindist for checking.
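The following C fragment sketches the KNN screening idea: a training sample is accepted as a potential initial codebook vector only if its own label wins a strict majority among its k nearest neighbors in the training set. All names are hypothetical; in the package the check is performed by the initialization programs and the elimin subprogram, and accepting on the own-label majority is a simplification of full majority voting.

    #include <float.h>

    #define MAX_K 16   /* assumes k <= MAX_K */

    /* Squared Euclidean distance between two dim-dimensional vectors. */
    static float dist2(const float *a, const float *b, int dim)
    {
        float d = 0.0f;
        int j;
        for (j = 0; j < dim; j++) {
            float diff = a[j] - b[j];
            d += diff * diff;
        }
        return d;
    }

    /* Accept sample s as an initial codebook vector only if the
       majority of its k nearest neighbors (itself excluded) carries
       its own label.  Hypothetical sketch, not the package source. */
    static int knn_accept(const float *data, const int *labels,
                          int n, int dim, int s, int k)
    {
        int idx[MAX_K], votes_for_own = 0, i, m, j;
        float best[MAX_K];

        for (m = 0; m < k; m++) { best[m] = FLT_MAX; idx[m] = -1; }
        for (i = 0; i < n; i++) {
            float d;
            if (i == s) continue;                 /* exclude the sample itself */
            d = dist2(data + i * dim, data + s * dim, dim);
            for (m = 0; m < k; m++) {             /* keep the k nearest sorted */
                if (d < best[m]) {
                    for (j = k - 1; j > m; j--) {
                        best[j] = best[j - 1]; idx[j] = idx[j - 1];
                    }
                    best[m] = d; idx[m] = i;
                    break;
                }
            }
        }
        for (m = 0; m < k; m++)
            if (idx[m] >= 0 && labels[idx[m]] == labels[s])
                votes_for_own++;
        return 2 * votes_for_own > k;             /* strict majority */
    }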
2.2 Learning

It is recommended that learning always be started with the optimized LVQ1 algorithm (cf. Sec. 1.3.5), which has very fast convergence; its asymptotic recognition accuracy will be achieved after a number of learning steps that is about 30 to 50 times the total number of codebook vectors. If the initial learning period, as described in Sec. 2.1, is included in the initialization of the codebook vectors, the optimized LVQ1 algorithm can be continued from those codebook vector values that have been obtained in the initialization phase.

Often the optimized LVQ1 learning phase alone may be sufficient for practical applications, especially if the learning time is critical. However, in an attempt to improve recognition accuracy, one may continue with either the basic LVQ1, the LVQ2.1 or the LVQ3, using a low initial value of learning rate, which is then the same for all the classes.

2.3 Stopping rule

It often happens that neural-network algorithms 'overlearn'; i.e., when learning and test phases are alternated, the recognition accuracy first improves until an optimum is reached; after that, when learning is continued, the accuracy starts to decrease slowly. A possible explanation in the present case is that when the codebook vectors become very specifically tuned to the training data, the ability of the algorithm to generalize for new data suffers. It is therefore necessary to stop the learning process after some 'optimal' number of steps, say, 50 to 200 times the total number of codebook vectors (depending on the particular algorithm and learning rate). Such a stopping rule can only be found by experience, and it also depends on the input data. Let us recall that the optimized-learning-rate LVQ1 may generally be stopped after a number of steps that is 30 to 50 times the number of codebook vectors (cf. Sec. 2.2).

3 Installation of the program package

In the implementation of the LVQ programs we have tried to use as simple a code as possible. Therefore the programs are supposed to compile in various machines without any specific modifications to the code. All programs have been written in ANSI C. No graphics are included in this package, so that the programs may be run equally well on all computers ranging from PCs to Cray supercomputers. The monitoring program sammon generates a list of coordinates of points that can be displayed two-dimensionally by any standard graphics program.

3.1 Getting the program code

The latest version - currently Version 2.1 - of the lvq_pak program package is available for anonymous ftp at the Internet site cochlea.hut.fi (130.233.168.48). All programs and this documentation are stored in the directory /pub/lvq_pak. The files are in multiple formats to ease their downloading and compiling. The directory /pub/lvq_pak contains the following files:

README             - short description of the lvq_pak package
lvq_doc.ps         - this document in PostScript format
lvq_doc.ps.Z       - same as above but compressed
lvq_doc.txt        - this document in ASCII format
lvq_p2r1.exe       - self-extracting MS-DOS archive file
lvq_pak-2.1.tar    - UNIX tape archive file
lvq_pak-2.1.tar.Z  - same as above but compressed

3.2 Installation in UNIX

The archive file lvq_pak-2.1.tar.Z is intended to be used when installing lvq_pak in UNIX systems. It needs to be uncompressed to get the file lvq_pak-2.1.tar. If your system doesn't support the BSD compress utility, you may download the uncompressed file directly. The tar archive contains the source code files, makefiles, and example data sets of the package, all in one subdirectory called lvq_pak-2.1. In order to create the subdirectory and extract all the files you should use the command tar xovf lvq_pak-2.1.tar. (The switches of tar unfortunately vary, so you may need to omit the 'o'.)

The package contains two makefiles for compilation in UNIX systems. Before executing the make command, one or the other of them has to be copied to the name makefile. The file makefile.bsd should work as such in most systems, whereas the other one, makefile.sv, while being more distinct and refined, is only usable in systems with the UNIX System V make utility or the GNU gmake implementation of make. If the executables are not properly built, using the other one of the makefiles should help.

We have written the source code for an ANSI standard C compiler and environment. If the cc compiler of your system doesn't fulfill these requirements, we recommend porting the public domain GNU gcc compiler to your computer. When using gcc, the makefile macro definition CC=cc has to be changed to CC=gcc. The makefile also contains some other platform-specific definitions, like optimizer and linker switches, that may need to be revised.
In order to summarize, the installation procedure is as follows:

> uncompress lvq_pak-2.1.tar.Z
> tar xovf lvq_pak-2.1.tar
> cd lvq_pak-2.1
> cp makefile.sv makefile
> make

After a successful make of the executables, you may test them by executing

> make example

which performs the commands listed in Section 5.4, "Using the command lines (an example)".

The installation and application of the lvq_pak package has been successfully carried out on the following computers (operating systems/compilers): Silicon Graphics Personal Iris and Power Series models 25, 35, and 340 (IRIX System V Release 3.3.1 / MIPS C 2.0), Iris Indigo (IRIX Release 4.0 System V / MIPS C 2.10), Sun Sparcstation (SunOS Release 4.1.1 / gcc 1.40 + makefile.sv), Cray XMP-EA/432 (UNICOS 6.0 / Cray Standard C 2.0.1), IBM RS/6000 (AIX v3.1 / C Version 01.01), Apollo 425t (Domain/OS SR10.3 / cc Rev 6.8), HP 9000/720 (HP-UX 8.05 / HP C Compiler), MIPS (RISC/os 4.52 / C 2.11).

3.3 Installation in DOS

The archive file lvq_p2r1.exe is intended to be used when installing lvq_pak in MS-DOS computers. It is a self-extracting packed archive compatible with the public domain lha utility. If your system supports the UNIX tar archiving and compress file compression utilities, you may as well use the lvq_pak-2.1.tar and lvq_pak-2.1.tar.Z archives.

The lvq_p2r1.exe archive contains the source code files, makefiles, and example data sets of the package, all in one subdirectory called lvq_pak.2r1. In order to create the subdirectory and extract all the files, simply use the command lvq_p2r1.

The package contains a makefile called makefile.dos for building the object files. Before using the make command, makefile.dos has to be copied to the name makefile. It is intended to be used with Borland Make Version 3.5 and the Borland C++ compiler Version 2.0, and may need to be revised if used with other compilation tools. Even with Borland C you may want to set some compiler switches, e.g., floating-point options, according to your hardware.

In order to summarize, the installation procedure is as follows:

> lvq_p2r1
> cd lvq_pak.2r1
> copy makefile.dos makefile
> make

After a successful make of the executables, you may test them by executing

> make example

which performs the commands listed in Section 5.4, "Using the command lines (an example)".

3.4 Hardware requirements

The archive files are about 300 kbytes in size, whereas the extracted files take about 1 Mbyte. When compiled and linked in MS-DOS, the executables are about 45 kbytes each. It is recommended to have at least 640 kbytes of RAM when using lvq_pak in MS-DOS.

The execution times of the programs depend heavily on the hardware. For example, the execution of the subprogram olvq1 in the example (Section 5.4), with 200 codebook vectors, takes about 10 seconds on a Cray XMP, about 22 seconds on Sun workstations, about 9 seconds on Silicon Graphics workstations, about 110 seconds on a PC/386DX equipped with a math co-processor and running at 25 MHz, and about 970 seconds on a similar PC without the co-processor.
4 File formats

All data files (input vectors and codebooks) are stored as ASCII files for easy editing and checking. The files that contain training data and test data are formally similar, and can be used interchangeably.

4.1 Data file formats

The input data is stored in ASCII form as a list of entries, one line being reserved for each vectorial sample. Each line consists of n floating-point numbers followed by the class label (which can be any string). The first line of the file is reserved for status knowledge of the entries; in the present version it is used to define the dimensionality of the data vectors. The data files can contain comment lines that begin with '#'; these are ignored.

The program classify can read unlabeled data vectors, so in its data files the class label may be omitted (the program classify can also read previously labeled data vectors). In all other cases the input data vectors must have labels.

An example: Consider a hypothetical data file exam.dat that represents shades of colors in a three-component form. This file contains four samples, each one comprising a three-dimensional data vector. (The dimensionality of the vectors is given on the first line.) The labels can be any strings; here 'yellow' and 'red' are the names of the classes. The second and the fifth lines are comment lines that are ignored while reading the file.

exam.dat:

3
# First the yellow entries
181.0 196.0 17.0 yellow
251.0 217.0 49.0 yellow
# Then the red entries
248.0 119.0 110.0 red
213.0 64.0 87.0 red

4.2 Codebook file formats

The codebook vectors are stored in ASCII form. The format of the entries is similar to that used in the input data files.

An example: The codebook file code.dat contains two codebook vectors that approximate the samples of the exam.dat file, one codebook vector being assigned to each class.

code.dat:

3
224.2 209.0 36.8 yellow
232.2 94.2 99.6 red
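A minimal sketch, in C, of how a reader for this file format might look: it handles the dimensionality line, '#' comment lines, and the trailing label. The function name and the fixed-size buffers are illustrative assumptions, not part of the package sources.

    #include <stdio.h>
    #include <stdlib.h>

    /* Read one data file in the format of Sec. 4.1: a dimensionality
       line, then one "x1 x2 ... xn label" entry per line; lines that
       begin with '#' are ignored.  Illustrative sketch, not the
       actual lvq_pak reader. */
    static int read_entries(const char *name)
    {
        char line[1024], label[64];
        int dim = 0, n = 0;
        FILE *fp = fopen(name, "r");

        if (fp == NULL) return -1;
        while (fgets(line, sizeof line, fp) != NULL) {
            char *p = line;
            if (*p == '#') continue;             /* comment line */
            if (*p == '\n' || *p == '\0') continue;
            if (dim == 0) {                      /* first real line: dimensionality */
                dim = atoi(p);
                continue;
            }
            {
                int j;
                for (j = 0; j < dim; j++) {      /* parse the dim components */
                    double v = strtod(p, &p);
                    (void)v;                     /* a real reader would store v */
                }
                if (sscanf(p, "%63s", label) == 1)
                    printf("entry %d: label %s\n", n, label);
                n++;
            }
        }
        fclose(fp);
        return n;                                /* number of entries read */
    }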
5 Application of this package

5.1 The interface program lvq_run

The easiest way to use the lvq_pak programs is to run them through the lvq_run interface program, whereby no separate command lines are needed. The lvq_run interactively asks the user for the needed parameters and takes care of running the recommended subprograms in the correct order. It is therefore advised that the user first use lvq_run to learn the procedures, even if he or she intends to apply the subprograms directly later on.

5.2 Using the programs directly

It is also possible to run each of the subprograms contained in this package separately and directly from the console, using the command lines defined in Sec. 6. The user should, however, make sure that the programs are then run in the correct order: first proper initialization, then training, and then tests; and that correct parameters are given (correspondence of the input and output files of subsequent programs is particularly important). Direct use facilitates, e.g., the combined application of learning algorithms.

Each program needs some parameters: file names, learning parameters, sizes of codebooks, etc. All of these must be given to the program in the beginning; the programs are not interactive in the sense that they do not ask for any parameters while running.

5.3 Program parameters

Various programs need various parameters. All the parameters that are required by any program in this package are listed below. The meaning of the parameters is obvious in most cases. The parameters can be given in any order in the commands.

-noc      Number of codebook vectors in the codebook.
-din      Name of the input data file.
-dout     Name of the output data file.
-cin      Name of the file from which the codebook vectors are read.
-cout     Name of the file to which the codebook vectors are stored.
-rlen     Running length (number of steps) in training.
-alpha    Initial learning rate parameter.
-epsilon  Relative learning rate parameter (needed in the lvq3 program).
-win      Window width parameter (needed in the lvq2 and lvq3 programs).
-knn      Number of neighbors used in knn-classification.

It is always possible to give the -v parameter (verbose parameter), which defines how much diagnostic output the program will generate. The values can range from 0 upwards, whereby greater values will generate more output; the default value is 1.

-v        Verbose parameter defining the output level.

In the program lvq_run the user is prompted by a bell sound when input is requested. The bell can be silenced with the parameter -silent by giving a value greater than 0.

-silent   Silent parameter defining whether the program lvq_run gives a bell sound at prompts.

In the initialization and training programs the random-number generator is used to select the order of the training samples, etc. The parameter -rand defines whether a new seed for the random-number generator is given; when any number other than zero is given, that number is used as the seed, otherwise the seed is read from the system clock. The default value is zero (the system clock is used).

-rand     Parameter that defines whether a new seed for the random-number generator is given.

Some auxiliary programs, not usually needed by the user, require the following parameters:

-label    Class label (string).
-cfout    Name of the classification information file.

Some examples of the use of the parameters:

> eveninit -noc 200 -din exam1.dat -cout code1.cod -knn 3

An initialization program was called above to create a total of 200 entries in the codebook. The input entries out of which the codebook vectors were formed were read from the file exam1.dat, and the codebook was stored in the file code1.cod. The entries selected as initial values of the codebook vectors were required to fall inside the class borders, which was tested automatically by knn-classification using the value k = 3.

> lvq1 -din exam1.dat -cin code1.cod -cout code1.cod -rlen 10000 -alpha 0.05

A training program (lvq1) was called. The training entries were read from the file exam1.dat; the codebook to be trained was read from the file code1.cod, and the trained codebook vectors were resaved (in this case) to the same file code1.cod. Training was defined to take 10000 steps, but if there are fewer entries in the input file, the file is iterated randomly a sufficient number of times. The initial learning rate was set to 0.05.

> accuracy -din exam2.dat -cin code1.cod

The recognition accuracy achieved with the codebook vectors stored in the file code1.cod was tested using the test data file exam2.dat.
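Because the parameters may appear in any order, a program can simply scan its argument list for a '-key value' pair. A minimal sketch of such a lookup in C follows; the helper is hypothetical and is not the argument-handling routine actually used in the package.

    #include <stdlib.h>
    #include <string.h>

    /* Return the value following the given "-key" switch in the
       argument list, or def if the switch is absent.  Hypothetical
       helper, not a routine of the lvq_pak sources. */
    static const char *get_param(int argc, char **argv,
                                 const char *key, const char *def)
    {
        int i;
        for (i = 1; i + 1 < argc; i++)
            if (strcmp(argv[i], key) == 0)
                return argv[i + 1];
        return def;
    }

    /* Example use inside a main():
         const char *din = get_param(argc, argv, "-din", NULL);
         long rlen = atol(get_param(argc, argv, "-rlen", "10000"));  */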
5.4 Using the command lines (an example)

The example given in this section demonstrates the direct use of command lines (it is thus not run under lvq_run). It is meant as an introduction to the application of this package, and it may be helpful to study it in detail. (The example may be run directly by the command make example.)

The data items used in the example are contained in this package. They consist of two data sets, ex1.dat and ex2.dat, one for training the codebook and the other for testing, respectively. Each data set contains 1962 cepstral-coefficient vectors picked up from continuous Finnish speech, from the same speaker. Each vector has a dimensionality of 20 and has been labeled to represent one phoneme.

Below, the data sets are processed, the codebooks are formed, and the recognition accuracy is evaluated.

5.4.1 First stage: Codebook initialization

For the initialization of the codebook vectors one first has to select a set of vectorial initial values that are picked up from the training data, one at a time. All the entries used for initialization must fall within the borders of the corresponding classes, which is automatically checked by knn-classification; the initialization program takes care of this. The number of codebook vectors also has to be decided at this point. For this speech recognition example, a total of about 200 codebook vectors seems to be a good choice.

The program eveninit selects the initial codebook entries from a given file, with the same number of entries allocated to each class. (The program propinit would select the initial values so that their numbers in the respective classes are proportional to the a priori probabilities of the classes.)

> eveninit -din ex1.dat -cout ex1e.cod -noc 200 -rand 1

Now the codebook entries have been selected. This time we did not get the same number of entries in all classes, because in certain classes there were not enough sample entries (e.g., in class D there were only four samples).

We can now check the number of entries selected for each class and the medians of the shortest distances using the program mindist.

> mindist -cin ex1e.cod

The recognition accuracy depends on the number of codebook entries allocated to each class. There is no simple rule for finding the best distribution of the codebook vectors. In this example we use the method of iteratively balancing the medians of the shortest distances in all classes. The program balance first computes the medians of the shortest distances for each class, and then corrects the distribution so that entries are added to those classes in which the distance is greater than the average, and deleted from those classes in which the distance is smaller than the average. Thereafter one learning cycle of the optimized-learning-rate LVQ1, i.e. the olvq1 procedure, is automatically run within the program. After this the medians of the shortest distances are computed again and displayed. (This program may be iterated, if necessary.)

> balance -din ex1.dat -cin ex1e.cod -cout ex1b.cod -rand 1

A global-parameter file for the learning-rate parameters, relating to the codebook given by -cout, is also created by this program, and the recursively updated values of these parameters are left in this file, from which they are automatically read the next time the olvq1 program is called (cf. below).

Now the codebook has been initialized and learning can begin.

5.4.2 Second stage: Codebook training

The codebook will now be trained by the fastest and most robust of all the Learning Vector Quantization algorithms, namely the optimized-learning-rate LVQ1, olvq1.

> olvq1 -din ex1.dat -cin ex1b.cod -cout ex1o.cod -rlen 5000 -rand 1

The length of the training run has to be decided in the beginning. One should notice that if the program balance was used in initialization, one cycle of olvq1 was already included in it; this training phase can therefore be shorter by that amount. The initial values for the learning-rate parameters are automatically read from the global-parameter file created by the program balance. For the run length, 5000 steps have been chosen here. In this example we do not use any fine-tuning by additional training programs.

It may have become obvious from the general description of the LVQ that an unknown vector is always classified by determining its nearest neighbor in the trained codebook.
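To make the training stage concrete, the following C fragment sketches the core loop of an OLVQ1-style trainer, combining Eqs. (6) and (9): each codebook vector carries its own learning rate, samples are taken in cyclic order, and the winner's rate is updated and clipped as in Sec. 1.3.5. All names are illustrative; the real olvq1 additionally handles file I/O and the global-parameter file mentioned above.

    #include <float.h>

    /* Sketch of an OLVQ1 training loop.  codes holds noc row vectors
       of dim floats with class indices clabels; data holds n samples
       with class indices dlabels.  Illustrative, not the olvq1 source. */
    static void olvq1_train(float *codes, const int *clabels, float *alphas,
                            int noc, const float *data, const int *dlabels,
                            int n, int dim, long rlen, float alpha0)
    {
        long t;
        int i, j;
        for (i = 0; i < noc; i++)
            alphas[i] = alpha0;                      /* e.g. 0.3 initially */
        for (t = 0; t < rlen; t++) {
            const float *x = data + (t % n) * dim;   /* cyclic order */
            int xl = dlabels[t % n];
            int c = 0, s;
            float best = FLT_MAX;
            for (i = 0; i < noc; i++) {              /* find the winner, Eq. (1) */
                float d = 0.0f;
                for (j = 0; j < dim; j++) {
                    float diff = x[j] - codes[i * dim + j];
                    d += diff * diff;
                }
                if (d < best) { best = d; c = i; }
            }
            s = (clabels[c] == xl) ? 1 : -1;         /* sign of Eq. (6) */
            for (j = 0; j < dim; j++)
                codes[c * dim + j] += s * alphas[c] * (x[j] - codes[c * dim + j]);
            alphas[c] = alphas[c] / (1.0f + s * alphas[c]);   /* Eq. (9) */
            if (alphas[c] > alpha0) alphas[c] = alpha0;       /* clip */
        }
    }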
5.4.3 Third stage: Evaluation of the recognition accuracy

Now the codebook entries have been trained to their final values, and the resulting recognition accuracy relative to the codebook can be tested. The package contains another speech entry file, ex2.dat, which is statistically independent of the file ex1.dat used in training. This file may be used for testing the trained codebook. The program accuracy can be used to test the recognition accuracy relating to any codebook vector file and test data file.

> accuracy -din ex2.dat -cin ex1o.cod

This program computes the recognition accuracy for each class separately and also the average over all the classes. The recognition accuracy resulting in this example is expected to be 90.1 %.

5.4.4 Fourth stage: Codebook visualization

NOTE: This stage is helpful but not necessary.

The trained codebook is now ready to be used for classification. This package includes some visualization programs by which the distribution and clustering of any data entries (training samples or codebook vectors) can be checked. The program sammon generates a mapping [Sammon Jr. 1969] from an n-dimensional data space to the two-dimensional plane. The two-dimensional mapping approximates the Euclidean distances of the data space, and thus visualizes the clustering of the data. The list of the mapped points can be visualized two-dimensionally by any standard graphics program.

> sammon -cin ex1o.cod -cout ex1o.sam -rlen 100

The program will store the two-dimensional image points in a fashion similar to that in which the input data entries are stored.

6 Description of the programs of this package

6.1 Initialization programs

The initialization programs initialize the codebook vectors. The total number of codebook vectors is given as one parameter (-noc).

- eveninit - This program selects an equal number of codebook vectors for each class and sets their initial values. The codebook vectors are picked up from the file defined by the parameter -din, and the selected vectors are left in the file defined by the parameter -cout. Misclassified entries are rejected automatically by the knn-classification check (whereby k may be defined by the parameter -knn; its default value is 5).

> eveninit -noc 200 -din file.dat -cout file.cod [-knn 7]

- propinit - This program defines the number of codebook vectors for each class in proportion to the a priori probabilities of the classes.

> propinit -noc 200 -din file.dat -cout file.cod [-knn 7]

- balance - This program adjusts the numbers of the codebook vectors stored in the file defined by the parameter -cin, using training data stored in the file defined by the parameter -din, so that the medians of the shortest distances between the codebook vectors in all classes are equalized. One learning cycle (all the data vectors in the training file are used once) based on the optimized-learning-rate LVQ1 algorithm is included in this program, whereby the 'learned' codebook vectors are left in the file indicated by the -cout parameter. Every time this program is called, the initial learning-rate parameters of the optimized-learning-rate LVQ1 algorithm are, for reasons explained in Sec. 1.3.5, set to 0.3, and when the program has been run, the values determined by Eq. (9) are left in a global parameter file associated with the respective codebook vector file. Notice that even if the program balance is iterated several times, only the last iteration is taken into account in olvq1 learning.

> balance -din file.dat -cin code1.cod -cout code2.cod
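The quantity that balance and mindist work with is, for each class, the median of the shortest distances between that class's codebook vectors. A minimal sketch of its computation for one class in C follows; the helper names are hypothetical and not routines of the lvq_pak sources.

    #include <float.h>
    #include <math.h>
    #include <stdlib.h>

    static int cmp_float(const void *a, const void *b)
    {
        float x = *(const float *)a, y = *(const float *)b;
        return (x > y) - (x < y);
    }

    /* Median of the shortest distances between the codebook vectors
       of one class, stored as nc rows of an nc x dim array (nc >= 2).
       work must point to nc scratch floats.  Hypothetical sketch. */
    static float median_shortest(const float *codes, int nc, int dim,
                                 float *work)
    {
        int i, k, j;
        for (i = 0; i < nc; i++) {
            float best = FLT_MAX;
            for (k = 0; k < nc; k++) {
                float d = 0.0f;
                if (k == i) continue;
                for (j = 0; j < dim; j++) {
                    float diff = codes[i * dim + j] - codes[k * dim + j];
                    d += diff * diff;
                }
                if (d < best) best = d;
            }
            work[i] = (float)sqrt(best);  /* shortest distance for vector i */
        }
        qsort(work, nc, sizeof(float), cmp_float);
        return work[nc / 2];              /* median (upper one for even nc) */
    }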
6.2 Training programs

- olvq1 - This is the optimized-learning-rate LVQ1 algorithm, recommended as the main learning algorithm. It must always be preceded by the initialization program eveninit or propinit, and possibly by the program balance, too. No explicit learning-rate parameters are defined in the command. If the initialization stage was terminated with the balance program, optimized default values for the learning-rate parameters were left by that program in the respective parameter file, from which they are automatically read by olvq1. If only the eveninit or the propinit program was used for initialization, the default values of the initial learning rates were set to 0.3 in those programs. The training data is taken from the file defined by the parameter -din, and the codebook vectors from the file defined by the parameter -cin, respectively. The trained codebook vectors are left in the file defined by the parameter -cout (which can be the same as -cin). The number of training steps is defined by the parameter -rlen.

> olvq1 -din file.dat -cin file1.cod -cout file2.cod -rlen 10000

- lvq1 - The original LVQ1 algorithm. It can be used (with a low -alpha value) for an additional fine-tuning stage in learning. The training data is taken from the file defined by -din and the codebook vectors to be fine-tuned from the file defined by -cin; the tuned codebook vectors are left in the file defined by -cout.

> lvq1 -din file.dat -cin file1.cod -cout file2.cod -alpha 0.05 -rlen 40000

- lvq2 - The LVQ2.1 version of the LVQ algorithms. It can be used (with a low -alpha value) for another additional fine-tuning stage in learning. The relative width of the 'window' into which the training data must fall is defined by the parameter -win.

> lvq2 -din file.dat -cin file1.cod -cout file2.cod -alpha 0.05 -rlen 40000 -win 0.3

- lvq3 - The LVQ3 version of the LVQ algorithms. It can be used (with a low -alpha value) for an additional fine-tuning stage in learning. The relative learning-rate parameter -epsilon is used (multiplied by the parameter -alpha) when both of the nearest codebook vectors belong to the same class. The relative width of the 'window' into which the training data must fall is defined by the parameter -win.

> lvq3 -din file.dat -cin file1.cod -cout file2.cod -alpha 0.05 -epsilon 0.1 -rlen 40000 -win 0.3

6.3 Recognition accuracy program

- accuracy - The recognition accuracy is evaluated. The codebook vectors are taken from the file defined by the parameter -cin, and the test entries from the file defined by the parameter -din, respectively. Optionally this program creates a classification information file, needed when testing the statistical significance of the difference between two classifiers with the program mcnemar.

> accuracy -din file.dat -cin file.cod [-cfout file.cfo]

6.4 Classification program

- classify - The classifications of unknown data vectors are found. The codebook vectors are taken from the file defined by the parameter -cin, and the entries to be classified from the file defined by the parameter -din, respectively. The classification results are saved to the file defined by the parameter -dout. Optionally this program creates a classification file that contains only the labels of the classified vectors.

> classify -din file.dat -cin file.cod -dout file.cla [-cfout file.cfo]
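The kind of per-class bookkeeping that the accuracy program (Sec. 6.3) performs can be sketched as follows in C: count, for each true class, how many test entries were classified correctly, then report the per-class percentages and their average. This is a hypothetical sketch, not the accuracy source.

    #include <stdio.h>

    /* Per-class and average recognition accuracy from parallel arrays
       of true and predicted class indices (0 .. nclasses-1, with
       nclasses <= 64 assumed).  Hypothetical sketch. */
    static void report_accuracy(const int *truth, const int *pred,
                                int n, int nclasses)
    {
        int total[64] = {0}, correct[64] = {0};
        int i;
        float sum = 0.0f;
        for (i = 0; i < n; i++) {
            total[truth[i]]++;
            if (truth[i] == pred[i]) correct[truth[i]]++;
        }
        for (i = 0; i < nclasses; i++) {
            float acc = total[i] ? 100.0f * correct[i] / total[i] : 0.0f;
            printf("class %d: %.1f %%\n", i, acc);
            sum += acc;
        }
        printf("average over classes: %.1f %%\n", sum / nclasses);
    }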
6.5 Monitoring programs

- labels - Displays the class labels and the number of entries in each class of a given file.

> labels -cin file.cod

- mindist - Displays the medians of the shortest distances between codebook vectors in each class, and the standard deviations of the entries in each class of the corresponding input data file (if given).

> mindist -cin file.cod [-din data.dat]

- sammon - Generates the Sammon mapping [Sammon Jr. 1969] from n-dimensional input vectors to 2-dimensional points on a plane, so that the distances between the image vectors tend to approximate the Euclidean distances of the input vectors. If the input vector set contains identical vector pairs, one of the vectors is removed from each such pair.

> sammon -cin file.cod -cout file.sam -rlen 100

- mcnemar - Computes the statistical significance of the difference between the classification results of two classifiers that have been tested with the same data. As input, two classification information files created by accuracy are required.

> mcnemar file1.cfo file2.cfo

6.6 Auxiliary subprograms

These programs are normally not applied by the user. They are mainly used by the other programs as subroutines.

- elimin - Eliminates those entries in a given file that are classified into the wrong class by the knn-classifier. The purpose is to ignore entries that lie on the wrong side of the class borders when initializing the codebook vectors. (Here 7 nearest neighbors are used in the classification.)

> elimin -din file.dat -cout file.elim -knn 7

- extract - Selects and saves only those entries that belong to a given class (here class 'K' is given).

> extract -din file.dat -cout file.ext -label K

- pick - Picks a given number of entries from a file.

> pick -din file.dat -cout file.pic -noc 10

- setlabel - Sets the labels of the codebook vectors in a codebook by finding a given number of nearest entries in the entry file and selecting the label by majority voting over them.

> setlabel -din file.dat -cin file1.cod -cout file2.cod -knn 5

- knntest - The recognition accuracy is computed using the k-nearest-neighbors classifier. Each entry is classified by majority voting over a given number of nearest neighbors in the codebook. This algorithm is primarily meant as a subprogram in the initialization of the codebook vectors, but it can be used as an independent classifier, too.

> knntest -din file.dat -cin file.cod -knn 5
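Both setlabel and knntest rely on majority voting over the k nearest neighbors. A minimal sketch of the vote itself in C (a hypothetical helper, not a routine of the package sources):

    /* Majority vote over the labels of the k nearest neighbors.
       labels[] holds small integer class indices (0 .. nclasses-1,
       with nclasses <= 64 assumed).  Hypothetical helper. */
    static int majority_label(const int *labels, int k, int nclasses)
    {
        int votes[64] = {0};
        int i, winner = 0;
        for (i = 0; i < k; i++)
            votes[labels[i]]++;
        for (i = 1; i < nclasses; i++)
            if (votes[i] > votes[winner])
                winner = i;
        return winner;
    }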
7 Comments and experiences of the use of this package

Comments on, and experiences of, the installation and use of these programs are welcome, and may be sent to the e-mail address lvq@cochlea.hut.fi.

7.1 Changes in the package

No changes have been made to the central recognition algorithms; the latter have been used successfully as such over many years. Therefore, if you have already used Version 1.0, 1.1 or 2.0 of LVQ_PAK, you will not notice any significant differences in the accuracies yielded by Version 2.1, either. However, the total computing time will now be shorter, on account of the improved balance program as well as some improvements in the best-matching-unit search. The following are the details that have been changed since Version 1.0:

0. The only change made in Version 1.1 was a bug fix in the allocation of memory.

1. In Version 2.0 the following changes were made: For the recursion of alpha(t) in Eq. (9), another justification (cf. the text) has now been found. Thereby the indexing in Eq. (9) is slightly changed from Versions 1.0 and 1.1. This change has a negligible effect on the numerical results, but in publications you should refer to the new form.

2. The program balance has been made faster in several ways, by avoiding unnecessary operations. In Versions 1.0 and 1.1 it eventually also changed the number of codebook vectors defined on the command line. In Version 2.0 this number is kept constant. If no samples are left in some class after the knn test, one codebook vector, picked from the samples, is assigned to it anyway.

3. Since the random-number generators in different computers are not identical, we have programmed our own formula into the procedures; now the examples computed by different machines are supposed to yield identical results (provided that they use a similar arithmetic).

4. The LVQ3 algorithm has been added.

5. Several amendments, which are invisible to the user but make the system more logical, have been made to the control programs.

6. In Version 2.1 the following changes have been made: It is now possible to put comment lines into the data files. The comment lines begin with '#' and are ignored while reading the data.

7. We have included a program classify that produces the classifications of unknown data vectors.

8. It is possible to use unlabeled data vectors with the program classify.

9. If there are equal vectors in the input data set of the sammon program, it now discards all of them except one. Previously such vectors corrupted the computation of the Sammon mapping.

10. We have corrected one error in the documentation. Eq. (5) is now in the correct form.

11. The routines for the search of the best matching unit have been improved. For each codebook vector, the computation of the distance between the sample vector and the codebook vector is terminated if the subdistance is already greater than the distance for the current best matching unit. This improvement decreases the computing time considerably for longer vector lengths.

References

[Kohonen et al. 1992]  Teuvo Kohonen, Jari Kangas, Jorma Laaksonen, Kari Torkkola. LVQ_PAK: A program package for the correct application of Learning Vector Quantization algorithms. In Proceedings of the International Joint Conference on Neural Networks, pages I 725-730, Baltimore, June 1992. IEEE.

[Kohonen 1989]  Teuvo Kohonen. Self-Organization and Associative Memory. Springer-Verlag, Berlin-Heidelberg-New York-Tokyo, 3rd edition, 1989.

[Kohonen 1990a]  Teuvo Kohonen. Improved versions of learning vector quantization. In Proceedings of the International Joint Conference on Neural Networks, pages I 545-550, San Diego, June 1990.

[Kohonen 1990b]  Teuvo Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464-1480, 1990.

[Kohonen 1990c]  Teuvo Kohonen. Statistical pattern recognition revisited. In Advanced Neural Computers, pages 137-144, 1990.

[Makhoul et al. 1985]  John Makhoul, S. Roucos, H. Gish. Vector quantization in speech coding. Proceedings of the IEEE, 73:1551-1588, November 1985.

[Sammon Jr. 1969]  John W. Sammon Jr. A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, C-18(5):401-409, May 1969.