Theme Feature

Artificial Neural Networks: A Tutorial

Anil K. Jain, Michigan State University
Jianchang Mao and K.M. Mohiuddin, IBM Almaden Research Center

These massively parallel systems with large numbers of interconnected simple processors may solve a variety of challenging computational problems. This tutorial provides the background and the basics.

Numerous advances have been made in developing intelligent systems, some inspired by biological neural networks. Researchers from many scientific disciplines are designing artificial neural networks (ANNs) to solve a variety of problems in pattern recognition, prediction, optimization, associative memory, and control (see the "Challenging problems" sidebar).

Conventional approaches have been proposed for solving these problems. Although successful applications can be found in certain well-constrained environments, none is flexible enough to perform well outside its domain. ANNs provide exciting alternatives, and many applications could benefit from using them.3

This article is for those readers with little or no knowledge of ANNs to help them understand the other articles in this issue of Computer. We discuss the motivations behind the development of ANNs, describe the basic biological neuron and the artificial computational model, outline network architectures and learning processes, and present some of the most commonly used ANN models. We conclude with character recognition, a successful ANN application.

WHY ARTIFICIAL NEURAL NETWORKS?

The long course of evolution has given the human brain many desirable characteristics not present in von Neumann or modern parallel computers. These include

• massive parallelism,
• distributed representation and computation,
• learning ability,
• generalization ability,
• adaptivity,
• inherent contextual information processing,
• fault tolerance, and
• low energy consumption.

It is hoped that devices based on biological neural networks will possess some of these desirable characteristics.

Modern digital computers outperform humans in the domain of numeric computation and related symbol manipulation. However, humans can effortlessly solve complex perceptual problems (like recognizing a man in a crowd from a mere glimpse of his face) at such a high speed and extent as to dwarf the world's fastest computer. Why is there such a remarkable difference in their performance? The biological neural system architecture is completely different from the von Neumann architecture (see Table 1). This difference significantly affects the type of functions each computational model can best perform.

Numerous efforts to develop "intelligent" programs based on von Neumann's centralized architecture have not resulted in general-purpose intelligent programs. Inspired by biological neural networks, ANNs are massively parallel computing systems consisting of an extremely large number of simple processors with many interconnections. ANN models attempt to use some "organizational" principles believed to be used in the human brain.
Challenging problems

Let us consider the following problems of interest to computer scientists and engineers.

Pattern classification
The task of pattern classification is to assign an input pattern (like a speech waveform or handwritten symbol) represented by a feature vector to one of many prespecified classes (see Figure A1). Well-known applications include character recognition, speech recognition, EEG waveform classification, blood cell classification, and printed circuit board inspection.

Clustering/categorization
In clustering, also known as unsupervised pattern classification, there are no training data with known class labels. A clustering algorithm explores the similarity between the patterns and places similar patterns in a cluster (see Figure A2). Well-known clustering applications include data mining, data compression, and exploratory data analysis.

Function approximation
Suppose a set of n labeled training patterns (input-output pairs), {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, have been generated from an unknown function μ(x) (subject to noise). The task of function approximation is to find an estimate, say μ̂, of the unknown function μ (Figure A3). Various engineering and scientific modeling problems require function approximation.

Prediction/forecasting
Given a set of n samples {y(t_1), y(t_2), …, y(t_n)} in a time sequence, t_1, t_2, …, t_n, the task is to predict the sample y(t_{n+1}) at some future time t_{n+1}. Prediction/forecasting has a significant impact on decision-making in business, science, and engineering. Stock market prediction and weather forecasting are typical applications of prediction/forecasting techniques (see Figure A4).

Optimization
A wide variety of problems in mathematics, statistics, engineering, science, medicine, and economics can be posed as optimization problems. The goal of an optimization algorithm is to find a solution satisfying a set of constraints such that an objective function is maximized or minimized. The Traveling Salesman Problem (TSP), an NP-complete problem, is a classic example (see Figure A5).

Content-addressable memory
In the von Neumann model of computation, an entry in memory is accessed only through its address, which is independent of the content in the memory. Moreover, if a small error is made in calculating the address, a completely different item can be retrieved. Associative memory, or content-addressable memory, as the name implies, can be accessed by its content. The content in the memory can be recalled even by a partial input or distorted content (see Figure A6). Associative memory is extremely desirable in building multimedia information databases.

Control
Consider a dynamic system defined by a tuple {u(t), y(t)}, where u(t) is the control input and y(t) is the resulting output of the system at time t. In model-reference adaptive control, the goal is to generate a control input u(t) such that the system follows a desired trajectory determined by the reference model. An example is engine idle-speed control (Figure A7).

Figure A. Tasks that neural networks can perform: (1) pattern classification; (2) clustering/categorization; (3) function approximation; (4) prediction/forecasting; (5) optimization (a TSP example); (6) retrieval by content; and (7) control (engine idle speed). (Adapted from DARPA Neural Network Study.)
Modeling a biological nervous system using ANNs can also increase our understanding of biological functions. State-of-the-art computer hardware technology (such as VLSI and optical) has made this modeling feasible.

Table 1. Von Neumann computer versus biological neural system.

                      | Von Neumann computer                                          | Biological neural system
Processor             | Complex; high speed; one or a few                             | Simple; low speed; a large number
Memory                | Separate from a processor; localized; noncontent addressable | Integrated into processor; distributed; content addressable
Computing             | Centralized; sequential; stored programs                      | Distributed; parallel; self-learning
Reliability           | Very vulnerable                                               | Robust
Expertise             | Numerical and symbolic manipulations                          | Perceptual problems
Operating environment | Well-defined, well-constrained                                | Poorly defined, unconstrained

A thorough study of ANNs requires knowledge of neurophysiology, cognitive science/psychology, physics (statistical mechanics), control theory, computer science, artificial intelligence, statistics/mathematics, pattern recognition, computer vision, parallel processing, and hardware (digital/analog/VLSI/optical). New developments in these disciplines continuously nourish the field. On the other hand, ANNs also provide an impetus to these disciplines in the form of new tools and representations. This symbiosis is necessary for the vitality of neural network research. Communications among these disciplines ought to be encouraged.

Brief historical review
ANN research has experienced three periods of extensive activity. The first peak in the 1940s was due to McCulloch and Pitts' pioneering work.4 The second occurred in the 1960s with Rosenblatt's perceptron convergence theorem5 and Minsky and Papert's work showing the limitations of a simple perceptron.6 Minsky and Papert's results dampened the enthusiasm of most researchers, especially those in the computer science community. The resulting lull in neural network research lasted almost 20 years. Since the early 1980s, ANNs have received considerable renewed interest. The major developments behind this resurgence include Hopfield's energy approach7 in 1982 and the back-propagation learning algorithm for multilayer perceptrons (multilayer feed-forward networks) first proposed by Werbos,8 reinvented several times, and then popularized by Rumelhart et al.9 in 1986. Anderson and Rosenfeld10 provide a detailed historical account of ANN developments.
Biological neural networks
A neuron (or nerve cell) is a special biological cell that processes information (see Figure 1). It is composed of a cell body, or soma, and two types of out-reaching tree-like branches: the axon and the dendrites. The cell body has a nucleus that contains information about hereditary traits and a plasma that holds the molecular equipment for producing material needed by the neuron. A neuron receives signals (impulses) from other neurons through its dendrites (receivers) and transmits signals generated by its cell body along the axon (transmitter), which eventually branches into strands and substrands. At the terminals of these strands are the synapses. A synapse is an elementary structure and functional unit between two neurons (an axon strand of one neuron and a dendrite of another). When the impulse reaches the synapse's terminal, certain chemicals called neurotransmitters are released. The neurotransmitters diffuse across the synaptic gap, to enhance or inhibit, depending on the type of the synapse, the receptor neuron's own tendency to emit electrical impulses. The synapse's effectiveness can be adjusted by the signals passing through it so that the synapses can learn from the activities in which they participate. This dependence on history acts as a memory, which is possibly responsible for human memory.

Figure 1. A sketch of a biological neuron.

The cerebral cortex in humans is a large flat sheet of neurons about 2 to 3 millimeters thick with a surface area of about 2,200 cm², about twice the area of a standard computer keyboard. The cerebral cortex contains about 10¹¹ neurons, which is approximately the number of stars in the Milky Way.11 Neurons are massively connected, much more complex and dense than telephone networks. Each neuron is connected to 10³ to 10⁴ other neurons. In total, the human brain contains approximately 10¹⁴ to 10¹⁵ interconnections.

Neurons communicate through a very short train of pulses, typically milliseconds in duration. The message is modulated on the pulse-transmission frequency. This frequency can vary from a few to several hundred hertz, which is a million times slower than the fastest switching speed in electronic circuits. However, complex perceptual decisions such as face recognition are typically made by humans within a few hundred milliseconds. These decisions are made by a network of neurons whose operational speed is only a few milliseconds. This implies that the computations cannot take more than about 100 serial stages.
In other words, the brain runs parallel programs that are about 100 steps long for such perceptual tasks. This is known as the hundred step rule.12 The same timing considerations show that the amount of information sent from one neuron to another must be very small (a few bits). This implies that critical information is not transmitted directly, but captured and distributed in the interconnections; hence the name, connectionist model, used to describe ANNs.

Interested readers can find more introductory and easily comprehensible material on biological neurons and neural networks in Brunak and Lautrup.11

ANN OVERVIEW

Computational models of neurons
McCulloch and Pitts4 proposed a binary threshold unit as a computational model for an artificial neuron (see Figure 2).

Figure 2. McCulloch-Pitts model of a neuron.

This mathematical neuron computes a weighted sum of its n input signals, x_j, j = 1, 2, …, n, and generates an output of 1 if this sum is above a certain threshold u. Otherwise, an output of 0 results. Mathematically,

    y = θ(Σ_{j=1}^{n} w_j x_j − u),

where θ(·) is a unit step function at 0, and w_j is the synapse weight associated with the jth input. For simplicity of notation, we often consider the threshold u as another weight w_0 = −u attached to the neuron with a constant input x_0 = 1. Positive weights correspond to excitatory synapses, while negative weights model inhibitory ones. McCulloch and Pitts proved that, in principle, suitably chosen weights let a synchronous arrangement of such neurons perform universal computations. There is a crude analogy here to a biological neuron: wires and interconnections model axons and dendrites, connection weights represent synapses, and the threshold function approximates the activity in a soma. The McCulloch and Pitts model, however, contains a number of simplifying assumptions that do not reflect the true behavior of biological neurons.
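To make the model concrete, here is a minimal sketch (ours, not from the article) of a McCulloch-Pitts unit in Python; the weights and threshold are arbitrary values chosen so that the unit computes a two-input AND.

```python
import numpy as np

def mp_neuron(x, w, u):
    # Weighted sum of the inputs minus the threshold, passed through a unit step.
    v = np.dot(w, x) - u
    return 1 if v > 0 else 0

# Hypothetical weights and threshold implementing a two-input AND gate.
w, u = np.array([1.0, 1.0]), 1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, mp_neuron(np.array(x), w, u))   # fires only for (1, 1)
```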
The McCulloch-Pitts neuron has been generalized in many ways. An obvious one is to use activation functions other than the threshold function, such as piecewise linear, sigmoid, or Gaussian, as shown in Figure 3. The sigmoid function is by far the most frequently used in ANNs. It is a strictly increasing function that exhibits smoothness and has the desired asymptotic properties. The standard sigmoid function is the logistic function, defined by

    g(x) = 1 / (1 + exp(−βx)),

where β is the slope parameter.

Figure 3. Different types of activation functions: (a) threshold, (b) piecewise linear, (c) sigmoid, and (d) Gaussian.
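The four activation functions of Figure 3 are easy to state directly. The sketch below is our own; the slope β, the piecewise-linear breakpoints, and the Gaussian width are arbitrary choices.

```python
import numpy as np

def threshold(v):
    return np.where(v > 0, 1.0, 0.0)

def piecewise_linear(v):
    # Slope 1 between -0.5 and 0.5, saturating at 0 and 1.
    return np.clip(v + 0.5, 0.0, 1.0)

def logistic(v, beta=1.0):
    # The standard sigmoid g(x) = 1 / (1 + exp(-beta * x)).
    return 1.0 / (1.0 + np.exp(-beta * v))

def gaussian(v, sigma=1.0):
    return np.exp(-v**2 / (2.0 * sigma**2))
```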
Network architectures
ANNs can be viewed as weighted directed graphs in which artificial neurons are nodes and directed edges (with weights) are connections between neuron outputs and neuron inputs.

Based on the connection pattern (architecture), ANNs can be grouped into two categories (see Figure 4):

• feed-forward networks, in which graphs have no loops, and
• recurrent (or feedback) networks, in which loops occur because of feedback connections.

Figure 4. A taxonomy of feed-forward and recurrent/feedback network architectures. Feed-forward networks include the single-layer perceptron, multilayer perceptron, and radial basis function nets; recurrent/feedback networks include competitive networks, Kohonen's SOM, the Hopfield network, and ART models.

In the most common family of feed-forward networks, called multilayer perceptron, neurons are organized into layers that have unidirectional connections between them. Figure 4 also shows typical networks for each category.

Different connectivities yield different network behaviors. Generally speaking, feed-forward networks are static; that is, they produce only one set of output values rather than a sequence of values from a given input. Feed-forward networks are memory-less in the sense that their response to an input is independent of the previous network state. Recurrent, or feedback, networks, on the other hand, are dynamic systems. When a new input pattern is presented, the neuron outputs are computed. Because of the feedback paths, the inputs to each neuron are then modified, which leads the network to enter a new state.

Different network architectures require appropriate learning algorithms. The next section provides an overview of learning processes.

Learning
The ability to learn is a fundamental trait of intelligence. Although a precise definition of learning is difficult to formulate, a learning process in the ANN context can be viewed as the problem of updating network architecture and connection weights so that a network can efficiently perform a specific task. The network usually must learn the connection weights from available training patterns. Performance is improved over time by iteratively updating the weights in the network. ANNs' ability to automatically learn from examples makes them attractive and exciting. Instead of following a set of rules specified by human experts, ANNs appear to learn underlying rules (like input-output relationships) from the given collection of representative examples. This is one of the major advantages of neural networks over traditional expert systems.

To understand or design a learning process, you must first have a model of the environment in which a neural network operates; that is, you must know what information is available to the network. We refer to this model as a learning paradigm.
Second, you must understand how network weights are updated; that is, which learning rules govern the updating process. A learning algorithm refers to a procedure in which learning rules are used for adjusting the weights.

There are three main learning paradigms: supervised, unsupervised, and hybrid. In supervised learning, or learning with a "teacher," the network is provided with a correct answer (output) for every input pattern. Weights are determined to allow the network to produce answers as close as possible to the known correct answers. Reinforcement learning is a variant of supervised learning in which the network is provided with only a critique on the correctness of network outputs, not the correct answers themselves. In contrast, unsupervised learning, or learning without a teacher, does not require a correct answer associated with each input pattern in the training data set. It explores the underlying structure in the data, or correlations between patterns in the data, and organizes patterns into categories from these correlations. Hybrid learning combines supervised and unsupervised learning. Part of the weights are usually determined through supervised learning, while the others are obtained through unsupervised learning.

Learning theory must address three fundamental and practical issues associated with learning from samples: capacity, sample complexity, and computational complexity. Capacity concerns how many patterns can be stored, and what functions and decision boundaries a network can form. Sample complexity determines the number of training patterns needed to train the network to guarantee a valid generalization. Too few patterns may cause "over-fitting" (wherein the network performs well on the training data set, but poorly on independent test patterns drawn from the same distribution as the training patterns, as in Figure A3). Computational complexity refers to the time required for a learning algorithm to estimate a solution from training patterns. Many existing learning algorithms have high computational complexity. Designing efficient algorithms for neural network learning is a very active research topic.

There are four basic types of learning rules: error-correction, Boltzmann, Hebbian, and competitive learning.

ERROR-CORRECTION RULES. In the supervised learning paradigm, the network is given a desired output for each input pattern. During the learning process, the actual output y generated by the network may not equal the desired output d. The basic principle of error-correction learning rules is to use the error signal (d − y) to modify the connection weights to gradually reduce this error.

The perceptron learning rule is based on this error-correction principle. A perceptron consists of a single neuron with adjustable weights, w_j, j = 1, 2, …, n, and threshold u, as shown in Figure 2. Given an input vector x = (x_1, x_2, …, x_n)^t, the net input to the neuron is
    v = Σ_{j=1}^{n} w_j x_j − u.

The output y of the perceptron is +1 if v > 0, and 0 otherwise. In a two-class classification problem, the perceptron assigns an input pattern to one class if y = 1, and to the other class if y = 0. The linear equation

    Σ_{j=1}^{n} w_j x_j − u = 0

defines the decision boundary (a hyperplane in the n-dimensional input space) that halves the space.

Rosenblatt5 developed a learning procedure to determine the weights and threshold in a perceptron, given a set of training patterns (see the "Perceptron learning algorithm" sidebar).

Perceptron learning algorithm
1. Initialize the weights and threshold to small random numbers.
2. Present a pattern vector (x_1, x_2, …, x_n)^t and evaluate the output of the neuron.
3. Update the weights according to

    w_j(t + 1) = w_j(t) + η(d − y)x_j,

where d is the desired output, t is the iteration number, and η (0.0 < η < 1.0) is the learning rate (step size).

Note that learning occurs only when the perceptron makes an error. Rosenblatt proved that when training patterns are drawn from two linearly separable classes, the perceptron learning procedure converges after a finite number of iterations. This is the perceptron convergence theorem. In practice, you do not know whether the patterns are linearly separable. Many variations of this learning algorithm have been proposed in the literature.2 Other activation functions that lead to different learning characteristics can also be used. However, a single-layer perceptron can only separate linearly separable patterns as long as a monotonic activation function is used.
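The sidebar translates almost directly into code. Below is a minimal sketch (ours, not the authors') of the perceptron rule on a toy linearly separable problem; the data, learning rate, and epoch limit are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_perceptron(X, d, eta=0.5, max_epochs=100):
    # Fold the threshold u into the weights as w_0 = -u with constant input x_0 = 1.
    Xb = np.hstack([np.ones((len(X), 1)), X])
    w = rng.uniform(-0.1, 0.1, Xb.shape[1])     # step 1: small random weights
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(Xb, d):            # step 2: present patterns
            y = 1 if np.dot(w, x) > 0 else 0
            if y != target:                     # learning occurs only on error
                w += eta * (target - y) * x     # step 3: update rule
                errors += 1
        if errors == 0:                         # converged (separable case)
            break
    return w

# Toy linearly separable problem (logical AND).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
d = np.array([0, 0, 0, 1])
print(train_perceptron(X, d))
```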
The back-propagation learning algorithm (see the "Back-propagation algorithm" sidebar) is also based on the error-correction principle.

BOLTZMANN LEARNING. Boltzmann machines are symmetric recurrent networks consisting of binary units (+1 for "on" and −1 for "off"). By symmetric, we mean that the weight on the connection from unit i to unit j is equal to the weight on the connection from unit j to unit i (w_ij = w_ji). A subset of the neurons, called visible, interact with the environment; the rest, called hidden, do not. Each neuron is a stochastic unit that generates an output (or state) according to the Boltzmann distribution of statistical mechanics. Boltzmann machines operate in two modes: clamped, in which visible neurons are clamped onto specific states determined by the environment, and free-running, in which both visible and hidden neurons are allowed to operate freely.

Boltzmann learning is a stochastic learning rule derived from information-theoretic and thermodynamic principles.10 The objective of Boltzmann learning is to adjust the connection weights so that the states of visible units satisfy a particular desired probability distribution. According to the Boltzmann learning rule, the change in the connection weight w_ij is given by

    Δw_ij = η(ρ̄_ij − ρ_ij),

where η is the learning rate, and ρ̄_ij and ρ_ij are the correlations between the states of units i and j when the network operates in the clamped mode and free-running mode, respectively. The values of ρ̄_ij and ρ_ij are usually estimated from Monte Carlo experiments, which are extremely slow.

Boltzmann learning can be viewed as a special case of error-correction learning in which error is measured not as the direct difference between desired and actual outputs, but as the difference between the correlations among the outputs of two neurons under clamped and free-running operating conditions.
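The weight update itself is simple once the two correlation estimates are in hand. The sketch below is ours; the state matrices stand in for samples that a real implementation would collect by Gibbs sampling in each mode (not shown here).

```python
import numpy as np

def boltzmann_update(W, clamped_states, free_states, eta=0.01):
    # Each states array has shape (num_samples, num_units) with entries +1/-1.
    # Correlations are averages of pairwise products of unit states.
    rho_clamped = clamped_states.T @ clamped_states / len(clamped_states)
    rho_free = free_states.T @ free_states / len(free_states)
    W_new = W + eta * (rho_clamped - rho_free)
    np.fill_diagonal(W_new, 0.0)   # no self-connections; symmetry is preserved
    return W_new
```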
HEBBIAN RULE. The oldest learning rule is Hebb's postulate of learning.13 Hebb based it on the following observation from neurobiological experiments: If neurons on both sides of a synapse are activated synchronously and repeatedly, the synapse's strength is selectively increased. Mathematically, the Hebbian rule can be described as

    w_ij(t + 1) = w_ij(t) + η y_j(t) x_i(t),

where x_i and y_j are the output values of neurons i and j, respectively, which are connected by the synapse w_ij, and η is the learning rate. Note that x_i is the input to the synapse.

An important property of this rule is that learning is done locally; that is, the change in synapse weight depends only on the activities of the two neurons connected by it. This significantly simplifies the complexity of the learning circuit in a VLSI implementation.
A single neuron trained using the Hebbian rule exhibits an orientation selectivity. Figure 5 demonstrates this property. The points depicted are drawn from a two-dimensional Gaussian distribution and used for training a neuron. The weight vector of the neuron is initialized to w_0 as shown in the figure. As the learning proceeds, the weight vector moves progressively closer to the direction w of maximal variance in the data. In fact, w is the eigenvector of the covariance matrix of the data corresponding to the largest eigenvalue.

Figure 5. Orientation selectivity of a single neuron trained using the Hebbian rule.
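The Figure 5 experiment is easy to reproduce. The sketch below is our own; the covariance matrix and learning rate are arbitrary, and we renormalize the weight vector after each step (much as Oja's rule does in a more principled way) because the raw Hebbian update grows without bound.

```python
import numpy as np

rng = np.random.default_rng(1)

# Zero-mean 2D Gaussian with a strongly anisotropic covariance.
cov = np.array([[3.0, 1.2], [1.2, 0.8]])
X = rng.multivariate_normal([0.0, 0.0], cov, size=2000)

w = rng.normal(size=2)          # initial weight vector w_0
eta = 0.01
for x in X:
    y = np.dot(w, x)            # linear neuron output
    w += eta * y * x            # Hebbian update
    w /= np.linalg.norm(w)      # keep w bounded

# w now points (up to sign) along the eigenvector with the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(np.cov(X.T))
print(w, eigvecs[:, np.argmax(eigvals)])
```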
COMPETITIVE LEARNING RULES. Unlike Hebbian learning (in which multiple output units can be fired simultaneously), competitive-learning output units compete among themselves for activation. As a result, only one output unit is active at any given time. This phenomenon is known as winner-take-all. Competitive learning has been found to exist in biological neural networks.3

Competitive learning often clusters or categorizes the input data. Similar patterns are grouped by the network and represented by a single unit. This grouping is done automatically based on data correlations.

The simplest competitive learning network consists of a single layer of output units as shown in Figure 4. Each output unit i in the network connects to all the input units (x_j's) via weights, w_ij, j = 1, 2, …, n. Each output unit also connects to all other output units via inhibitory weights but has a self-feedback with an excitatory weight. As a result of competition, only the unit i* with the largest (or the smallest) net input becomes the winner; that is, w_i* · x ≥ w_i · x, ∀i, or ||w_i* − x|| ≤ ||w_i − x||, ∀i. When all the weight vectors are normalized, these two inequalities are equivalent.

A simple competitive learning rule can be stated as

    Δw_ij = η(x_j − w_ij)  if i = i*,
    Δw_ij = 0              if i ≠ i*.     (1)

Note that only the weights of the winner unit get updated. The effect of this learning rule is to move the stored pattern in the winner unit (weights) a little bit closer to the input pattern.

Figure 6 demonstrates a geometric interpretation of competitive learning. In this example, we assume that all input vectors have been normalized to have unit length. They are depicted as black dots in Figure 6. The weight vectors of the three units are randomly initialized. Their initial and final positions on the sphere after competitive learning are marked as Xs in Figures 6a and 6b, respectively. In Figure 6, each of the three natural groups (clusters) of patterns has been discovered by an output unit whose weight vector points to the center of gravity of the discovered group.

Figure 6. An example of competitive learning: (a) before learning; (b) after learning.
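A minimal sketch of rule (1), using the nearest-prototype form of the competition, is shown below; the three output units, the toy two-dimensional clusters, and the decaying learning rate (which anticipates the stability issue discussed next) are our own arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Three well-separated 2D clusters (toy data).
X = np.vstack([rng.normal(c, 0.1, size=(100, 2))
               for c in ([0, 0], [1, 0], [0, 1])])
rng.shuffle(X)

W = rng.normal(size=(3, 2))      # one weight (prototype) vector per output unit
eta = 0.1
for t, x in enumerate(X):
    winner = np.argmin(np.linalg.norm(W - x, axis=1))   # competition
    W[winner] += eta * (x - W[winner])                  # rule (1): winner only
    eta = 0.1 / (1 + 0.01 * t)                          # gradually decreasing rate

# Each row of W ends up near the center of gravity of one cluster.
print(W)
```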
You can see from the competitive learning rule that the network will not stop learning (updating weights) unless the learning rate η is 0. A particular input pattern can fire different output units at different iterations during learning. This brings up the stability issue of a learning system. The system is said to be stable if no pattern in the training data changes its category after a finite number of learning iterations. One way to achieve stability is to force the learning rate to decrease gradually as the learning process proceeds towards 0. However, this artificial freezing of learning causes another problem termed plasticity, which is the ability to adapt to new data. This is known as Grossberg's stability-plasticity dilemma in competitive learning.

The most well-known example of competitive learning is vector quantization for data compression. It has been widely used in speech and image processing for efficient storage, transmission, and modeling. Its goal is to represent a set or distribution of input vectors with a relatively small number of prototype vectors (weight vectors), or a codebook. Once a codebook has been constructed and agreed upon by both the transmitter and the receiver, you need only transmit or store the index of the prototype corresponding to the input vector. Given an input vector, its corresponding prototype can be found by searching for the nearest prototype in the codebook.
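Vector quantization with a learned codebook then amounts to a nearest-prototype lookup. A small self-contained illustration (ours; the codebook values are hypothetical stand-ins for a trained weight matrix):

```python
import numpy as np

# Hypothetical 2D codebook, e.g. the weight matrix learned by competitive learning.
W = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

def vq_encode(X, W):
    # Index of the nearest prototype (codeword) for each input vector.
    return np.array([np.argmin(np.linalg.norm(W - x, axis=1)) for x in X])

X = np.array([[0.1, -0.05], [0.9, 0.1], [0.2, 1.1]])
codes = vq_encode(X, W)   # transmit or store these small integers
X_hat = W[codes]          # the receiver looks up approximate vectors
print(codes, X_hat)
```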
Back-propagation algorithm
1. Initialize the weights to small random values.
2. Randomly choose an input pattern x^(μ).
3. Propagate the signal forward through the network.
4. Compute δ_i^L in the output layer (o_i = y_i^L):

    δ_i^L = g′(h_i^L)[d_i^μ − y_i^L],

where h_i^l represents the net input to the ith unit in the lth layer, and g′ is the derivative of the activation function g.
5. Compute the deltas for the preceding layers by propagating the errors backwards:

    δ_i^l = g′(h_i^l) Σ_j w_ij^(l+1) δ_j^(l+1),

for l = (L − 1), …, 1.
6. Update weights using

    Δw_ij^l = η δ_j^l y_i^(l−1).

7. Go to step 2 and repeat for the next pattern until the error in the output layer is below a prespecified threshold or a maximum number of iterations is reached.
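For concreteness, here is a compact sketch of the sidebar's steps (ours, not the article's) for a fully connected network of logistic units trained on the exclusive-OR problem; the layer sizes, learning rate, and iteration count are arbitrary, and an unlucky random seed may require a restart.

```python
import numpy as np

rng = np.random.default_rng(3)

def g(v):                       # logistic activation
    return 1.0 / (1.0 + np.exp(-v))

def g_prime(v):                 # g'(v) = g(v) * (1 - g(v))
    s = g(v)
    return s * (1.0 - s)

# XOR: 2 inputs, 3 hidden units, 1 output; biases folded in as extra weights.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)
sizes = [2, 3, 1]
W = [rng.uniform(-1, 1, (n_in + 1, n_out))      # step 1: small random weights
     for n_in, n_out in zip(sizes[:-1], sizes[1:])]
eta = 0.5

for _ in range(20000):
    k = rng.integers(len(X))                    # step 2: pick a pattern
    ys, hs = [np.append(X[k], 1.0)], []
    for Wl in W:                                # step 3: forward pass
        hs.append(ys[-1] @ Wl)
        ys.append(np.append(g(hs[-1]), 1.0))
    delta = g_prime(hs[-1]) * (D[k] - ys[-1][:-1])      # step 4: output deltas
    for l in range(len(W) - 1, -1, -1):
        grad = np.outer(ys[l], delta)
        if l > 0:                               # step 5: propagate errors back
            delta = g_prime(hs[l - 1]) * (W[l][:-1] @ delta)
        W[l] += eta * grad                      # step 6: weight update

def forward(x):
    y = np.append(x, 1.0)
    for Wl in W:
        y = np.append(g(y @ Wl), 1.0)
    return y[:-1]

print([round(forward(x)[0], 2) for x in X])     # approximately [0, 1, 1, 0]
```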
SUMMARY. Table 2 summarizes various learning algorithms and their associated network architectures (this is not an exhaustive list). Both supervised and unsupervised learning paradigms employ learning rules based on error-correction, Hebbian, and competitive learning.
Learning rules based on error-correction can be used for training feed-forward networks, while Hebbian learning rules have been used for all types of network architectures. However, each learning algorithm is designed for training a specific architecture. Therefore, when we discuss a learning algorithm, a particular network architecture association is implied. Each algorithm can perform only a few tasks well. The last column of Table 2 lists the tasks that each algorithm can perform. Due to space limitations, we do not discuss some other algorithms, including Adaline, Madaline,14 linear discriminant analysis,15 Sammon's projection,16 and principal component analysis.2 Interested readers can consult the corresponding references (this article does not always cite the first paper proposing the particular algorithms).

Table 2. Well-known learning algorithms.

Paradigm | Learning rule | Architecture | Learning algorithm | Task
Supervised | Error-correction | Single- or multilayer perceptron | Perceptron learning algorithms, back-propagation, Adaline and Madaline | Pattern classification, function approximation, prediction, control
Supervised | Boltzmann | Recurrent | Boltzmann learning algorithm | Pattern classification
Supervised | Hebbian | Multilayer feed-forward | Linear discriminant analysis | Data analysis, pattern classification
Supervised | Competitive | Competitive | Learning vector quantization | Within-class categorization, data compression
Supervised | Competitive | ART network | ARTMap | Pattern classification, within-class categorization
Unsupervised | Error-correction | Multilayer feed-forward | Sammon's projection | Data analysis
Unsupervised | Hebbian | Feed-forward or competitive | Principal component analysis | Data analysis, data compression
Unsupervised | Hebbian | Hopfield network | Associative memory learning | Associative memory
Unsupervised | Competitive | Competitive | Vector quantization | Categorization, data compression
Unsupervised | Competitive | Kohonen's SOM | Kohonen's SOM | Categorization, data analysis
Unsupervised | Competitive | ART networks | ART1, ART2 | Categorization
Hybrid | Error-correction and competitive | RBF network | RBF learning algorithm | Pattern classification, function approximation, prediction, control
MULTILAYER FEED-FORWARD NETWORKS

Figure 7 shows a typical three-layer perceptron. In general, a standard L-layer feed-forward network (we adopt the convention that the input nodes are not counted as a layer) consists of an input stage, (L − 1) hidden layers, and an output layer of units successively connected (fully or locally) in a feed-forward fashion with no connections between units in the same layer and no feedback connections between layers.

Figure 7. A typical three-layer feed-forward network architecture.

Multilayer perceptron
The most popular class of multilayer feed-forward networks is multilayer perceptrons, in which each computational unit employs either the thresholding function or the sigmoid function. Multilayer perceptrons can form arbitrarily complex decision boundaries and represent any Boolean function.6 The development of the back-propagation learning algorithm for determining weights in a multilayer perceptron has made these networks the most popular among researchers and users of neural networks.

We denote w_ij^(l) as the weight on the connection from the ith unit in layer (l − 1) to the jth unit in layer l.

Let {(x^(1), d^(1)), (x^(2), d^(2)), …, (x^(p), d^(p))} be a set of p training patterns (input-output pairs), where x^(i) ∈ Rⁿ is the input vector in the n-dimensional pattern space, and d^(i) ∈ [0, 1]^m, an m-dimensional hypercube. For classification purposes, m is the number of classes. The squared-error cost function most frequently used in the ANN literature is defined as

    E = (1/2) Σ_{i=1}^{p} ||y^(i) − d^(i)||².     (2)

The back-propagation algorithm9 is a gradient-descent method to minimize the squared-error cost function in Equation 2 (see the "Back-propagation algorithm" sidebar).
Let(,d),(x2,d),...(x,d))be a set ofp Radial Basis Function network training patterns (input-output pairs),where xte R is The Radial Basis Function(RBF)network,3 which has the input vector in the n-dimensional pattern space,and two layers,isa specialclassof multilayer feed-forward net- March 1996 39
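To make the AND-of-hyperplanes construction concrete, here is a small hand-built sketch (our illustration, not from the article; the weights are hand-picked rather than learned) in which two first-layer threshold units each carve out a half plane and an output threshold unit ANDs them, realizing the Exclusive-OR decision region of Figure 8:

```python
# Hand-built two-layer threshold network illustrating Figure 8:
# the AND of two half planes yields the Exclusive-OR decision region.
# All weights are hand-picked for illustration, not learned.

def step(a):
    """Threshold activation: fires when the weighted sum is nonnegative."""
    return 1.0 if a >= 0 else 0.0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)    # hyperplane 1: keeps the half plane x1 + x2 >= 0.5
    h2 = step(1.5 - x1 - x2)    # hyperplane 2: keeps the half plane x1 + x2 <= 1.5
    return step(h1 + h2 - 1.5)  # output unit: AND of the two half planes

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, int(xor_net(*x)))  # prints 0, 1, 1, 0
```

A problem with several such convex regions would add more second-layer AND units and an output layer that ORs their regions together, as the text describes.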
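As a minimal sketch of back-propagation itself (again our illustration; the architecture, learning rate, and random seed are arbitrary choices), the following NumPy code trains a one-hidden-layer sigmoid perceptron on the Exclusive-OR problem by gradient descent on the squared-error cost of Equation 2:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Exclusive-OR training patterns: inputs x(i) in R^2, targets d(i) in [0, 1].
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
D = np.array([[0.], [1.], [1.], [0.]])

# Small random weights and zero biases for the hidden and output layers.
W1 = rng.normal(scale=0.5, size=(2, 2)); b1 = np.zeros(2)
W2 = rng.normal(scale=0.5, size=(2, 1)); b2 = np.zeros(1)

eta = 0.5  # learning rate
for epoch in range(20000):
    # Forward pass.
    H = sigmoid(X @ W1 + b1)   # hidden-layer outputs
    Y = sigmoid(H @ W2 + b2)   # network outputs y(i)

    # Backward pass: error terms from the chain rule on E = 1/2 sum ||y - d||^2.
    delta2 = (Y - D) * Y * (1 - Y)          # output layer
    delta1 = (delta2 @ W2.T) * H * (1 - H)  # hidden layer

    # Gradient-descent weight updates.
    W2 -= eta * H.T @ delta2; b2 -= eta * delta2.sum(axis=0)
    W1 -= eta * X.T @ delta1; b1 -= eta * delta1.sum(axis=0)

# Gradient descent can stall in a local minimum on an unlucky
# initialization; rerunning with another seed usually succeeds.
print(np.round(Y.ravel(), 2))  # should approach [0, 1, 1, 0]
```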
Radial Basis Function network

The Radial Basis Function (RBF) network,3 which has two layers, is a special class of multilayer feed-forward networks. Each unit in the hidden layer employs a radial basis function, such as a Gaussian kernel, as the activation function. The radial basis function (or kernel function) is centered at the point specified by the weight vector associated with the unit. Both the positions and the widths of these kernels must be learned from training patterns. There are usually many fewer kernels in the RBF network than there are training patterns. Each output unit implements a linear combination of these radial basis functions. From the point of view of function approximation, the hidden units provide a set of functions that constitute a basis set for representing input patterns in the space spanned by the hidden units.

There are a variety of learning algorithms for the RBF network.3 The basic one employs a two-step learning strategy, or hybrid learning. It estimates kernel positions and kernel widths using an unsupervised clustering algorithm, followed by a supervised least mean square (LMS) algorithm to determine the connection weights between the hidden layer and the output layer. Because the output units are linear, a noniterative algorithm can be used. After this initial solution is obtained, a supervised gradient-based algorithm can be used to refine the network parameters.

This hybrid learning algorithm for training the RBF network converges much faster than the back-propagation algorithm for training multilayer perceptrons. However, for many problems, the RBF network often involves a larger number of hidden units. This implies that the runtime (after training) speed of the RBF network is often slower than the runtime speed of a multilayer perceptron. The efficiencies (error versus network size) of the RBF network and the multilayer perceptron are, however, problem-dependent. It has been shown that the RBF network has the same asymptotic approximation power as a multilayer perceptron.
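The following NumPy sketch illustrates the two-step hybrid strategy under specific assumptions of ours: the article says only "an unsupervised clustering algorithm," for which we substitute a few k-means iterations; a single shared kernel width is set by a spacing heuristic; and the linear output weights are solved in one shot by least squares, a noniterative stand-in for the LMS step. The refinement pass by a gradient-based algorithm is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    """Step 1: place kernel centers with a few Lloyd iterations."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def train_rbf(X, D, k):
    centers = kmeans(X, k)
    # Shared Gaussian width from the average inter-center spacing (a heuristic).
    dists = np.linalg.norm(centers[:, None] - centers, axis=-1)
    sigma = dists[dists > 0].mean() / np.sqrt(2.0)
    # Hidden-layer activations: one Gaussian kernel per center.
    Phi = np.exp(-((X[:, None] - centers) ** 2).sum(-1) / (2 * sigma**2))
    # Step 2: linear output layer solved noniteratively by least squares.
    W, *_ = np.linalg.lstsq(Phi, D, rcond=None)
    return centers, sigma, W

def rbf_predict(X, centers, sigma, W):
    Phi = np.exp(-((X[:, None] - centers) ** 2).sum(-1) / (2 * sigma**2))
    return Phi @ W

# Toy usage: approximate a 1D sine curve with 8 kernels.
X = np.linspace(0, 2 * np.pi, 50)[:, None]
D = np.sin(X)
params = train_rbf(X, D, k=8)
print(np.abs(rbf_predict(X, *params) - D).max())  # residual should be small
```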
Issues

There are many issues in designing feed-forward networks, including

• how many layers are needed for a given task,
• how many units are needed per layer,
• how the network will perform on data not included in the training set (generalization ability), and
• how large the training set should be for "good" generalization.

Although multilayer feed-forward networks using back-propagation have been widely employed for classification and function approximation,2 many design parameters still must be determined by trial and error. Existing theoretical results provide only very loose guidelines for selecting these parameters in practice.

KOHONEN'S SELF-ORGANIZING MAPS

The self-organizing map (SOM)16 has the desirable property of topology preservation, which captures an important aspect of the feature maps in the cortex of highly developed animal brains. In a topology-preserving mapping, nearby input patterns should activate nearby output units on the map. Figure 4 shows the basic network architecture of Kohonen's SOM. It basically consists of a two-dimensional array of units, each connected to all n input nodes. Let w_ij denote the n-dimensional vector associated with the unit at location (i, j) of the 2D array. Each neuron computes the Euclidean distance between the input vector x and the stored weight vector w_ij.

This SOM is a special type of competitive learning network that defines a spatial neighborhood for each output unit. The shape of the local neighborhood can be square, rectangular, or circular. Initial neighborhood size is often set to one half to two thirds of the network size and shrinks over time according to a schedule (for example, an exponentially decreasing function). During competitive learning, all the weight vectors associated with the winner and its neighboring units are updated (see the "SOM learning algorithm" sidebar).

SOM learning algorithm (sidebar)

1. Initialize weights to small random numbers; set initial learning rate and neighborhood.
2. Present a pattern x, and evaluate the network outputs.
3. Select the winning unit (c) with the minimum output:

       \| x - w_c \| = \min_{(i,j)} \| x - w_{ij} \|

4. Update all weights according to the following learning rule:

       w_{ij}(t+1) = \begin{cases} w_{ij}(t) + \alpha(t)\,[x(t) - w_{ij}(t)], & \text{if } (i,j) \in N_c(t) \\ w_{ij}(t), & \text{otherwise,} \end{cases}

   where N_c(t) is the neighborhood of the unit (c) at time t, and α(t) is the learning rate.
5. Decrease the value of α(t) and shrink the neighborhood N_c(t).
6. Repeat steps 2 through 5 until the change in weight values is less than a prespecified threshold or a maximum number of iterations is reached.

Kohonen's SOM can be used for projection of multivariate data, density approximation, and clustering. It has been successfully applied in the areas of speech recognition, image processing, robotics, and process control.2 The design parameters include the dimensionality of the neuron array, the number of neurons in each dimension, the shape of the neighborhood, the shrinking schedule of the neighborhood, and the learning rate.
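The sidebar's steps translate almost directly into NumPy. The sketch below is our rendering, not the article's: the square neighborhood and the linearly decaying schedules for α(t) and the neighborhood radius are illustrative choices where the sidebar leaves the schedule open.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(X, rows=10, cols=10, iters=5000, alpha0=0.5):
    n = X.shape[1]
    W = rng.random((rows, cols, n))          # step 1: small random weights
    radius0 = max(rows, cols) / 2.0          # initial neighborhood size
    for t in range(iters):
        x = X[rng.integers(len(X))]          # step 2: present a pattern
        d = np.linalg.norm(W - x, axis=2)    # Euclidean distance at every unit
        ci, cj = np.unravel_index(d.argmin(), d.shape)  # step 3: winner (c)
        frac = 1.0 - t / iters
        alpha = alpha0 * frac                # step 5: decaying learning rate...
        radius = max(1.0, radius0 * frac)    # ...and shrinking neighborhood
        # Step 4: update the winner and its square neighborhood N_c(t).
        ii, jj = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
        in_hood = (np.abs(ii - ci) <= radius) & (np.abs(jj - cj) <= radius)
        W[in_hood] += alpha * (x - W[in_hood])
    return W

# Toy usage: map 2D points onto a 10 x 10 grid; after training, nearby
# inputs activate nearby units (topology preservation).
X = rng.random((1000, 2))
W = train_som(X)
print(W.shape)  # (10, 10, 2)
```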
ADAPTIVE RESONANCE THEORY MODELS

Recall that the stability-plasticity dilemma is an important issue in competitive learning. How do we learn new things (plasticity) and yet retain the stability to ensure that existing knowledge is not erased or corrupted? Carpenter and Grossberg's Adaptive Resonance Theory models (ART1, ART2, and ARTMap) were developed in an attempt to overcome this dilemma. The network has a sufficient supply of output units, but they are not used until deemed necessary. A unit is said to be committed (uncommitted) if it is (is not) being used. The learning algorithm updates