Principe,J C."Artificial Neural Networks The Electrical Engineering Handbook Ed. Richard C. Dorf Boca raton crc Press llc. 2000
Principe, J.C. “Artificial Neural Networks.” The Electrical Engineering Handbook, Ed. Richard C. Dorf. Boca Raton: CRC Press LLC, 2000.
20 Artificial Neural Networks

Jose C. Principe, University of Florida

20.1 Definitions and Scope
     Introduction • Definitions and Style of Computation • ANN Types and Applications
20.2 Multilayer Perceptrons
     Function of Each PE • How to Train MLPs • Applying Back-Propagation in Practice • A Posteriori Probabilities
20.3 Radial Basis Function Networks
20.4 Time-Lagged Networks
     Memory Structures • Training-Focused TLN Architectures
20.5 Hebbian Learning and Principal Component Analysis Networks
     Hebbian Learning • Principal Component Analysis • Associative Memories
20.6 Competitive Learning and Kohonen Networks

20.1 Definitions and Scope

Introduction

Artificial neural networks (ANN) are among the newest signal-processing technologies in the engineer’s toolbox. The field is highly interdisciplinary, but our approach will restrict the view to the engineering perspective. In engineering, neural networks serve two important functions: as pattern classifiers and as nonlinear adaptive filters. We will provide a brief overview of the theory, learning rules, and applications of the most important neural network models.

Definitions and Style of Computation

An ANN is an adaptive, most often nonlinear, system that learns to perform a function (an input/output map) from data. Adaptive means that the system parameters are changed during operation, normally called the training phase. After the training phase the ANN parameters are fixed and the system is deployed to solve the problem at hand (the testing phase). The ANN is built with a systematic step-by-step procedure that optimizes a performance criterion or follows some implicit internal constraint, which is commonly referred to as the learning rule. The input/output training data are fundamental in neural network technology, because they convey the necessary information to “discover” the optimal operating point. The nonlinear nature of the neural network processing elements (PEs) gives the system great flexibility to achieve practically any desired input/output map, i.e., some ANNs are universal mappers.

There is a style in neural computation that is worth describing (Fig. 20.1). An input is presented to the network and a corresponding desired or target response is set at the output (when this is the case the training is called supervised). An error is composed from the difference between the desired response and the system output.
This error information is fed back to the system, which adjusts its parameters in a systematic fashion (the learning rule). The process is repeated until the performance is acceptable. It is clear from this description that the performance hinges heavily on the data. If one does not have data that cover a significant portion of the operating conditions, or if the data are noisy, then neural network technology is probably not the right solution. On the other hand, if there is plenty of data but the problem is too poorly understood to derive an approximate model, then neural network technology is a good choice.

FIGURE 20.1 The style of neural computation.

This operating procedure should be contrasted with traditional engineering design, made of exhaustive subsystem specifications and intercommunication protocols. In ANNs, the designer chooses the network topology, the performance function, the learning rule, and the criterion to stop the training phase, but the system adjusts the parameters automatically. Consequently, it is difficult to bring a priori information into the design, and when the system does not work properly it is also hard to refine the solution incrementally. But ANN-based solutions are extremely efficient in terms of development time and resources, and in many difficult problems ANNs provide performance that is difficult to match with other technologies. Denker said 10 years ago that “ANNs are the second best way to implement a solution,” motivated by the simplicity of their design and by their universality, shadowed only by the traditional design obtained by studying the physics of the problem. At present, ANNs are emerging as the technology of choice for many applications, such as pattern recognition, prediction, system identification, and control.

ANN Types and Applications

It is always risky to establish a taxonomy of a technology, but our motivation is to provide a quick overview of the application areas and the most popular topologies and learning paradigms.

| Application | Topology | Supervised Learning | Unsupervised Learning |
|---|---|---|---|
| Association | Hopfield [Zurada, 1992; Haykin, 1994] | — | Hebbian [Zurada, 1992; Haykin, 1994; Kung, 1993] |
| | Multilayer perceptron [Zurada, 1992; Haykin, 1994; Bishop, 1995] | Back-propagation [Zurada, 1992; Haykin, 1994; Bishop, 1995] | — |
| | Linear associative mem. [Zurada, 1992; Haykin, 1994] | — | Hebbian |
| Pattern recognition | Multilayer perceptron [Zurada, 1992; Haykin, 1994; Bishop, 1995] | Back-propagation | — |
| | Radial basis functions [Zurada, 1992; Bishop, 1995] | Least mean square | k-means [Bishop, 1995] |
| Feature extraction | Competitive [Zurada, 1992; Haykin, 1994] | — | Competitive |
| | Kohonen [Zurada, 1992; Haykin, 1994] | — | Kohonen |
| | Multilayer perceptron [Kung, 1993] | Back-propagation | — |
| | Principal comp. anal. [Zurada, 1992; Kung, 1993] | — | Oja’s [Zurada, 1992; Kung, 1993] |
| Prediction, system ID | Time-lagged networks [Zurada, 1992; Kung, 1993; de Vries and Principe, 1992] | Back-propagation through time [Zurada, 1992] | — |
| | Fully recurrent nets [Zurada, 1992] | Back-propagation through time [Zurada, 1992] | — |
FIGURE 20.2 MLP with one hidden layer (d-k-m).

FIGURE 20.3 A PE and the most common nonlinearities.

It is clear that multilayer perceptrons (MLPs) and the back-propagation algorithm, together with their respective extensions, time-lagged networks (TLN) and back-propagation through time (BPTT), hold a prominent position in ANN technology. It is therefore only natural to spend most of our overview presenting the theory and tools of back-propagation learning. It is also important to notice that Hebbian learning (and its extension, the Oja rule) is a very useful (and biologically plausible) learning mechanism. It is an unsupervised learning method, since there is no need to specify the desired or target response to the ANN.

20.2 Multilayer Perceptrons

Multilayer perceptrons are a layered arrangement of nonlinear PEs, as shown in Fig. 20.2. The layer that receives the input is called the input layer, and the layer that produces the output is the output layer. The layers that do not have direct access to the external world are called hidden layers. A layered network with just the input and output layers is called the perceptron. Each connection between PEs is weighted by a scalar, w_i, called a weight, which is adapted during learning.

The PEs in the MLP are composed of an adder followed by a smooth saturating nonlinearity of the sigmoid type (Fig. 20.3). The most common saturating nonlinearities are the logistic function and the hyperbolic tangent. The threshold nonlinearity is used in other nets. The importance of the MLP is that it is a universal mapper (it implements arbitrary input/output maps) when the topology has at least two hidden layers and a sufficient number of PEs [Haykin, 1994]. Even MLPs with a single hidden layer are able to approximate continuous input/output maps. This means that we will rarely need to choose topologies with more than two hidden layers. But these are existence proofs, so the issue that we must solve as engineers is to choose how many layers, and how many PEs in each layer, are required to produce good results.

Many problems in engineering can be thought of in terms of a transformation of an input space, containing the input, to an output space where the desired response exists. For instance, dividing data into classes can be thought of as transforming the input into 0 and 1 responses that will code the classes [Bishop, 1995]. Likewise, identification of an unknown system can also be framed as a mapping (function approximation) from the input to the system output [Kung, 1993]. The MLP is highly recommended for these applications.
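To make the PE of Fig. 20.3 concrete, here is a minimal sketch in Python with NumPy. It is our illustration, not from the handbook; the function name and the example numbers are arbitrary.

```python
import numpy as np

def pe_output(x, w, b, nonlinearity="tanh"):
    """One processing element: net = w.x + b, then a saturating nonlinearity."""
    net = np.dot(w, x) + b                    # the adder of Fig. 20.3
    if nonlinearity == "tanh":                # hyperbolic tangent, output in (-1, 1)
        return np.tanh(net)
    elif nonlinearity == "logistic":          # logistic function, output in (0, 1)
        return 1.0 / (1.0 + np.exp(-net))
    elif nonlinearity == "threshold":         # hard limiter, used in other nets
        return 1.0 if net > 0 else -1.0

# Example: a two-input PE with illustrative weights and bias
x = np.array([0.5, -1.0])
w = np.array([1.0, 2.0])
print(pe_output(x, w, b=0.1))                 # tanh(0.5*1 + (-1)*2 + 0.1)
```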
FIGURE 20.4 A two-input PE and its separation surface.

Function of Each PE

Let us study briefly the function of a single PE with two inputs [Zurada, 1992]. If the nonlinearity is the threshold nonlinearity, we can immediately see that the output takes only the values 1 and -1. The surface that divides these two subspaces is called a separation surface, and in this case it is the line of equation

    y(x_1, x_2) = w_1 x_1 + w_2 x_2 + b = 0    (20.1)

i.e., the PE weights and the bias control the orientation and position of the separation line, respectively (Fig. 20.4). In many dimensions the separation surface becomes a hyperplane of dimension one less than the dimensionality of the input space. So, each PE creates a dichotomy in the input space. For smooth nonlinearities the separation surface is not crisp; it becomes fuzzy, but the same principles apply. In this case, the size of the weights controls the width of the fuzzy boundary (larger weights shrink the fuzzy boundary).

The perceptron input/output map is built from a juxtaposition of linear separation surfaces, so the perceptron achieves zero classification error only for linearly separable classes (i.e., classes that can be exactly classified by hyperplanes).

When one adds a layer to the perceptron, creating a one-hidden-layer MLP, the type of separation surfaces changes drastically. It can be shown that this learning machine is able to create “bumps” in the input space, i.e., an area of high response surrounded by low responses [Zurada, 1992]. The function of each PE is always the same, whether the PE is part of a perceptron or of an MLP. However, notice that the output layer in the MLP works with the results of the hidden-layer activations, creating an embedding of functions and producing more complex separation surfaces. The one-hidden-layer MLP is able to produce nonlinear separation surfaces.

If one adds an extra layer (i.e., two hidden layers), the learning machine can now combine bumps at will, which is why it can be interpreted as a universal mapper, since there is evidence that any function can be approximated by localized bumps. One important aspect to remember is that changing a single weight in the MLP can drastically change the location of the separation surfaces; i.e., the MLP achieves the input/output map through the interplay of all its weights.

How to Train MLPs

One fundamental issue is how to adapt the weights w_i of the MLP to achieve a given input/output map. The core ideas have been around for many years in optimization, and they are extensions of well-known engineering principles, such as the least mean square (LMS) algorithm of adaptive filtering [Haykin, 1994]. Let us review the theory here. Assume that we have a linear PE (f(net) = net) and that we want to adapt the weights so as to minimize the square difference between the desired signal and the PE response (Fig. 20.5).

This problem has an analytical solution known as least squares [Haykin, 1994]. The optimal weights are obtained as the product of the inverse of the input autocorrelation matrix (R^{-1}) and the cross-correlation vector (P) between the input and the desired response. The analytical solution is equivalent to a search for the minimum of the quadratic performance surface J(w_i) using gradient descent.
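The analytic solution is easy to state in code. The following sketch (ours, with made-up data) computes the least-squares weights R^{-1}P and shows that batch gradient descent on the quadratic surface converges to the same point:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))           # inputs, one row per sample
d = X @ np.array([2.0, -1.0]) + 0.05 * rng.standard_normal(500)

# Analytic solution: inverse autocorrelation times cross-correlation
R = X.T @ X / len(X)                            # input autocorrelation matrix
P = X.T @ d / len(X)                            # input/desired cross-correlation vector
w_opt = np.linalg.solve(R, P)                   # w* = R^-1 P

# Gradient descent on J(w) = 1/2 E[(d - w.x)^2] reaches the same minimum
w, eta = np.zeros(2), 0.1
for _ in range(500):
    grad = R @ w - P                            # gradient of the quadratic surface
    w -= eta * grad
print(w_opt, w)                                 # the two solutions agree
```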
FIGURE 20.5 Computing analytically optimal weights for the linear PE.

In the gradient search, the weights at each iteration k are adjusted by

    w_i(k+1) = w_i(k) - \eta \nabla J(k)    (20.2)

where η is a small constant called the step size and ∇J(k) is the gradient of the performance surface at iteration k. Bernard Widrow in the late 1960s proposed a very efficient estimate to compute the gradient at each iteration,

    \tilde{\nabla} J(k) = \frac{\partial}{\partial w_i(k)} \left( \frac{1}{2} e^2(k) \right) = -e(k)\, x_i(k)    (20.3)

which, when substituted into Eq. (20.2), produces the so-called LMS algorithm. He showed that the LMS converges to the analytic solution provided the step size η is small enough. Since it is a steepest-descent procedure, the largest step size is limited by the inverse of the largest eigenvalue of the input autocorrelation matrix. The larger the step size (below this limit), the faster the convergence, but the final values will “rattle” around the optimal value in a basin whose radius is proportional to the step size. Hence, there is a fundamental trade-off between speed of convergence and accuracy in the final weight values. One great appeal of the LMS algorithm is that it is very efficient (just one multiplication per weight) and requires only local quantities to be computed.

The LMS algorithm can be framed as a computation of partial derivatives of the cost with respect to the unknowns, i.e., the weight values. In fact, applying the chain rule to the linear PE one writes

    \frac{\partial J}{\partial w_i} = \frac{\partial J}{\partial y} \frac{\partial y}{\partial w_i} = -(d - y) \frac{\partial}{\partial w_i} \left( \sum_i w_i x_i \right) = -e\, x_i    (20.4)

and obtains the LMS algorithm for the linear PE.
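The sample-by-sample LMS rule of Eqs. (20.2)–(20.4) is short enough to state in full. This is a sketch of our own, with illustrative data and step size:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: d is a noisy linear function of x
X = rng.uniform(-1, 1, size=(200, 2))
d = X @ np.array([2.0, -1.0]) + 0.05 * rng.standard_normal(200)

w = np.zeros(2)                 # weights of the linear PE
eta = 0.1                       # step size, small enough to converge

for epoch in range(20):
    for x_k, d_k in zip(X, d):
        y_k = w @ x_k           # linear PE: f(net) = net
        e_k = d_k - y_k         # instantaneous error
        w += eta * e_k * x_k    # Eq. (20.2) with the gradient estimate of Eq. (20.3)

print(w)                        # approaches the least-squares solution R^-1 P
```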
FIGURE 20.7 How to adapt the weights connected to the ith PE.

What happens if the PE is nonlinear? If the nonlinearity is differentiable (smooth), we can still apply the same method, because the chain rule prescribes that (Fig. 20.6)

    \frac{\partial J}{\partial w_i} = \frac{\partial J}{\partial y} \frac{\partial y}{\partial \mathrm{net}} \frac{\partial \mathrm{net}}{\partial w_i} = -(d - y)\, f'(\mathrm{net})\, x_i = -e\, f'(\mathrm{net})\, x_i    (20.5)

where f'(net) is the derivative of the nonlinearity computed at the operating point. Equation (20.5) is known as the delta rule, and it will train the perceptron [Haykin, 1994]. Note that throughout the derivation we skipped the pattern index p for simplicity, but this rule is applied for each input pattern. However, the delta rule cannot train MLPs, since it requires knowledge of the error signal at each PE.

FIGURE 20.6 How to extend LMS to nonlinear PEs with the chain rule.

The principle of ordered derivatives can be extended to multilayer networks, provided we organize the computations in flows of activation and error propagation. The principle is very easy to understand, but a little complex to formulate in equation form [Haykin, 1994].

Suppose that we want to adapt the weights connected to a hidden-layer PE, the ith PE (Fig. 20.7). One can decompose the computation of the partial derivative of the cost with respect to the weight w_ij as

    \frac{\partial J}{\partial w_{ij}} = \frac{\partial J}{\partial y_i} \frac{\partial y_i}{\partial \mathrm{net}_i} \frac{\partial \mathrm{net}_i}{\partial w_{ij}}    (20.6)

i.e., the partial derivative with respect to the weight is the product of the partial derivative with respect to the PE state (part 1 in Eq. (20.6)) and the partial derivative of the local activation with respect to the weights (part 2 in Eq. (20.6)). This last quantity is exactly the same as for the nonlinear PE (f'(net_i) x_j), so the big issue is the computation of ∂J/∂y_i. For an output PE, ∂J/∂y becomes the injected error e in Eq. (20.4). For the hidden ith PE, ∂J/∂y_i is evaluated by summing all the errors that reach the PE from the top layer through the topology, when the injected errors e_k are clamped at the top layer; in an equation,

    \frac{\partial J}{\partial y_i} = \sum_k \frac{\partial J}{\partial y_k} \frac{\partial y_k}{\partial \mathrm{net}_k} \frac{\partial \mathrm{net}_k}{\partial y_i} = \sum_k e_k\, f'(\mathrm{net}_k)\, w_{ki}    (20.7)

Substituting back into Eq. (20.6), we finally get

    \frac{\partial J}{\partial w_{ij}} = -x_j\, f'(\mathrm{net}_i) \left( \sum_k e_k\, f'(\mathrm{net}_k)\, w_{ki} \right)    (20.8)
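A compact sketch of Eqs. (20.5)–(20.8) for a one-hidden-layer MLP with tanh hidden PEs and a linear output PE. This is our own illustration; the layer sizes, learning rate, and example task are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Topology: 2 inputs, 3 hidden tanh PEs, 1 linear output PE
W1 = 0.5 * rng.standard_normal((3, 2)); b1 = np.zeros(3)
W2 = 0.5 * rng.standard_normal((1, 3)); b2 = np.zeros(1)
eta = 0.05

def train_step(x, d):
    global W1, b1, W2, b2
    # Forward flow of activations
    net1 = W1 @ x + b1
    y1 = np.tanh(net1)                        # hidden-layer activations
    y2 = W2 @ y1 + b2                         # linear output PE
    # Backward flow of errors
    e = d - y2                                # external (injected) error
    delta2 = e                                # f'(net) = 1 for the linear output
    delta1 = (W2.T @ delta2) * (1.0 - y1**2)  # Eq. (20.7): summed errors times f'(net)
    # Gradient-descent updates, Eqs. (20.2) and (20.8)
    W2 += eta * np.outer(delta2, y1); b2 += eta * delta2
    W1 += eta * np.outer(delta1, x);  b1 += eta * delta1

# Example: a map the perceptron cannot learn (XOR-like, inputs in {-1, 1})
patterns = [(np.array([a, b]), np.array([a * b])) for a in (-1.0, 1.0) for b in (-1.0, 1.0)]
for _ in range(3000):
    for x, d in patterns:
        train_step(x, d)
print([(W2 @ np.tanh(W1 @ x + b1) + b2).item() for x, _ in patterns])
```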
This equation embodies the back-propagation training algorithm [Haykin, 1994; Bishop, 1995]. It can be rewritten as the product of a local activation (part 1) and a local error (part 2), exactly as in the LMS and delta rules. But now the local error is a composition of errors that flow through the topology, which becomes equivalent to the existence of a desired response at the PE.

There is an intrinsic flow in the implementation of the back-propagation algorithm: first, inputs are applied to the net and activations are computed everywhere to yield the output activation. Second, the external errors are computed by subtracting the net output from the desired response. Third, these external errors are utilized in Eq. (20.8) to compute the local errors for the layer immediately preceding the output layer, and the computations are chained up to the input layer. Once all the local errors are available, Eq. (20.2) can be used to update every weight. These three steps are then repeated for the other training patterns until the error is acceptable.

Step three is equivalent to injecting the external errors into the dual topology and back-propagating them up to the input layer [Haykin, 1994]. The dual topology is obtained from the original one by reversing the data flow and substituting summing junctions by splitting nodes and vice versa. The error at each PE of the dual topology is then multiplied by the activation of the original network to compute the weight updates. So, effectively, the dual topology is being used to compute the local errors, which makes the procedure highly efficient. This is the reason back-propagation trains a network of N weights with a number of multiplications proportional to N, i.e., O(N), instead of the O(N^2) of previous methods for computing partial derivatives known in control theory. Using the dual topology to implement back-propagation is the best and most general method to program the algorithm in a digital computer.

Applying Back-Propagation in Practice

Now that we know an algorithm to train MLPs, let us see what the practical issues are in applying it. We will address the following aspects: size of training set vs. number of weights, search procedures, how to stop training, and how to set the topology for maximum generalization.

Size of Training Set

The size of the training set is very important for good performance. Remember that the ANN gets its information from the training set. If the training data do not cover the full range of operating conditions, the system may perform badly when deployed. Under no circumstances should the training set be smaller than the number of weights in the ANN. A good size for the training data is ten times the number of weights in the network, with the lower limit being set around three times the number of weights (these values should be taken as an indication, subject to experimentation in each case) [Haykin, 1994].

Search Procedures

Searching along the direction of the gradient is fine if the performance surface is quadratic. However, in ANNs this is rarely the case, because of the use of nonlinear PEs and topologies with several layers. So gradient descent can be caught in local minima, and the search becomes very slow in regions of small curvature. One efficient way to speed up the search in regions of small curvature and, at the same time, to stabilize it in narrow valleys is to include a momentum term in the weight adaptation:

    w_{ij}(n+1) = w_{ij}(n) + \eta\, \delta_i(n)\, x_j(n) + \alpha \left( w_{ij}(n) - w_{ij}(n-1) \right)    (20.9)

The value of the momentum α should be set experimentally between 0.5 and 0.9.
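Eq. (20.9) changes only the update line of the training loop. A sketch of ours; the value 0.9 is just one choice within the range given above, and δ and x stand for the local error and input produced by back-propagation:

```python
import numpy as np

eta, alpha = 0.05, 0.9          # step size and momentum

def momentum_update(w, w_prev, delta, x):
    """One application of Eq. (20.9); returns (new weights, current weights)."""
    w_new = w + eta * delta * x + alpha * (w - w_prev)
    return w_new, w

# Usage inside a training loop (illustrative values for delta and x):
w = np.zeros(3); w_prev = w.copy()
delta, x = 0.2, np.array([1.0, -0.5, 0.3])
w, w_prev = momentum_update(w, w_prev, delta, x)
```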
There are many more modifi- cations to the conventional gradient search, such as adaptive step sizes, annealed noise, conjugate gradients, and second-order methods (using information contained in the Hessian matrix), but the simplicity and power of momentum learning is hard to beat [Haykin, 1994; Bishop, 1995]. How to Stop Training The stop criterion is a fundamental aspect of training. The simple ideas of capping the number of iterations or of letting the system train until a predetermined error value are not recommended. The reason is that we want the ANN to perform well in the test set data; i.e., we would like the system to perform well in data it w n w n nx n w n w n ij ij j ij ij ( + 1 1 ) = ( ) + hd a ( ) ( ) + ( ( ) - - ( ))
The error in the training set tends to decrease with iteration when the ANN has enough degrees of freedom to represent the input/output map. However, the system may be remembering the training patterns (overfitting) instead of finding the underlying mapping rule. This is called overtraining. To avoid overtraining, the performance in a validation set, i.e., a set of input data that the system has never seen before, must be checked regularly during training (e.g., once every 50 passes over the training set). The training should be stopped when the performance in the validation set starts to increase, despite the fact that the performance in the training set continues to decrease. This method is called cross-validation. The validation set should be 10% of the training set, and distinct from it.

Size of the Topology

The size of the topology should also be carefully selected. If the number of layers or the size of each layer is too small, the network does not have enough degrees of freedom to classify the data or to approximate the function, and the performance suffers. On the other hand, if the size of the network is too large, performance may also suffer. This is the phenomenon of overfitting that we mentioned above; one alternative way to control it is to reduce the size of the network.

There are basically two procedures to set the size of the network: either one starts small and adds new PEs, or one starts with a large network and prunes PEs [Haykin, 1994]. One quick way to prune the network is to impose a penalty term on the performance function (a regularizing term), such as limiting the slope of the input/output map [Bishop, 1995]. A regularization term that can be implemented locally is

    w_{ij}(n+1) = w_{ij}(n) \left( 1 - \frac{\lambda}{1 + w_{ij}^2(n)} \right) + \eta\, \delta_i(n)\, x_j(n)    (20.10)

where λ is the weight decay parameter and δ the local error. Weight decay tends to drive unimportant weights to zero.

A Posteriori Probabilities

We will finish the discussion of the MLP by noting that this topology, when trained with the mean square error, is able to estimate directly at its outputs a posteriori probabilities, i.e., the probability that a given input pattern belongs to a given class [Bishop, 1995]. This property is very useful because the MLP outputs can be interpreted as probabilities and operated on as numbers. In order to guarantee this property, one has to make sure that each class is attributed to one output PE, that the topology is sufficiently large to represent the mapping, that the training has converged to the absolute minimum, and that the outputs are normalized between 0 and 1. The first requirements are met by good design, while the last can be easily enforced if the softmax activation is used as the output PE [Bishop, 1995]:

    y_j = \frac{\exp(\mathrm{net}_j)}{\sum_k \exp(\mathrm{net}_k)}    (20.11)
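Both Eq. (20.10) and Eq. (20.11) are one-liners in practice. A sketch of ours; the decay parameter and step size are illustrative, and the max-subtraction in the softmax is a standard numerical-stability trick, not part of Eq. (20.11):

```python
import numpy as np

def softmax(net):
    """Eq. (20.11): outputs sum to 1 and can be read as a posteriori probabilities."""
    z = np.exp(net - net.max())        # subtracting the max avoids overflow
    return z / z.sum()

def weight_decay_update(w, delta, x, eta=0.05, lam=1e-4):
    """Eq. (20.10): the decay factor drives unimportant weights toward zero."""
    return w * (1.0 - lam / (1.0 + w**2)) + eta * delta * x

print(softmax(np.array([2.0, 1.0, 0.1])))   # three outputs summing to 1
```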
20.3 Radial Basis Function Networks

The radial basis function (RBF) network constitutes another way of implementing arbitrary input/output mappings. The most significant difference between the MLP and the RBF lies in the PE nonlinearity. While the PE in the MLP responds over the full input space, the PE in the RBF is local, normally a Gaussian kernel in the input space. Hence, it responds only to inputs that are close to its center; i.e., it has basically a local response.

FIGURE 20.8 Radial basis function (RBF) network.

The RBF network is also a layered net, with the hidden layer built from Gaussian kernels and a linear (or nonlinear) output layer (Fig. 20.8). Training of the RBF network is normally done in two stages [Haykin, 1994]: first, the centers x_i are adaptively placed in the input space using competitive learning or k-means clustering [Bishop, 1995], which are unsupervised procedures. Competitive learning is explained later in the chapter. The variance of each Gaussian is chosen as a percentage (30 to 50%) of the distance to the nearest center. The goal is to cover the input data distribution adequately. Once the RBF layer is located, the second-layer weights w_i are trained using the LMS procedure.

RBF networks are easy to work with, they train very fast, and they have shown good properties both for function approximation and for classification. The problem is that they require lots of Gaussian kernels in high-dimensional spaces.

20.4 Time-Lagged Networks

The MLP is the most common neural network topology, but it can handle only instantaneous information, since the system has no memory and is feedforward. In engineering, the processing of signals that exist in time requires systems with memory, i.e., linear filters. An alternative way to implement memory is to use feedback, which gives rise to recurrent networks. Fully recurrent networks are difficult to train and to stabilize, so it is preferable to develop topologies based on MLPs in which explicit subsystems that store the past information are included. These subsystems are called short-term memory structures [de Vries and Principe, 1992]. The combination of an MLP with short-term memory structures is called a time-lagged network (TLN). The memory structures can eventually be recurrent, but the feedback is local, so stability is still easy to guarantee. Here, we will cover just one TLN topology, called focused, where the memory is at the input layer. The most general TLNs have memory added anywhere in the network, but they require other, more involved training strategies (BPTT [Haykin, 1994]). The interested reader is referred to de Vries and Principe [1992] for further details. The function of the short-term memory in the focused TLN is to represent the past of the input signal, while the nonlinear PEs provide the mapping, as in the MLP (Fig. 20.9).

Memory Structures

The simplest memory structure is built from a tap delay line (Fig. 20.10). The memory by delays is a single-input, multiple-output system that has no free parameters except its size K. The tap delay memory is the memory utilized in the time-delay neural network (TDNN), which has been used successfully in speech recognition and system identification [Kung, 1993].

A different mechanism for linear memory is feedback (Fig. 20.11). Feedback allows the system to remember past events because of the exponential decay of the response. This memory has limited resolution, because of the low-pass filtering required for long memories. But notice that, unlike the memory by delay, memory by feedback provides the learning system with a free parameter μ that controls the length of the memory. Memory by feedback has been used in Elman and Jordan networks [Haykin, 1994].
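A sketch of the two memory structures, written by us for illustration. The leaky-integrator form used for the feedback memory, y(n) = μ·y(n-1) + (1-μ)·x(n), is one common choice consistent with the exponential decay described above, not a formula taken from the handbook:

```python
import numpy as np

def tap_delay_memory(x, K):
    """Memory by delays: at time n, output x(n), x(n-1), ..., x(n-K).
    No free parameters other than the size K."""
    taps, buf = [], np.zeros(K + 1)
    for x_n in x:
        buf = np.roll(buf, 1)       # shift the delay line by one sample
        buf[0] = x_n
        taps.append(buf.copy())
    return np.array(taps)

def feedback_memory(x, mu=0.7):
    """Memory by feedback (leaky integrator): past inputs decay exponentially.
    The free parameter mu controls the length of the memory."""
    y, y_prev = [], 0.0
    for x_n in x:
        y_prev = mu * y_prev + (1.0 - mu) * x_n
        y.append(y_prev)
    return np.array(y)

x = np.sin(np.linspace(0, 4 * np.pi, 50))    # illustrative input signal
print(tap_delay_memory(x, K=3).shape)        # (50, 4): x(n) plus 3 delayed versions
print(feedback_memory(x)[:5])
```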