Neural networks
Chapter 20, Section 5

Outline
♦ Brains
♦ Neural networks
♦ Perceptrons
♦ Multilayer perceptrons
♦ Applications of neural networks

Brains
$10^{11}$ neurons of > 20 types, $10^{14}$ synapses, 1ms–10ms cycle time
Signals are noisy “spike trains” of electrical potential
[Figure: neuron, labelled with axon, cell body or soma, nucleus, dendrite, synapses, axonal arborization, axon from another cell, synapse]

McCulloch–Pitts “unit”
Output is a “squashed” linear function of the inputs:
$$a_i \leftarrow g(\mathit{in}_i) = g\Big(\sum_j W_{j,i}\, a_j\Big)$$
[Figure: a single unit, showing input links $a_j$ with weights $W_{j,i}$, bias weight $W_{0,i}$ on the fixed input $a_0 = -1$, input function $\Sigma$ producing $\mathit{in}_i$, activation function $g$, output $a_i = g(\mathit{in}_i)$, and output links]
A gross oversimplification of real neurons, but its purpose is to develop understanding of what networks of simple units can do

Activation functions
[Figure: two plots of $g(\mathit{in}_i)$ against $\mathit{in}_i$, panels (a) and (b), each rising to +1]
(a) is a step function or threshold function
(b) is a sigmoid function $1/(1 + e^{-x})$
Changing the bias weight $W_{0,i}$ moves the threshold location

Implementing logical functions
AND: $W_0 = 1.5$, $W_1 = 1$, $W_2 = 1$
OR: $W_0 = 0.5$, $W_1 = 1$, $W_2 = 1$
NOT: $W_0 = -0.5$, $W_1 = -1$
McCulloch and Pitts: every Boolean function can be implemented
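The AND, OR, and NOT weights listed above can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: it assumes a step activation that outputs 1 when the weighted input is strictly positive, and the names `step`, `unit`, `AND_W`, `OR_W`, and `NOT_W` are hypothetical.

```python
def step(x):
    """Threshold activation (assumed convention: 1 if x > 0, else 0)."""
    return 1 if x > 0 else 0


def unit(weights, inputs):
    """McCulloch-Pitts unit.

    weights[0] is the bias weight W0, applied to the fixed input a0 = -1;
    the remaining weights pair up with the given inputs.
    """
    activations = [-1] + list(inputs)
    total = sum(w * a for w, a in zip(weights, activations))
    return step(total)


# Weights taken from the slide, in the order [W0, W1, W2] or [W0, W1].
AND_W = [1.5, 1, 1]
OR_W = [0.5, 1, 1]
NOT_W = [-0.5, -1]

if __name__ == "__main__":
    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, "AND:", unit(AND_W, (x1, x2)), "OR:", unit(OR_W, (x1, x2)))
    for x in (0, 1):
        print(x, "NOT:", unit(NOT_W, (x,)))
```

Because the bias input is fixed at −1, the bias weight acts as the threshold: AND fires only when the two inputs sum to more than 1.5, OR when they sum to more than 0.5.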

Network structures
Feed-forward networks:
– single-layer perceptrons
– multi-layer perceptrons
Feed-forward networks implement functions, have no internal state
Recurrent networks:
– Hopfield networks have symmetric weights ($W_{i,j} = W_{j,i}$), $g(x) = \mathrm{sign}(x)$, $a_i = \pm 1$; holographic associative memory
– Boltzmann machines use stochastic activation functions, ≈ MCMC in Bayes nets
– recurrent neural nets have directed cycles with delays ⇒ have internal state (like flip-flops), can oscillate etc.

Feed-forward example
[Figure: network with input units 1 and 2, hidden units 3 and 4, output unit 5, and weights $W_{1,3}$, $W_{1,4}$, $W_{2,3}$, $W_{2,4}$, $W_{3,5}$, $W_{4,5}$]
Feed-forward network = a parameterized family of nonlinear functions:
$$a_5 = g(W_{3,5} \cdot a_3 + W_{4,5} \cdot a_4) = g\big(W_{3,5} \cdot g(W_{1,3} \cdot a_1 + W_{2,3} \cdot a_2) + W_{4,5} \cdot g(W_{1,4} \cdot a_1 + W_{2,4} \cdot a_2)\big)$$
Adjusting weights changes the function: do learning this way!

Single-layer perceptrons
[Figure: input units connected directly to output units by weights $W_{j,i}$; surface plot of perceptron output against $x_1$ and $x_2$]
Output units all operate separately: no shared weights
Adjusting weights moves the location, orientation, and steepness of cliff

Expressiveness of perceptrons
Consider a perceptron with $g$ = step function (Rosenblatt, 1957, 1960)
Can represent AND, OR, NOT, majority, etc., but not XOR
Represents a linear separator in input space:
$$\sum_j W_j x_j > 0 \quad \text{or} \quad \mathbf{W} \cdot \mathbf{x} > 0$$
[Figure: points in the $(x_1, x_2)$ plane for (a) $x_1$ and $x_2$, (b) $x_1$ or $x_2$, (c) $x_1$ xor $x_2$; panel (c) is marked “?”]
Minsky & Papert (1969) pricked the neural network balloon

Perceptron learning
Learn by adjusting weights to reduce error on training set
The squared error for an example with input $\mathbf{x}$ and true output $y$ is
$$E = \frac{1}{2}\mathit{Err}^2 \equiv \frac{1}{2}\big(y - h_{\mathbf{W}}(\mathbf{x})\big)^2$$
Perform optimization search by gradient descent:
$$\frac{\partial E}{\partial W_j} = \mathit{Err} \times \frac{\partial \mathit{Err}}{\partial W_j} = \mathit{Err} \times \frac{\partial}{\partial W_j}\Big(y - g\Big(\sum_{j=0}^{n} W_j x_j\Big)\Big) = -\mathit{Err} \times g'(\mathit{in}) \times x_j$$
Simple weight update rule:
$$W_j \leftarrow W_j + \alpha \times \mathit{Err} \times g'(\mathit{in}) \times x_j$$
E.g., +ve error ⇒ increase network output ⇒ increase weights on +ve inputs, decrease on -ve inputs

Perceptron learning contd.
Perceptron learning rule converges to a consistent function for any linearly separable data set
[Figure: two learning curves, proportion correct on test set against training set size (0 to 100), perceptron vs. decision tree; left panel MAJORITY on 11 inputs, right panel RESTAURANT data]
Perceptron learns majority function easily, DTL is hopeless
DTL learns restaurant function easily, perceptron cannot represent it
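The weight update rule W_j ← W_j + α × Err × g'(in) × x_j can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the slides' own code: it uses a sigmoid unit (so g'(in) = g(in)(1 − g(in))), a bias input fixed at −1, and a majority function on only 3 inputs rather than the 11 used in the learning curve. The names `output`, `train_perceptron`, and `MAJORITY`, and the values of `alpha` and `epochs`, are hypothetical.

```python
import itertools
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def output(weights, x):
    """Perceptron output g(in), with weights[0] acting on the bias input -1."""
    inputs = [-1.0] + list(x)
    return sigmoid(sum(w * xi for w, xi in zip(weights, inputs)))


def train_perceptron(examples, alpha=0.5, epochs=1000):
    """Gradient descent on squared error, one example at a time."""
    n_inputs = len(examples[0][0])
    weights = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        for x, y in examples:
            inputs = [-1.0] + list(x)
            out = output(weights, x)
            err = y - out                 # Err = y - h_W(x)
            slope = out * (1.0 - out)     # g'(in) for the sigmoid
            for j, xj in enumerate(inputs):
                # W_j <- W_j + alpha * Err * g'(in) * x_j
                weights[j] += alpha * err * slope * xj
    return weights


# Majority function on 3 inputs (an assumed, smaller stand-in for the 11-input task).
MAJORITY = [(x, 1 if sum(x) >= 2 else 0)
            for x in itertools.product((0, 1), repeat=3)]

if __name__ == "__main__":
    w = train_perceptron(MAJORITY)
    for x, y in MAJORITY:
        print(x, y, round(output(w, x), 3))
```

Majority is linearly separable, so a single unit can represent it; running the same loop on XOR would leave at least one example misclassified however long it trains.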

Multilayer perceptrons
Layers are usually fully connected; numbers of hidden units typically chosen by hand
[Figure: layered network with input units $a_k$, weights $W_{k,j}$, hidden units $a_j$, weights $W_{j,i}$, and output units $a_i$]

Expressiveness of MLPs
All continuous functions w/ 2 layers, all functions w/ 3 layers
[Figure: two surface plots of $h_{\mathbf{W}}(x_1, x_2)$ against $x_1$ and $x_2$]
Combine two opposite-facing threshold functions to make a ridge
Combine two perpendicular ridges to make a bump
Add bumps of various sizes and locations to fit any surface
Proof requires exponentially many hidden units (cf DTL proof)

Back-propagation learning
Output layer: same as for single-layer perceptron,
$$W_{j,i} \leftarrow W_{j,i} + \alpha \times a_j \times \Delta_i$$
where $\Delta_i = \mathit{Err}_i \times g'(\mathit{in}_i)$
Hidden layer: back-propagate the error from the output layer:
$$\Delta_j = g'(\mathit{in}_j) \sum_i W_{j,i} \Delta_i$$
Update rule for weights in hidden layer:
$$W_{k,j} \leftarrow W_{k,j} + \alpha \times a_k \times \Delta_j$$
(Most neuroscientists deny that back-propagation occurs in the brain)

Back-propagation derivation
The squared error on a single example is defined as
$$E = \frac{1}{2} \sum_i (y_i - a_i)^2$$
where the sum is over the nodes in the output layer.
$$\begin{aligned}
\frac{\partial E}{\partial W_{j,i}} &= -(y_i - a_i) \frac{\partial a_i}{\partial W_{j,i}} = -(y_i - a_i) \frac{\partial g(\mathit{in}_i)}{\partial W_{j,i}} \\
&= -(y_i - a_i)\, g'(\mathit{in}_i) \frac{\partial \mathit{in}_i}{\partial W_{j,i}} = -(y_i - a_i)\, g'(\mathit{in}_i) \frac{\partial}{\partial W_{j,i}} \Big(\sum_j W_{j,i} a_j\Big) \\
&= -(y_i - a_i)\, g'(\mathit{in}_i)\, a_j = -a_j \Delta_i
\end{aligned}$$

Back-propagation derivation contd.
$$\begin{aligned}
\frac{\partial E}{\partial W_{k,j}} &= -\sum_i (y_i - a_i) \frac{\partial a_i}{\partial W_{k,j}} = -\sum_i (y_i - a_i) \frac{\partial g(\mathit{in}_i)}{\partial W_{k,j}} \\
&= -\sum_i (y_i - a_i)\, g'(\mathit{in}_i) \frac{\partial \mathit{in}_i}{\partial W_{k,j}} = -\sum_i \Delta_i \frac{\partial}{\partial W_{k,j}} \Big(\sum_j W_{j,i} a_j\Big) \\
&= -\sum_i \Delta_i W_{j,i} \frac{\partial a_j}{\partial W_{k,j}} = -\sum_i \Delta_i W_{j,i} \frac{\partial g(\mathit{in}_j)}{\partial W_{k,j}} \\
&= -\sum_i \Delta_i W_{j,i}\, g'(\mathit{in}_j) \frac{\partial \mathit{in}_j}{\partial W_{k,j}} = -\sum_i \Delta_i W_{j,i}\, g'(\mathit{in}_j) \frac{\partial}{\partial W_{k,j}} \Big(\sum_k W_{k,j} a_k\Big) \\
&= -\sum_i \Delta_i W_{j,i}\, g'(\mathit{in}_j)\, a_k = -a_k \Delta_j
\end{aligned}$$

Back-propagation learning contd.
At each epoch, sum gradient updates for all examples and apply
Training curve for 100 restaurant examples: finds exact fit
[Figure: total error on training set against number of epochs (0 to 400)]
Typical problems: slow convergence, local minima
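The back-propagation updates, summed over all examples and applied once per epoch, might look like the following in Python. This is a hedged sketch rather than a reference implementation: it assumes sigmoid units, one hidden layer, bias inputs fixed at −1, and XOR as a stand-in data set (the restaurant data is not reproduced here). The names `init_weights`, `forward`, `total_error`, `train`, and `XOR`, the random seed, and the values of `alpha` and `epochs` are all hypothetical. As the slide notes, convergence can be slow and may stop in a local minimum, so the sketch does not promise an exact fit.

```python
import math
import random


def g(x):
    """Sigmoid activation; its derivative is g(x) * (1 - g(x))."""
    return 1.0 / (1.0 + math.exp(-x))


def init_weights(n_in, n_hidden, n_out, seed=0):
    rng = random.Random(seed)
    # W_kj[j][k]: weight from input k to hidden unit j (index 0 is the bias).
    W_kj = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    # W_ji[i][j]: weight from hidden unit j to output unit i (index 0 is the bias).
    W_ji = [[rng.uniform(-1, 1) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    return W_kj, W_ji


def forward(W_kj, W_ji, x):
    a_k = [-1.0] + list(x)  # inputs, with bias input -1 in front
    a_j = [-1.0] + [g(sum(w * a for w, a in zip(row, a_k))) for row in W_kj]
    a_i = [g(sum(w * a for w, a in zip(row, a_j))) for row in W_ji]
    return a_k, a_j, a_i


def total_error(W_kj, W_ji, examples):
    """Sum over examples of E = 1/2 * sum_i (y_i - a_i)^2."""
    return sum(0.5 * sum((yi - ai) ** 2
                         for yi, ai in zip(y, forward(W_kj, W_ji, x)[2]))
               for x, y in examples)


def train(W_kj, W_ji, examples, alpha=0.5, epochs=5000):
    """Batch back-propagation; modifies the weight lists in place."""
    for _ in range(epochs):
        dW_kj = [[0.0] * len(row) for row in W_kj]
        dW_ji = [[0.0] * len(row) for row in W_ji]
        for x, y in examples:
            a_k, a_j, a_i = forward(W_kj, W_ji, x)
            # Delta_i = Err_i * g'(in_i)
            delta_i = [(yi - ai) * ai * (1 - ai) for yi, ai in zip(y, a_i)]
            # Delta_j = g'(in_j) * sum_i W_ji * Delta_i  (hidden unit j is a_j[j + 1])
            delta_j = [a_j[j + 1] * (1 - a_j[j + 1]) *
                       sum(W_ji[i][j + 1] * delta_i[i] for i in range(len(W_ji)))
                       for j in range(len(W_kj))]
            for i, row in enumerate(dW_ji):
                for j in range(len(row)):
                    row[j] += alpha * a_j[j] * delta_i[i]
            for j, row in enumerate(dW_kj):
                for k in range(len(row)):
                    row[k] += alpha * a_k[k] * delta_j[j]
        # Apply the summed updates once per epoch.
        for W, dW in ((W_ji, dW_ji), (W_kj, dW_kj)):
            for row, drow in zip(W, dW):
                for idx, d in enumerate(drow):
                    row[idx] += d


XOR = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]

if __name__ == "__main__":
    W_kj, W_ji = init_weights(n_in=2, n_hidden=2, n_out=1)
    print("error before:", total_error(W_kj, W_ji, XOR))
    train(W_kj, W_ji, XOR)
    print("error after:", total_error(W_kj, W_ji, XOR))
```

Printing `total_error` every few hundred epochs gives a training curve of the same kind as the one described above, with total error on the training set plotted against number of epochs.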

Back-propagation learning contd.
Learning curve for MLP with 4 hidden units:
[Figure: proportion correct on test set against training set size (0 to 100), RESTAURANT data; decision tree vs. multilayer network]
MLPs are quite good for complex pattern recognition tasks, but resulting hypotheses cannot be understood easily

Handwritten digit recognition
3-nearest-neighbor = 2.4% error
400–300–10 unit MLP = 1.6% error
LeNet: 768–192–30–10 unit MLP = 0.9% error
Current best (kernel machines, vision algorithms) ≈ 0.6% error

Summary
Most brains have lots of neurons; each neuron ≈ linear–threshold unit (?)
Perceptrons (one-layer networks) insufficiently expressive
Multi-layer networks are sufficiently expressive; can be trained by gradient descent, i.e., error back-propagation
Many applications: speech, driving, handwriting, fraud detection, etc.
Engineering, cognitive modelling, and neural system modelling subfields have largely diverged