NATURE VOL. 323 9 OCTOBER 1986    LETTERS TO NATURE

delineating the absolute indigeneity of amino acids in fossils. As AMS techniques are refined to handle smaller samples, it may also become possible to date individual amino acid enantiomers by the ¹⁴C method. If one enantiomer is entirely derived from the other by racemization during diagenesis, the individual D- and L-enantiomers for a given amino acid should have identical ¹⁴C ages.

Older, more poorly preserved fossils may not always prove amenable to the determination of amino acid indigeneity by the stable isotope method, as the prospect of complete replacement of indigenous amino acids with non-indigenous amino acids increases with time. As non-indigenous amino acids undergo racemization, the enantiomers may have identical isotopic compositions and still not be related to the original organisms. Such a circumstance may, however, become easier to recognize as more information becomes available concerning the distribution and stable isotopic composition of the amino acid constituents of modern representatives of fossil organisms. Also, AMS dates on individual amino acid enantiomers may, in some cases, help to clarify indigeneity problems, in particular when stratigraphic controls can be used to estimate a general age range for the fossil in question.

Finally, the development of techniques for determining the stable isotopic composition of amino acid enantiomers may enable us to establish whether non-racemic amino acids in some carbonaceous meteorites27 are indigenous, or result in part from terrestrial contamination.

M.H.E. thanks the NSF, Division of Earth Sciences (grant EAR-8352055) and the following contributors to his Presidential Young Investigator Award for partial support of this research: Arco, Exxon, Phillips Petroleum, Texaco Inc., The Upjohn Co. We also acknowledge the donors of the Petroleum Research Fund, administered by the American Chemical Society (grant 16144-AC2 to M.H.E., grant 14805-AC2 to S.A.M.), for support. S.A.M. acknowledges NSERC (grant A2644) for partial support.

Received 19 May; accepted 15 July 1986.

1. Bada, J. L. & Protsch, R. Proc. natn. Acad. Sci. U.S.A. 70, 1331-1334 (1973).
2. Bada, J. L., Schroeder, R. A. & Carter, G. F. Science 184, 791-793 (1974).
3. Boulton, G. S. et al. Nature 298, 437-441 (1982).
4. Wehmiller, J. F. in Quaternary Dating Methods (ed. Mahaney, W. C.) 171-193 (Elsevier, Amsterdam, 1984).
5. Engel, M. H., Zumberge, J. E. & Nagy, B. Analyt. Biochem. 82, 415-422 (1977).
6. Bada, J. L. A. Rev. Earth planet. Sci. 13, 241-268 (1985).
7. Chisholm, B. S., Nelson, D. E. & Schwarcz, H. P. Science 216, 1131-1132 (1982).
8. Ambrose, S. H. & DeNiro, M. J. Nature 319, 321-324 (1986).
9. Macko, S. A., Estep, M. L. F., Hare, P. E. & Hoering, T. C. Yb. Carnegie Instn Wash. 82, 404-410 (1983).
10. Hare, P. E. & Estep, M. L. F. Yb. Carnegie Instn Wash. 82, 410-414 (1983).
11. Engel, M. H. & Hare, P. E. in Chemistry and Biochemistry of the Amino Acids (ed. Barrett, G. C.) 462-479 (Chapman and Hall, London, 1985).
12. Johnstone, R. A. W. & Rose, M. E. in Chemistry and Biochemistry of the Amino Acids (ed. Barrett, G. C.) 480-524 (Chapman and Hall, London, 1985).
13. Weinstein, S., Engel, M. H. & Hare, P. E. in Practical Protein Chemistry - A Handbook (ed. Darbre, A.) 337-344 (Wiley, New York, 1986).
14. Bada, J. L., Gillespie, R., Gowlett, J. A. J. & Hedges, R. E. M. Nature 312, 442-444 (1984).
15. Mitterer, R. M. & Kriausakul, N. Org. Geochem. 7, 91-98 (1984).
16. Williams, K. M. & Smith, G. G. Origins Life 8, 91-144 (1977).
17. Engel, M. H. & Hare, P. E. Yb. Carnegie Instn Wash. 81, 425-430 (1982).
18. Hare, P. E. Yb. Carnegie Instn Wash. 73, 576-581 (1974).
19. Pillinger, C. T. Nature 296, 802 (1982).
20. Neuberger, A. Adv. Protein Chem. 4, 298-383 (1948).
21. Engel, M. H. & Macko, S. A. Analyt. Chem. 56, 2598-2600 (1984).
22. Dungworth, G. Chem. Geol. 17, 135-153 (1976).
23. Weinstein, S., Engel, M. H. & Hare, P. E. Analyt. Biochem. 121, 370-377 (1982).
24. Macko, S. A., Lee, W. Y. & Parker, P. L. J. exp. mar. Biol. Ecol. 63, 145-149 (1982).
25. Macko, S. A., Estep, M. L. F. & Hoering, T. C. Yb. Carnegie Instn Wash. 81, 413-417 (1982).
26. Vallentyne, J. R. Geochim. cosmochim. Acta 28, 157-188 (1964).
27. Engel, M. H. & Nagy, B. Nature 296, 837-840 (1982).

Learning representations by back-propagating errors

David E. Rumelhart*, Geoffrey E. Hinton† & Ronald J. Williams*

* Institute for Cognitive Science, C-015, University of California, San Diego, La Jolla, California 92093, USA
† Department of Computer Science, Carnegie-Mellon University, Pittsburgh, Pennsylvania 15213, USA

We describe a new learning procedure, back-propagation, for networks of neurone-like units. The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector. As a result of the weight adjustments, internal 'hidden' units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units. The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure1.

There have been many attempts to design self-organizing neural networks. The aim is to find a powerful synaptic modification rule that will allow an arbitrarily connected neural network to develop an internal structure that is appropriate for a particular task domain. The task is specified by giving the desired state vector of the output units for each state vector of the input units. If the input units are directly connected to the output units it is relatively easy to find learning rules that iteratively adjust the relative strengths of the connections so as to progressively reduce the difference between the actual and desired output vectors. Learning becomes more interesting but more difficult when we introduce hidden units whose actual or desired states are not specified by the task. (In perceptrons, there are 'feature analysers' between the input and output that are not true hidden units because their input connections are fixed by hand, so their states are completely determined by the input vector: they do not learn representations.) The learning procedure must decide under what circumstances the hidden units should be active in order to help achieve the desired input-output behaviour. This amounts to deciding what these units should represent. We demonstrate that a general purpose and relatively simple procedure is powerful enough to construct appropriate internal representations.

The simplest form of the learning procedure is for layered networks which have a layer of input units at the bottom; any number of intermediate layers; and a layer of output units at the top. Connections within a layer or from higher to lower layers are forbidden, but connections can skip intermediate layers. An input vector is presented to the network by setting the states of the input units. Then the states of the units in each layer are determined by applying equations (1) and (2) to the connections coming from lower layers. All units within a layer have their states set in parallel, but different layers have their states set sequentially, starting at the bottom and working upwards until the states of the output units are determined.

The total input, x_j, to unit j is a linear function of the outputs, y_i, of the units that are connected to j and of the weights, w_{ji}, on these connections

    x_j = \sum_i y_i w_{ji}    (1)

Units can be given biases by introducing an extra input to each unit which always has a value of 1. The weight on this extra input is called the bias and is equivalent to a threshold of the opposite sign. It can be treated just like the other weights.

A unit has a real-valued output, y_j, which is a non-linear function of its total input

    y_j = \frac{1}{1 + e^{-x_j}}    (2)

† To whom correspondence should be addressed.
© Nature Publishing Group 1986
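The forward pass defined by equations (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration written for this reprint, not code from the original paper; the layer sizes, weights and inputs are arbitrary. The bias is handled exactly as the text describes, as an extra weight on a constant input of 1:

```python
import math

def forward_layer(y_prev, weights, biases):
    """Equations (1) and (2): each unit j forms the linear sum
    x_j = sum_i y_i * w_ji + bias_j, then applies the logistic
    non-linearity y_j = 1 / (1 + exp(-x_j))."""
    outputs = []
    for w_j, b_j in zip(weights, biases):
        x_j = sum(y_i * w_ji for y_i, w_ji in zip(y_prev, w_j)) + b_j
        outputs.append(1.0 / (1.0 + math.exp(-x_j)))
    return outputs

# A single unit with weights (0.5, -0.5) and zero bias, driven by input (1, 0):
# x = 0.5, so y = 1 / (1 + e^-0.5), roughly 0.6225.
print(forward_layer([1.0, 0.0], [[0.5, -0.5]], [0.0]))
```

A multi-layer forward pass is just this function applied once per layer, lower layers first, which is the parallel-within-a-layer, sequential-across-layers schedule the text describes.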
Fig. 1 A network that has learned to detect mirror symmetry in the input vector. The numbers on the arcs are weights and the numbers inside the nodes are biases. The learning required 1,425 sweeps through the set of 64 possible input vectors, with the weights being adjusted on the basis of the accumulated gradient after each sweep. The values of the parameters in equation (9) were ε = 0.1 and α = 0.9. The initial weights were random and were uniformly distributed between -0.3 and 0.3. The key property of this solution is that for a given hidden unit, weights that are symmetric about the middle of the input vector are equal in magnitude and opposite in sign. So if a symmetrical pattern is presented, both hidden units will receive a net input of 0 from the input units, and, because the hidden units have a negative bias, both will be off. In this case the output unit, having a positive bias, will be on. Note that the weights on each side of the midpoint are in the ratio 1:2:4. This ensures that each of the eight patterns that can occur above the midpoint sends a unique activation sum to each hidden unit, so the only pattern below the midpoint that can exactly balance this sum is the symmetrical one. For all non-symmetrical patterns, both hidden units will receive non-zero activations from the input units. The two hidden units have identical patterns of weights but with opposite signs, so for every non-symmetric pattern one hidden unit will come on and suppress the output unit.

Fig. 2 Two isomorphic family trees. The information can be expressed as a set of triples of the form (person 1)(relationship)(person 2), where the possible relationships are {father, mother, husband, wife, son, daughter, uncle, aunt, brother, sister, nephew, niece}. A layered net can be said to 'know' these triples if it can produce the third term of each triple when given the first two. The first two terms are encoded by activating two of the input units, and the network must then complete the proposition by activating the output unit that represents the third term.

Fig. 3 Activity levels in a five-layer network after it has learned. The bottom layer has 24 input units on the left for representing (person 1) and 12 input units on the right for representing the relationship. The white squares inside these two groups show the activity levels of the units. There is one active unit in the first group representing Colin and one in the second group representing the relationship 'has-aunt'. Each of the two input groups is totally connected to its own group of 6 units in the second layer. These groups learn to encode people and relationships as distributed patterns of activity. The second layer is totally connected to the central layer of 12 units, and these are connected to the penultimate layer of 6 units. The activity in the penultimate layer must activate the correct output units, each of which stands for a particular (person 2). In this case, there are two correct answers (marked by black dots) because Colin has two aunts. Both the input units and the output units are laid out spatially with the English people in one row and the isomorphic Italians immediately below.

It is not necessary to use exactly the functions given in equations (1) and (2). Any input-output function which has a bounded derivative will do. However, the use of a linear function for combining the inputs to a unit before applying the nonlinearity greatly simplifies the learning procedure.

The aim is to find a set of weights that ensure that for each input vector the output vector produced by the network is the same as (or sufficiently close to) the desired output vector. If there is a fixed, finite set of input-output cases, the total error in the performance of the network with a particular set of weights can be computed by comparing the actual and desired output vectors for every case. The total error, E, is defined as

    E = \frac{1}{2} \sum_c \sum_j (y_{j,c} - d_{j,c})^2    (3)

where c is an index over cases (input-output pairs), j is an index over output units, y is the actual state of an output unit and d is its desired state. To minimize E by gradient descent it is necessary to compute the partial derivative of E with respect to each weight in the network. This is simply the sum of the partial derivatives for each of the input-output cases. For a given case, the partial derivatives of the error with respect to each weight are computed in two passes. We have already described the forward pass in which the units in each layer have their states determined by the input they receive from units in lower layers using equations (1) and (2). The backward pass which propagates derivatives from the top layer back to the bottom one is more complicated.

The backward pass starts by computing \partial E/\partial y for each of the output units. Differentiating equation (3) for a particular case, c, and suppressing the index c gives

    \partial E/\partial y_j = y_j - d_j    (4)

We can then apply the chain rule to compute \partial E/\partial x_j

    \partial E/\partial x_j = \partial E/\partial y_j \cdot \mathrm{d}y_j/\mathrm{d}x_j

Differentiating equation (2) to get the value of \mathrm{d}y_j/\mathrm{d}x_j, and substituting, gives

    \partial E/\partial x_j = \partial E/\partial y_j \cdot y_j(1 - y_j)    (5)

This means that we know how a change in the total input x to an output unit will affect the error. But this total input is just a linear function of the states of the lower level units and it is also a linear function of the weights on the connections, so it is easy to compute how the error will be affected by changing these states and weights. For a weight w_{ji}, from i to j, the derivative is

    \partial E/\partial w_{ji} = \partial E/\partial x_j \cdot \partial x_j/\partial w_{ji} = \partial E/\partial x_j \cdot y_i    (6)

and for the output of the ith unit the contribution to \partial E/\partial y_i resulting from the effect of i on j is simply

    \partial E/\partial x_j \cdot \partial x_j/\partial y_i = \partial E/\partial x_j \cdot w_{ji}

so taking into account all the connections emanating from unit i we have

    \partial E/\partial y_i = \sum_j \partial E/\partial x_j \cdot w_{ji}    (7)

We have now seen how to compute \partial E/\partial y for any unit in the penultimate layer when given \partial E/\partial y for all units in the last layer. We can therefore repeat this procedure to compute this term for successively earlier layers, computing \partial E/\partial w for the weights as we go.

One way of using \partial E/\partial w is to change the weights after every input-output case. This has the advantage that no separate memory is required for the derivatives. An alternative scheme, which we used in the research reported here, is to accumulate \partial E/\partial w over all the input-output cases before changing the weights. The simplest version of gradient descent is to change each weight by an amount proportional to the accumulated \partial E/\partial w

    \Delta w = -\varepsilon \, \partial E/\partial w    (8)

This method does not converge as rapidly as methods which make use of the second derivatives, but it is much simpler and can easily be implemented by local computations in parallel hardware. It can be significantly improved, without sacrificing the simplicity and locality, by using an acceleration method in which the current gradient is used to modify the velocity of the point in weight space instead of its position

    \Delta w(t) = -\varepsilon \, \partial E/\partial w(t) + \alpha \, \Delta w(t-1)    (9)

where t is incremented by 1 for each sweep through the whole set of input-output cases, and α is an exponential decay factor between 0 and 1 that determines the relative contribution of the current gradient and earlier gradients to the weight change.

To break symmetry we start with small random weights. Variants on the learning procedure have been discovered independently by David Parker (personal communication) and by Yann Le Cun3.

One simple task that cannot be done by just connecting the input units to the output units is the detection of symmetry. To detect whether the binary activity levels of a one-dimensional array of input units are symmetrical about the centre point, it is essential to use an intermediate layer because the activity in an individual input unit, considered alone, provides no evidence about the symmetry or non-symmetry of the whole input vector, so simply adding up the evidence from the individual input units is insufficient. (A more formal proof that intermediate units are required is given in ref. 2.) The learning procedure discovered an elegant solution using just two intermediate units, as shown in Fig. 1.

Another interesting task is to store the information in the two family trees (Fig. 2). Figure 3 shows the network we used, and Fig. 4 shows the 'receptive fields' of some of the hidden units after the network was trained on 100 of the 104 possible triples.

Fig. 4 The weights from the 24 input units that represent people to the 6 units in the second layer that learn distributed representations of people. White rectangles, excitatory weights; black rectangles, inhibitory weights; area of the rectangle encodes the magnitude of the weight. The weights from the 12 English people are in the top row of each unit. Unit 1 is primarily concerned with the distinction between English and Italian and most of the other units ignore this distinction. This means that the representation of an English person is very similar to the representation of their Italian equivalent. The network is making use of the isomorphism between the two family trees to allow it to share structure and it will therefore tend to generalize sensibly from one tree to the other. Unit 2 encodes which generation a person belongs to, and unit 6 encodes which branch of the family they come from. The features captured by the hidden units are not at all explicit in the input and output encodings, since these use a separate unit for each person. Because the hidden features capture the underlying structure of the task domain, the network generalizes correctly to the four triples on which it was not trained. We trained the network for 1500 sweeps using ε = 0.005 and α = 0.5 for the first 20 sweeps and ε = 0.01 and α = 0.9 for the remaining sweeps. To make it easier to interpret the weights we introduced 'weight-decay' by decrementing every weight by 0.2% after each weight change. After prolonged learning, the decay was balanced by \partial E/\partial w, so the final magnitude of each weight indicates its usefulness in reducing the error. To prevent the network needing large weights to drive the outputs to 1 or 0, the error was considered to be zero if output units that should be on had activities above 0.8 and output units that should be off had activities below 0.2.

Fig. 5 A synchronous iterative net that is run for three iterations and the equivalent layered net. Each time-step in the recurrent net corresponds to a layer in the layered net. The learning procedure for layered nets can be mapped into a learning procedure for iterative nets. Two complications arise in performing this mapping: first, in a layered net the output levels of the units in the intermediate layers during the forward pass are required for performing the backward pass (see equations (5) and (6)). So in an iterative net it is necessary to store the history of output states of each unit. Second, for a layered net to be equivalent to an iterative net, corresponding weights between different layers must have the same value. To preserve this property, we average \partial E/\partial w for all the weights in each set of corresponding weights and then change each weight in the set by an amount proportional to this average gradient. With these two provisos, the learning procedure can be applied directly to iterative nets. These nets can then either learn to perform iterative searches or learn sequential structures4.

So far, we have only dealt with layered, feed-forward networks. The equivalence between layered networks and recurrent networks that are run iteratively is shown in Fig. 5.

The most obvious drawback of the learning procedure is that the error-surface may contain local minima so that gradient descent is not guaranteed to find a global minimum. However, experience with many tasks shows that the network very rarely gets stuck in poor local minima that are significantly worse than the global minimum. We have only encountered this undesirable behaviour in networks that have just enough connections to perform the task. Adding a few more connections creates extra dimensions in weight-space and these dimensions provide paths around the barriers that create poor local minima in the lower dimensional subspaces.
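The two passes described by equations (3)-(7) can be written out directly. The sketch below is a compact illustration written for this reprint, not the authors' original program; the 4-2-1 network shape, the random seed and the tolerance are arbitrary choices. It computes \partial E/\partial w for one input-output case in a layered sigmoid net, and checks the analytic gradient of one weight against a finite-difference estimate of \partial E/\partial w:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inp, layers):
    """layers: list of (weights, biases) per layer, lowest first.
    Returns the activity vector of every layer, input included."""
    acts = [inp]
    for W, b in layers:
        acts.append([sigmoid(sum(yi * wji for yi, wji in zip(acts[-1], wj)) + bj)
                     for wj, bj in zip(W, b)])
    return acts

def backward(acts, layers, desired):
    """Equations (4)-(7): propagate dE/dy from the output layer down,
    returning (dE/dW, dE/dbias) for each layer, lowest first."""
    dE_dy = [yj - dj for yj, dj in zip(acts[-1], desired)]            # eqn (4)
    grads = []
    for (W, b), y_in, y_out in zip(reversed(layers),
                                   reversed(acts[:-1]), reversed(acts[1:])):
        dE_dx = [g * yj * (1.0 - yj) for g, yj in zip(dE_dy, y_out)]  # eqn (5)
        dW = [[g * yi for yi in y_in] for g in dE_dx]                 # eqn (6)
        db = dE_dx[:]                                                 # bias input is always 1
        dE_dy = [sum(dE_dx[j] * W[j][i] for j in range(len(W)))       # eqn (7)
                 for i in range(len(y_in))]
        grads.append((dW, db))
    return list(reversed(grads))

def error(inp, layers, desired):
    """Equation (3) for a single case."""
    out = forward(inp, layers)[-1]
    return 0.5 * sum((yj - dj) ** 2 for yj, dj in zip(out, desired))

random.seed(0)
layers = [([[random.uniform(-0.3, 0.3) for _ in range(4)] for _ in range(2)],
           [0.0, 0.0]),
          ([[random.uniform(-0.3, 0.3) for _ in range(2)]], [0.0])]
inp, desired = [1.0, 0.0, 0.0, 1.0], [1.0]
grads = backward(forward(inp, layers), layers, desired)

# Central-difference check on one first-layer weight.
h = 1e-5
layers[0][0][0][0] += h
e_plus = error(inp, layers, desired)
layers[0][0][0][0] -= 2 * h
e_minus = error(inp, layers, desired)
layers[0][0][0][0] += h
numeric = (e_plus - e_minus) / (2 * h)
assert abs(numeric - grads[0][0][0][0]) < 1e-6
```

Summing these per-case gradients over all input-output cases gives the accumulated \partial E/\partial w used by the batch scheme the text describes.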
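The accumulated-gradient update with momentum, equations (8) and (9), is equally compact. This is our illustration, not the authors' code; the default ε = 0.1 and α = 0.9 simply echo the values quoted for Fig. 1. One velocity is kept per weight, and setting α = 0 recovers plain gradient descent, equation (8):

```python
def momentum_step(weights, velocities, gradients, eps=0.1, alpha=0.9):
    """Equation (9): the current gradient modifies the velocity of the
    point in weight space, and the velocity then moves the point."""
    for i, g in enumerate(gradients):
        velocities[i] = -eps * g + alpha * velocities[i]   # Δw(t)
        weights[i] += velocities[i]
    return weights, velocities

# Two sweeps with a constant accumulated gradient of 1.0 on a single weight:
# Δw(1) = -0.1, then Δw(2) = -0.1 + 0.9 * (-0.1) = -0.19, so the weight
# moves from 0 to -0.29 (up to floating-point rounding).
w, v = [0.0], [0.0]
w, v = momentum_step(w, v, [1.0])
w, v = momentum_step(w, v, [1.0])
print(w, v)
```

Under a persistent gradient the velocity grows towards -ε g / (1 - α), which is why the acceleration method crosses flat regions of the error surface faster than equation (8) alone.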
The learning procedure, in its current form, is not a plausible model of learning in brains. However, applying the procedure to various tasks shows that interesting internal representations can be constructed by gradient descent in weight-space, and this suggests that it is worth looking for more biologically plausible ways of doing gradient descent in neural networks.

We thank the System Development Foundation and the Office of Naval Research for financial support.

Received 1 May; accepted 31 July 1986.

1. Rosenblatt, F. Principles of Neurodynamics (Spartan, Washington, DC, 1961).
2. Minsky, M. L. & Papert, S. Perceptrons (MIT, Cambridge, 1969).
3. Le Cun, Y. Proc. Cognitiva 85, 599-604 (1985).
4. Rumelhart, D. E., Hinton, G. E. & Williams, R. J. in Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations (eds Rumelhart, D. E. & McClelland, J. L.) 318-362 (MIT, Cambridge, 1986).

Bilateral amblyopia after a short period of reverse occlusion in kittens

Kathryn M. Murphy* & Donald E. Mitchell

Department of Psychology, Dalhousie University, Halifax, Nova Scotia, Canada B3H 4J1

* Present address: School of Optometry, University of California, Berkeley, California 94720, USA.

The majority of neurones in the visual cortex of both adult cats and kittens can be excited by visual stimulation of either eye. Nevertheless, if one eye is deprived of patterned vision early in life, most cortical cells can only be activated by visual stimuli presented to the nondeprived eye and behaviourally the deprived eye is apparently useless2. Although the consequences of monocular deprivation can be severe, they can in many circumstances be rapidly reversed with the early implementation of reverse occlusion, which forces the use of the initially deprived eye. However, by itself reverse occlusion does not restore a normal distribution of cortical ocular dominance and only promotes visual recovery in one eye. In an effort to find a procedure that might restore good binocular vision, we have examined the effects on acuity and cortical ocular dominance of a short, but physiologically optimal, period of reverse occlusion, followed by a period of binocular vision beginning at 7.5 weeks of age. Surprisingly, despite the early introduction of binocular vision, both eyes attained acuities that were only approximately 1/3 of normal acuity levels. Despite the severe bilateral amblyopia, cortical ocular dominance appeared similar to that of normal cats. This is the first demonstration of severe bilateral amblyopia following consecutive periods of monocular occlusion.

Fig. 1 Changes in visual acuity during the period of binocular vision for two kittens (C156 and C184) that were previously monocularly deprived until 5 weeks of age, and then reverse occluded for 18 days. ●, Acuity of the initially deprived eye; ○, acuity of the initially nondeprived eye.

Nine kittens were used, of which eight were monocularly deprived by eyelid suture from about the time of natural eye opening (6 to 11 days) until 5 weeks of age, at which time the initially deprived eye was opened and the other eye was sutured closed for 18 days. Physiological recordings from area 17 were made from one normal control and from five monocularly-deprived kittens, one immediately after reverse occlusion (as a control); the remaining four after a further period of at least 4 weeks (range 4-8 weeks) of normal binocular vision. Grating acuity thresholds were determined for both eyes of a further three kittens (subjected to the same regime: monocular deprivation, 18 days of reverse suturing, followed by normal binocular vision) by use of a jumping stand. None of the kittens tested behaviourally were examined physiologically. Single unit recordings were made in area 17 of the anaesthetized, paralysed kittens (one normal, five experimental) with glass-coated platinum-iridium electrodes. Anaesthesia was induced by intravenous pentothal and maintained by artificial respiration with 70% N2O and 30% O2 supplemented with intravenous Nembutal; EEG, EKG, body temperature, and expired CO2 levels were monitored. The eyes were brought to focus on a tangent screen 137 cm distant from the kitten using contact lenses with 3 mm artificial pupils. Single units were recorded along one long penetration in area 17 down the medial bank of the postlateral gyrus in each hemisphere, always beginning in the hemisphere contralateral to the initially open eye. Receptive fields were sampled according to established procedures, every 100 μm along the penetration in a cortical region corresponding to the horizontal meridian of visual space. All units were located within 15° of the area centralis, with the majority within 5°.

The longitudinal changes in visual acuity of both eyes following introduction of binocular vision are shown in Fig. 1 for two representative kittens. At the end of 18 days of reverse occlusion the vision of the initially deprived eye had recovered to only rudimentary levels (1-2.5 cycles per degree) while at the same time the initially nondeprived eye had been rendered blind. During the subsequent period of binocular visual exposure the vision of both eyes improved slightly, but only to a very limited extent (to between 1.7 and 3.4 cycles per degree). The results from the third animal were very similar. After more than 2 months of binocular exposure the acuities of the initially deprived and nondeprived eyes were, respectively, 2.54 and 3.35 cycles per degree. Surprisingly, after 2 months of binocular vision, the acuity of both eyes of these animals remained at about one-third to one-half of normal levels. Although the initially deprived eye was opened at the peak of the sensitive period (5 weeks of age) and the initially nondeprived eye was closed for a relatively brief period of time (18 days), this deprivation regimen had a devastating and permanent effect upon the visual acuity of both eyes.