Christopher M. Bishop
Pattern Recognition and Machine Learning
Springer
This book is dedicated to my family: Jenna, Mark, and Hugh Total eclipse of the sun, Antalya, Turkey, 29 March 2006
Preface

Pattern recognition has its origins in engineering, whereas machine learning grew out of computer science. However, these activities can be viewed as two facets of the same field, and together they have undergone substantial development over the past ten years. In particular, Bayesian methods have grown from a specialist niche to become mainstream, while graphical models have emerged as a general framework for describing and applying probabilistic models. Also, the practical applicability of Bayesian methods has been greatly enhanced through the development of a range of approximate inference algorithms such as variational Bayes and expectation propagation. Similarly, new models based on kernels have had significant impact on both algorithms and applications.

This new textbook reflects these recent developments while providing a comprehensive introduction to the fields of pattern recognition and machine learning. It is aimed at advanced undergraduates or first year PhD students, as well as researchers and practitioners, and assumes no previous knowledge of pattern recognition or machine learning concepts. Knowledge of multivariate calculus and basic linear algebra is required, and some familiarity with probabilities would be helpful though not essential as the book includes a self-contained introduction to basic probability theory.

Because this book has broad scope, it is impossible to provide a complete list of references, and in particular no attempt has been made to provide accurate historical attribution of ideas. Instead, the aim has been to give references that offer greater detail than is possible here and that hopefully provide entry points into what, in some cases, is a very extensive literature. For this reason, the references are often to more recent textbooks and review articles rather than to original sources.
The book is supported by a great deal of additional material, including lecture slides as well as the complete set of figures used in the book, and the reader is encouraged to visit the book web site for the latest information: http://research.microsoft.com/~cmbishop/PRML vii
Exercises

The exercises that appear at the end of every chapter form an important component of the book. Each exercise has been carefully chosen to reinforce concepts explained in the text or to develop and generalize them in significant ways, and each is graded according to difficulty ranging from (⋆), which denotes a simple exercise taking a few minutes to complete, through to (⋆⋆⋆), which denotes a significantly more complex exercise. It has been difficult to know to what extent worked solutions to the exercises should be made widely available. Those engaged in self study will find worked solutions very beneficial, whereas many course tutors request that solutions be available only via the publisher so that the exercises may be used in class. In order to try to meet these conflicting requirements, those exercises that help amplify key points in the text, or that fill in important details, have solutions that are available as a PDF file from the book web site. Such exercises are denoted by www. Solutions for the remaining exercises are available to course tutors by contacting the publisher (contact details are given on the book web site). Readers are strongly encouraged to work through the exercises unaided, and to turn to the solutions only as required.

Although this book focuses on concepts and principles, in a taught course the students should ideally have the opportunity to experiment with some of the key algorithms using appropriate data sets. A companion volume (Bishop and Nabney, 2008) will deal with practical aspects of pattern recognition and machine learning, and will be accompanied by Matlab software implementing most of the algorithms discussed in this book.

Acknowledgements

First of all I would like to express my sincere thanks to Markus Svensén who has provided immense help with preparation of figures and with the typesetting of the book in LaTeX. His assistance has been invaluable.
I am very grateful to Microsoft Research for providing a highly stimulating research environment and for giving me the freedom to write this book (the views and opinions expressed in this book, however, are my own and are therefore not necessarily the same as those of Microsoft or its affiliates).

Springer has provided excellent support throughout the final stages of preparation of this book, and I would like to thank my commissioning editor John Kimmel for his support and professionalism, as well as Joseph Piliero for his help in designing the cover and the text format and MaryAnn Brickner for her numerous contributions during the production phase. The inspiration for the cover design came from a discussion with Antonio Criminisi.

I also wish to thank Oxford University Press for permission to reproduce excerpts from an earlier textbook, Neural Networks for Pattern Recognition (Bishop, 1995a). The images of the Mark 1 perceptron and of Frank Rosenblatt are reproduced with the permission of Arvin Calspan Advanced Technology Center. I would also like to thank Asela Gunawardana for plotting the spectrogram in Figure 13.1, and Bernhard Schölkopf for permission to use his kernel PCA code to plot Figure 12.17.
Many people have helped by proofreading draft material and providing comments and suggestions, including Shivani Agarwal, Cédric Archambeau, Arik Azran, Andrew Blake, Hakan Cevikalp, Michael Fourman, Brendan Frey, Zoubin Ghahramani, Thore Graepel, Katherine Heller, Ralf Herbrich, Geoffrey Hinton, Adam Johansen, Matthew Johnson, Michael Jordan, Eva Kalyvianaki, Anitha Kannan, Julia Lasserre, David Liu, Tom Minka, Ian Nabney, Tonatiuh Pena, Yuan Qi, Sam Roweis, Balaji Sanjiya, Toby Sharp, Ana Costa e Silva, David Spiegelhalter, Jay Stokes, Tara Symeonides, Martin Szummer, Marshall Tappen, Ilkay Ulusoy, Chris Williams, John Winn, and Andrew Zisserman.

Finally, I would like to thank my wife Jenna who has been hugely supportive throughout the several years it has taken to write this book.

Chris Bishop
Cambridge
February 2006
Mathematical notation

I have tried to keep the mathematical content of the book to the minimum necessary to achieve a proper understanding of the field. However, this minimum level is nonzero, and it should be emphasized that a good grasp of calculus, linear algebra, and probability theory is essential for a clear understanding of modern pattern recognition and machine learning techniques. Nevertheless, the emphasis in this book is on conveying the underlying concepts rather than on mathematical rigour.

I have tried to use a consistent notation throughout the book, although at times this means departing from some of the conventions used in the corresponding research literature. Vectors are denoted by lower case bold Roman letters such as x, and all vectors are assumed to be column vectors. A superscript T denotes the transpose of a matrix or vector, so that xT will be a row vector. Uppercase bold Roman letters, such as M, denote matrices. The notation (w1,...,wM) denotes a row vector with M elements, while the corresponding column vector is written as w = (w1,...,wM)T.

The notation [a, b] is used to denote the closed interval from a to b, that is the interval including the values a and b themselves, while (a, b) denotes the corresponding open interval, that is the interval excluding a and b. Similarly, [a, b) denotes an interval that includes a but excludes b. For the most part, however, there will be little need to dwell on such refinements as whether the end points of an interval are included or not.

The M × M identity matrix (also known as the unit matrix) is denoted IM, which will be abbreviated to I where there is no ambiguity about its dimensionality. It has elements Iij that equal 1 if i = j and 0 if i ≠ j.

A functional is denoted f[y] where y(x) is some function. The concept of a functional is discussed in Appendix D.

The notation g(x) = O(f(x)) denotes that |f(x)/g(x)| is bounded as x → ∞. For instance, if g(x) = 3x² + 2, then g(x) = O(x²).
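The conventions above can be made concrete in a short sketch. The book itself contains no code (its companion volume uses Matlab); this NumPy fragment is purely illustrative, and names such as w and I_M are ours, chosen to mirror the notation.

```python
import numpy as np

# Vectors are column vectors: w = (w1, ..., wM)^T is an M x 1 array,
# and the superscript T (transpose) turns it into a 1 x M row vector.
M = 3
w = np.array([[1.0], [2.0], [3.0]])  # column vector, shape (3, 1)
w_T = w.T                            # row vector w^T, shape (1, 3)

# The M x M identity matrix I_M has elements Iij = 1 if i = j, else 0.
I_M = np.eye(M)

# The O(.) example: with f(x) = x^2 and g(x) = 3x^2 + 2,
# |f(x)/g(x)| stays bounded (it approaches 1/3), so g(x) = O(x^2).
x = 1e6
ratio = (x ** 2) / (3 * x ** 2 + 2)

print(w_T.shape, I_M[0, 0], round(ratio, 6))
```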
The expectation of a function f(x, y) with respect to a random variable x is denoted by Ex[f(x, y)]. In situations where there is no ambiguity as to which variable is being averaged over, this will be simplified by omitting the suffix, for instance
E[x]. If the distribution of x is conditioned on another variable z, then the corresponding conditional expectation will be written Ex[f(x)|z]. Similarly, the variance is denoted var[f(x)], and for vector variables the covariance is written cov[x, y]. We shall also use cov[x] as a shorthand notation for cov[x, x]. The concepts of expectations and covariances are introduced in Section 1.2.2.

If we have N values x1,..., xN of a D-dimensional vector x = (x1,...,xD)T, we can combine the observations into a data matrix X in which the nth row of X corresponds to the row vector xnT. Thus the n, i element of X corresponds to the ith element of the nth observation xn. For the case of one-dimensional variables we shall denote such a matrix by x, which is a column vector whose nth element is xn. Note that x (which has dimensionality N) uses a different typeface to distinguish it from x (which has dimensionality D).
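The data-matrix convention can likewise be sketched in code. Again this is an illustrative NumPy fragment, not part of the book; the variable names and the sample values are ours.

```python
import numpy as np

# N = 4 observations x1, ..., xN of a D = 2 dimensional vector x.
observations = [np.array([0.0, 1.0]),
                np.array([2.0, 3.0]),
                np.array([4.0, 5.0]),
                np.array([6.0, 7.0])]

# Data matrix X: the nth row of X is the row vector xn^T, so X is N x D.
X = np.stack(observations)  # shape (4, 2)

# The (n, i) element of X is the ith element of the nth observation xn.
n, i = 2, 1
print(X.shape, X[n, i] == observations[n][i])
```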
Contents Preface vii Mathematical notation xi Contents xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting . . . . ............. 4 1.2 Probability Theory . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.2.1 Probability densities . . . . . . . . . . . . . . . . . . . . . 17 1.2.2 Expectations and covariances . . . . . . . . . . . . . . . . 19 1.2.3 Bayesian probabilities . . . . . . . . . . . . . . . . . . . . 21 1.2.4 The Gaussian distribution . . . . . . . . . . . . . . . . . . 24 1.2.5 Curve fitting re-visited . . . . . . . . . . . . . . . . . . . . 28 1.2.6 Bayesian curve fitting . . . . . . . . . . . . . . . . . . . . 30 1.3 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 1.4 The Curse of Dimensionality . . . . . . . . . . . . . . . . . . . . . 33 1.5 Decision Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 1.5.1 Minimizing the misclassification rate . . . . . . . . . . . . 39 1.5.2 Minimizing the expected loss . . . . . . . . . . . . . . . . 41 1.5.3 The reject option . . . . . . . . . . . . . . . . . . . . . . . 42 1.5.4 Inference and decision . . . . . . . . . . . . . . . . . . . . 42 1.5.5 Loss functions for regression . . . . . . . . . . . . . . . . . 46 1.6 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . 48 1.6.1 Relative entropy and mutual information . . . . . . . . . . 55 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 xiii
2 Probability Distributions 67 2.1 Binary Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 2.1.1 The beta distribution . . . . . . . . . . . . . . . . . . . . . 71 2.2 Multinomial Variables . . . . . . . . . . . . . . . . . . . . . . . . 74 2.2.1 The Dirichlet distribution . . . . . . . . . . . . . . . . . . . 76 2.3 The Gaussian Distribution . . . . . . . . . . . . . . . . . . . . . . 78 2.3.1 Conditional Gaussian distributions . . . . . . . . . . . . . . 85 2.3.2 Marginal Gaussian distributions . . . . . . . . . . . . . . . 88 2.3.3 Bayes’ theorem for Gaussian variables . . . . . . . . . . . . 90 2.3.4 Maximum likelihood for the Gaussian . . . . . . . . . . . . 93 2.3.5 Sequential estimation . . . . . . . . . . . . . . . . . . . . . 94 2.3.6 Bayesian inference for the Gaussian . . . . . . . . . . . . . 97 2.3.7 Student’s t-distribution . . . . . . . . . . . . . . . . . . . . 102 2.3.8 Periodic variables . . . . . . . . . . . . . . . . . . . . . . . 105 2.3.9 Mixtures of Gaussians . . . . . . . . . . . . . . . . . . . . 110 2.4 The Exponential Family . . . . . . . . . . . . . . . . . . . . . . . 113 2.4.1 Maximum likelihood and sufficient statistics . . . . . . . . 116 2.4.2 Conjugate priors . . . . . . . . . . . . . . . . . . . . . . . 117 2.4.3 Noninformative priors . . . . . . . . . . . . . . . . . . . . 117 2.5 Nonparametric Methods . . . . . . . . . . . . . . . . . . . . . . . 120 2.5.1 Kernel density estimators . . . . . . . . . . . . . . . . . . . 122 2.5.2 Nearest-neighbour methods . . . . . . . . . . . . . . . . . 124 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 3 Linear Models for Regression 137 3.1 Linear Basis Function Models . . . . . . . . . . . . . . . . . . . . 138 3.1.1 Maximum likelihood and least squares . . . . . . . . . . . . 140 3.1.2 Geometry of least squares . . . . . . . . . . . . . . . . . . 143 3.1.3 Sequential learning . . . . . . . . . . . . . . . . . . . . . . 
143 3.1.4 Regularized least squares . . . . . . . . . . . . . . . . . . . 144 3.1.5 Multiple outputs . . . . . . . . . . . . . . . . . . . . . . . 146 3.2 The Bias-Variance Decomposition . . . . . . . . . . . . . . . . . . 147 3.3 Bayesian Linear Regression . . . . . . . . . . . . . . . . . . . . . 152 3.3.1 Parameter distribution . . . . . . . . . . . . . . . . . . . . 152 3.3.2 Predictive distribution . . . . . . . . . . . . . . . . . . . . 156 3.3.3 Equivalent kernel . . . . . . . . . . . . . . . . . . . . . . . 159 3.4 Bayesian Model Comparison . . . . . . . . . . . . . . . . . . . . . 161 3.5 The Evidence Approximation . . . . . . . . . . . . . . . . . . . . 165 3.5.1 Evaluation of the evidence function . . . . . . . . . . . . . 166 3.5.2 Maximizing the evidence function . . . . . . . . . . . . . . 168 3.5.3 Effective number of parameters . . . . . . . . . . . . . . . 170 3.6 Limitations of Fixed Basis Functions . . . . . . . . . . . . . . . . 172 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
4 Linear Models for Classification 179 4.1 Discriminant Functions . . . . . . . . . . . . . . . . . . . . . . . . 181 4.1.1 Two classes . . . . . . . . . . . . . . . . . . . . . . . . . . 181 4.1.2 Multiple classes . . . . . . . . . . . . . . . . . . . . . . . . 182 4.1.3 Least squares for classification . . . . . . . . . . . . . . . . 184 4.1.4 Fisher’s linear discriminant . . . . . . . . . . . . . . . . . . 186 4.1.5 Relation to least squares . . . . . . . . . . . . . . . . . . . 189 4.1.6 Fisher’s discriminant for multiple classes . . . . . . . . . . 191 4.1.7 The perceptron algorithm . . . . . . . . . . . . . . . . . . . 192 4.2 Probabilistic Generative Models . . . . . . . . . . . . . . . . . . . 196 4.2.1 Continuous inputs . . . . . . . . . . . . . . . . . . . . . . 198 4.2.2 Maximum likelihood solution . . . . . . . . . . . . . . . . 200 4.2.3 Discrete features . . . . . . . . . . . . . . . . . . . . . . . 202 4.2.4 Exponential family . . . . . . . . . . . . . . . . . . . . . . 202 4.3 Probabilistic Discriminative Models . . . . . . . . . . . . . . . . . 203 4.3.1 Fixed basis functions . . . . . . . . . . . . . . . . . . . . . 204 4.3.2 Logistic regression . . . . . . . . . . . . . . . . . . . . . . 205 4.3.3 Iterative reweighted least squares . . . . . . . . . . . . . . 207 4.3.4 Multiclass logistic regression . . . . . . . . . . . . . . . . . 209 4.3.5 Probit regression . . . . . . . . . . . . . . . . . . . . . . . 210 4.3.6 Canonical link functions . . . . . . . . . . . . . . . . . . . 212 4.4 The Laplace Approximation . . . . . . . . . . . . . . . . . . . . . 213 4.4.1 Model comparison and BIC . . . . . . . . . . . . . . . . . 216 4.5 Bayesian Logistic Regression . . . . . . . . . . . . . . . . . . . . 217 4.5.1 Laplace approximation . . . . . . . . . . . . . . . . . . . . 217 4.5.2 Predictive distribution . . . . . . . . . . . . . . . . . . . . 218 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
220 5 Neural Networks 225 5.1 Feed-forward Network Functions . . . . . . . . . . . . . . . . . . 227 5.1.1 Weight-space symmetries . . . . . . . . . . . . . . . . . . 231 5.2 Network Training . . . . . . . . . . . . . . . . . . . . . . . . . . . 232 5.2.1 Parameter optimization . . . . . . . . . . . . . . . . . . . . 236 5.2.2 Local quadratic approximation . . . . . . . . . . . . . . . . 237 5.2.3 Use of gradient information . . . . . . . . . . . . . . . . . 239 5.2.4 Gradient descent optimization . . . . . . . . . . . . . . . . 240 5.3 Error Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . 241 5.3.1 Evaluation of error-function derivatives . . . . . . . . . . . 242 5.3.2 A simple example . . . . . . . . . . . . . . . . . . . . . . 245 5.3.3 Efficiency of backpropagation . . . . . . . . . . . . . . . . 246 5.3.4 The Jacobian matrix . . . . . . . . . . . . . . . . . . . . . 247 5.4 The Hessian Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . 249 5.4.1 Diagonal approximation . . . . . . . . . . . . . . . . . . . 250 5.4.2 Outer product approximation . . . . . . . . . . . . . . . . . 251 5.4.3 Inverse Hessian . . . . . . . . . . . . . . . . . . . . . . . . 252