Data Mining and Knowledge Discovery, 2, 121–167 (1998)
© 1998 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

A Tutorial on Support Vector Machines for Pattern Recognition

CHRISTOPHER J.C. BURGES                                  burges@lucent.com
Bell Laboratories, Lucent Technologies

Editor: Usama Fayyad

Abstract. The tutorial starts with an overview of the concepts of VC dimension and structural risk minimization. We then describe linear Support Vector Machines (SVMs) for separable and non-separable data, working through a non-trivial example in detail. We describe a mechanical analogy, and discuss when SVM solutions are unique and when they are global. We describe how support vector training can be practically implemented, and discuss in detail the kernel mapping technique which is used to construct SVM solutions which are nonlinear in the data. We show how Support Vector machines can have very large (even infinite) VC dimension by computing the VC dimension for homogeneous polynomial and Gaussian radial basis function kernels. While very high VC dimension would normally bode ill for generalization performance, and while at present there exists no theory which shows that good generalization performance is guaranteed for SVMs, there are several arguments which support the observed high accuracy of SVMs, which we review. Results of some experiments which were inspired by these arguments are also presented. We give numerous examples and proofs of most of the key theorems. There is new material, and I hope that the reader will find that even old material is cast in a fresh light.

Keywords: support vector machines, statistical learning theory, VC dimension, pattern recognition

1. Introduction

The purpose of this paper is to provide an introductory yet extensive tutorial on the basic ideas behind Support Vector Machines (SVMs).
The books (Vapnik, 1995; Vapnik, 1998) contain excellent descriptions of SVMs, but they leave room for an account whose purpose from the start is to teach. Although the subject can be said to have started in the late seventies (Vapnik, 1979), it is only now receiving increasing attention, and so the time appears suitable for an introductory review. The tutorial dwells entirely on the pattern recognition problem. Many of the ideas there carry directly over to the cases of regression estimation and linear operator inversion, but space constraints precluded the exploration of these topics here.

The tutorial contains some new material. All of the proofs are my own versions, where I have placed a strong emphasis on their being both clear and self-contained, to make the material as accessible as possible. This was done at the expense of some elegance and generality: however generality is usually easily added once the basic ideas are clear. The longer proofs are collected in the Appendix.

By way of motivation, and to alert the reader to some of the literature, we summarize some recent applications and extensions of support vector machines. For the pattern recognition case, SVMs have been used for isolated handwritten digit recognition (Cortes and Vapnik, 1995; Schölkopf, Burges and Vapnik, 1995; Schölkopf, Burges and Vapnik, 1996; Burges and Schölkopf, 1997), object recognition (Blanz et al., 1996), speaker identification (Schmidt, 1996), charmed quark detection¹, face detection in images (Osuna, Freund and
Girosi, 1997a), and text categorization (Joachims, 1997). For the regression estimation case, SVMs have been compared on benchmark time series prediction tests (Müller et al., 1997; Mukherjee, Osuna and Girosi, 1997), the Boston housing problem (Drucker et al., 1997), and (on artificial data) on the PET operator inversion problem (Vapnik, Golowich and Smola, 1996). In most of these cases, SVM generalization performance (i.e. error rates on test sets) either matches or is significantly better than that of competing methods. The use of SVMs for density estimation (Weston et al., 1997) and ANOVA decomposition (Stitson et al., 1997) has also been studied.

Regarding extensions, the basic SVMs contain no prior knowledge of the problem (for example, a large class of SVMs for the image recognition problem would give the same results if the pixels were first permuted randomly (with each image suffering the same permutation), an act of vandalism that would leave the best performing neural networks severely handicapped), and much work has been done on incorporating prior knowledge into SVMs (Schölkopf, Burges and Vapnik, 1996; Schölkopf et al., 1998a; Burges, 1998). Although SVMs have good generalization performance, they can be abysmally slow in test phase, a problem addressed in (Burges, 1996; Osuna and Girosi, 1998). Recent work has generalized the basic ideas (Smola, Schölkopf and Müller, 1998a; Smola and Schölkopf, 1998), shown connections to regularization theory (Smola, Schölkopf and Müller, 1998b; Girosi, 1998; Wahba, 1998), and shown how SVM ideas can be incorporated in a wide range of other algorithms (Schölkopf, Smola and Müller, 1998b; Schölkopf et al., 1998c). The reader may also find the thesis of (Schölkopf, 1997) helpful.
The problem which drove the initial development of SVMs occurs in several guises - the bias variance tradeoff (Geman and Bienenstock, 1992), capacity control (Guyon et al., 1992), overfitting (Montgomery and Peck, 1992) - but the basic idea is the same. Roughly speaking, for a given learning task, with a given finite amount of training data, the best generalization performance will be achieved if the right balance is struck between the accuracy attained on that particular training set, and the "capacity" of the machine, that is, the ability of the machine to learn any training set without error. A machine with too much capacity is like a botanist with a photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything she has seen before; a machine with too little capacity is like the botanist's lazy brother, who declares that if it's green, it's a tree. Neither can generalize well. The exploration and formalization of these concepts has resulted in one of the shining peaks of the theory of statistical learning (Vapnik, 1979).

In the following, bold typeface will indicate vector or matrix quantities; normal typeface will be used for vector and matrix components and for scalars. We will label components of vectors and matrices with Greek indices, and label vectors and matrices themselves with Roman indices. Familiarity with the use of Lagrange multipliers to solve problems with equality or inequality constraints is assumed².

2. A Bound on the Generalization Performance of a Pattern Recognition Learning Machine

There is a remarkable family of bounds governing the relation between the capacity of a learning machine and its performance³. The theory grew out of considerations of under what circumstances, and how quickly, the mean of some empirical quantity converges uniformly,
as the number of data points increases, to the true mean (that which would be calculated from an infinite amount of data) (Vapnik, 1979). Let us start with one of these bounds.

The notation here will largely follow that of (Vapnik, 1995). Suppose we are given l observations. Each observation consists of a pair: a vector x_i ∈ R^n, i = 1, ..., l, and the associated "truth" y_i, given to us by a trusted source. In the tree recognition problem, x_i might be a vector of pixel values (e.g. n = 256 for a 16×16 image), and y_i would be 1 if the image contains a tree, and −1 otherwise (we use −1 here rather than 0 to simplify subsequent formulae). Now it is assumed that there exists some unknown probability distribution P(x, y) from which these data are drawn, i.e., the data are assumed "iid" (independently drawn and identically distributed). (We will use P for cumulative probability distributions, and p for their densities.) Note that this assumption is more general than associating a fixed y with every x: it allows there to be a distribution of y for a given x. In that case, the trusted source would assign labels y_i according to a fixed distribution, conditional on x_i. However, after this Section, we will be assuming fixed y for given x.

Now suppose we have a machine whose task it is to learn the mapping x_i ↦ y_i. The machine is actually defined by a set of possible mappings x ↦ f(x, α), where the functions f(x, α) themselves are labeled by the adjustable parameters α. The machine is assumed to be deterministic: for a given input x, and choice of α, it will always give the same output f(x, α). A particular choice of α generates what we will call a "trained machine." Thus, for example, a neural network with fixed architecture, with α corresponding to the weights and biases, is a learning machine in this sense.
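To make the definition concrete, a "machine" in this sense is just a parametrized family of decision functions. A minimal sketch (my own illustration, not from the tutorial) using the oriented-hyperplane family treated in Section 2.2:

```python
def f(x, alpha):
    """One member of the family of oriented hyperplanes in R^n:
    alpha = (w, b), and f(x, alpha) = 1 if w . x + b > 0, else -1."""
    w, b = alpha
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s > 0 else -1

# A particular choice of alpha gives a "trained machine":
alpha = ((1.0, -1.0), 0.0)   # the oriented line x1 = x2 in R^2
print(f((2.0, 0.0), alpha))  # 1
print(f((0.0, 2.0), alpha))  # -1
```

The machine is deterministic: the same x and the same alpha always produce the same output.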
The expectation of the test error for a trained machine is therefore:

    R(α) = ∫ (1/2) |y − f(x, α)| dP(x, y)    (1)

Note that, when a density p(x, y) exists, dP(x, y) may be written p(x, y) dx dy. This is a nice way of writing the true mean error, but unless we have an estimate of what P(x, y) is, it is not very useful.

The quantity R(α) is called the expected risk, or just the risk. Here we will call it the actual risk, to emphasize that it is the quantity that we are ultimately interested in. The "empirical risk" Remp(α) is defined to be just the measured mean error rate on the training set (for a fixed, finite number of observations)⁴:

    Remp(α) = (1/2l) Σ_{i=1}^{l} |y_i − f(x_i, α)|.    (2)

Note that no probability distribution appears here. Remp(α) is a fixed number for a particular choice of α and for a particular training set {x_i, y_i}.

The quantity (1/2)|y_i − f(x_i, α)| is called the loss. For the case described here, it can only take the values 0 and 1. Now choose some η such that 0 ≤ η ≤ 1. Then for losses taking these values, with probability 1 − η, the following bound holds (Vapnik, 1995):

    R(α) ≤ Remp(α) + √( (h(log(2l/h) + 1) − log(η/4)) / l )    (3)
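The quantities in Eqs. (2) and (3) are straightforward to compute once h is known. A sketch (my own illustration, not from the tutorial, taking log as the natural logarithm):

```python
import math

def empirical_risk(ys, preds):
    """Eq. (2): the mean of the 0-1 loss (1/2)|y_i - f(x_i, alpha)|."""
    return sum(abs(y - p) for y, p in zip(ys, preds)) / (2 * len(ys))

def vc_confidence(h, l, eta):
    """The second term on the right hand side of Eq. (3)."""
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

def risk_bound(ys, preds, h, eta=0.05):
    """Right hand side of Eq. (3); an upper bound on the actual risk
    that holds with probability 1 - eta."""
    return empirical_risk(ys, preds) + vc_confidence(h, len(ys), eta)

# A machine that errs on 5 of 100 training points has empirical risk 0.05.
ys    = [1] * 100
preds = [-1] * 5 + [1] * 95
print(empirical_risk(ys, preds))    # 0.05
print(risk_bound(ys, preds, h=10))  # empirical risk plus VC confidence
```

Note that the bound depends on the training data only through l and Remp; the distribution P(x, y) never enters.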
where h is a non-negative integer called the Vapnik Chervonenkis (VC) dimension, and is a measure of the notion of capacity mentioned above. In the following we will call the right hand side of Eq. (3) the "risk bound." We depart here from some previous nomenclature: the authors of (Guyon et al., 1992) call it the "guaranteed risk", but this is something of a misnomer, since it is really a bound on a risk, not a risk, and it holds only with a certain probability, and so is not guaranteed. The second term on the right hand side is called the "VC confidence."

We note three key points about this bound. First, remarkably, it is independent of P(x, y). It assumes only that both the training data and the test data are drawn independently according to some P(x, y). Second, it is usually not possible to compute the left hand side. Third, if we know h, we can easily compute the right hand side. Thus given several different learning machines (recall that "learning machine" is just another name for a family of functions f(x, α)), and choosing a fixed, sufficiently small η, by then taking that machine which minimizes the right hand side, we are choosing that machine which gives the lowest upper bound on the actual risk. This gives a principled method for choosing a learning machine for a given task, and is the essential idea of structural risk minimization (see Section 2.6).

Given a fixed family of learning machines to choose from, to the extent that the bound is tight for at least one of the machines, one will not be able to do better than this. To the extent that the bound is not tight for any, the hope is that the right hand side still gives useful information as to which learning machine minimizes the actual risk. The bound not being tight for the whole chosen family of learning machines gives critics a justifiable target at which to fire their complaints. At present, for this case, we must rely on experiment to be the judge.

2.1. The VC Dimension

The VC dimension is a property of a set of functions {f(α)} (again, we use α as a generic set of parameters: a choice of α specifies a particular function), and can be defined for various classes of function f. Here we will only consider functions that correspond to the two-class pattern recognition case, so that f(x, α) ∈ {−1, 1} ∀x, α. Now if a given set of l points can be labeled in all possible 2^l ways, and for each labeling, a member of the set {f(α)} can be found which correctly assigns those labels, we say that that set of points is shattered by that set of functions. The VC dimension for the set of functions {f(α)} is defined as the maximum number of training points that can be shattered by {f(α)}. Note that, if the VC dimension is h, then there exists at least one set of h points that can be shattered, but in general it will not be true that every set of h points can be shattered.

2.2. Shattering Points with Oriented Hyperplanes in R^n

Suppose that the space in which the data live is R^2, and the set {f(α)} consists of oriented straight lines, so that for a given line, all points on one side are assigned the class 1, and all points on the other side, the class −1. The orientation is shown in Figure 1 by an arrow, specifying on which side of the line points are to be assigned the label 1. While it is possible to find three points that can be shattered by this set of functions, it is not possible to find four. Thus the VC dimension of the set of oriented lines in R^2 is three.
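The claim that oriented lines shatter three points in R^2 but never four can be checked numerically by brute force. The sketch below (my own illustration, not from the tutorial) decides linear separability approximately, by observing that two point sets are separable by an oriented line iff their projections onto some direction can be split by a threshold, and sampling candidate directions densely; this is adequate for such small configurations:

```python
import math
from itertools import product

def separable(pos, neg, n_angles=720):
    """Approximate test: is there an oriented line in R^2 with all of
    `pos` on one side and all of `neg` on the other?  We sample
    directions w and look for a gap between the two projection sets."""
    for k in range(n_angles):
        t = math.pi * k / n_angles
        w = (math.cos(t), math.sin(t))
        p = [w[0] * x + w[1] * y for x, y in pos]
        n = [w[0] * x + w[1] * y for x, y in neg]
        if not p or not n or max(n) < min(p) or max(p) < min(n):
            return True
    return False

def shattered(points):
    """True if every one of the 2^l labelings is realized by an oriented line."""
    return all(
        separable([q for q, y in zip(points, ls) if y == 1],
                  [q for q, y in zip(points, ls) if y == -1])
        for ls in product([1, -1], repeat=len(points)))

print(shattered([(0, 0), (1, 0), (0, 1)]))          # True: 3 points shattered
print(shattered([(0, 0), (1, 0), (0, 1), (1, 1)]))  # False: the XOR labeling fails
```

For four points the search always finds an unrealizable labeling; for the square above it is the "XOR" labeling, with the two diagonals in opposite classes.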
Figure 1. Three points in R^2, shattered by oriented lines.

Let's now consider hyperplanes in R^n. The following theorem will prove useful (the proof is in the Appendix):

Theorem 1. Consider some set of m points in R^n. Choose any one of the points as origin. Then the m points can be shattered by oriented hyperplanes⁵ if and only if the position vectors of the remaining points are linearly independent⁶.

Corollary: The VC dimension of the set of oriented hyperplanes in R^n is n + 1, since we can always choose n + 1 points, and then choose one of the points as origin, such that the position vectors of the remaining n points are linearly independent, but can never choose n + 2 such points (since no n + 1 vectors in R^n can be linearly independent).

An alternative proof of the corollary can be found in (Anthony and Biggs, 1995), and references therein.

2.3. The VC Dimension and the Number of Parameters

The VC dimension thus gives concreteness to the notion of the capacity of a given set of functions. Intuitively, one might be led to expect that learning machines with many parameters would have high VC dimension, while learning machines with few parameters would have low VC dimension. There is a striking counterexample to this, due to E. Levin and J.S. Denker (Vapnik, 1995): a learning machine with just one parameter, but with infinite VC dimension (a family of classifiers is said to have infinite VC dimension if it can shatter l points, no matter how large l). Define the step function θ(x), x ∈ R: {θ(x) = 1 ∀x > 0; θ(x) = −1 ∀x ≤ 0}. Consider the one-parameter family of functions, defined by

    f(x, α) ≡ θ(sin(αx)), x, α ∈ R.    (4)

You choose some number l, and present me with the task of finding l points that can be shattered. I choose them to be:
    x_i = 10^-i, i = 1, ..., l.    (5)

You specify any labels you like:

    y_1, y_2, ..., y_l, y_i ∈ {−1, 1}.    (6)

Then f(α) gives this labeling if I choose α to be

    α = π(1 + Σ_{i=1}^{l} (1 − y_i)10^i / 2).    (7)

Thus the VC dimension of this machine is infinite.

Figure 2. Four points that cannot be shattered by θ(sin(αx)), despite infinite VC dimension.

Interestingly, even though we can shatter an arbitrarily large number of points, we can also find just four points that cannot be shattered. They simply have to be equally spaced, and assigned labels as shown in Figure 2 (here taken to be y_1 = y_2 = 1, y_3 = −1, y_4 = 1). This can be seen as follows: write the phase at x_1 as φ_1 = 2nπ + δ. Then the choice of label y_1 = 1 requires 0 < δ < π. Since the points are equally spaced, x_k = kx_1, the phase at x_k, mod 2π, is kδ. Thus y_2 = 1 requires δ < π/2, and y_3 = −1 requires δ > π/3. But then π/3 < δ < π/2 implies 4π/3 < 4δ < 2π, so that f(x_4, α) = −1: the label y_4 = 1 cannot be achieved, and this labeling cannot be realized.

2.4. Minimizing The Bound by Minimizing h

Figure 3 shows how the VC confidence (the second term on the right hand side of Eq. (3)) varies with h/l, given a 95% confidence level (η = 0.05) and a training sample of size l = 10,000. The VC confidence is a monotonic increasing function of h, whatever the value of l. Thus, given several learning machines with zero empirical risk, one should choose that machine whose associated set of functions has minimal VC dimension, since this gives the lowest upper bound on the actual risk; for non-zero empirical risk, one should choose the machine which minimizes the whole right hand side of Eq. (3). Note also that for h/l > 0.37 (and for η = 0.05 and l = 10,000), the VC confidence exceeds unity, and so for higher values the bound is guaranteed not tight.
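Eqs. (5)-(7) can be verified directly for small l. The following sketch (my own check, not from the paper) confirms that the α of Eq. (7) reproduces an arbitrary labeling of the points x_i = 10^-i:

```python
import math

def f(x, alpha):
    """Eq. (4): theta(sin(alpha * x)), with theta(z) = 1 for z > 0, else -1."""
    return 1 if math.sin(alpha * x) > 0 else -1

def alpha_for(labels):
    """Eq. (7): the single parameter alpha realizing `labels` on x_i = 10^-i."""
    return math.pi * (1 + sum((1 - y) * 10**i / 2
                              for i, y in enumerate(labels, start=1)))

labels = [1, -1, -1, 1]
a = alpha_for(labels)
print([f(10**-i, a) for i in range(1, 5)])   # [1, -1, -1, 1]
```

The trick is that each term (1 − y_i)10^i/2 contributes an even multiple of π to the phase at every x_j with j < i, so the labels can be set independently; any of the 2^l labelings is realized this way.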
Figure 3. VC confidence is monotonic in h.

2.5. Two Examples

Consider the k'th nearest neighbour classifier, with k = 1. This set of functions has infinite VC dimension and zero empirical risk, since any number of points, labeled arbitrarily, will be successfully learned by the algorithm (provided no two points of opposite class lie right on top of each other). Thus the bound provides no information. In fact, for any classifier with infinite VC dimension, the bound is not even valid (see footnote 7). However, even though the bound is not valid, nearest neighbour classifiers can still perform well. Thus this first example is a cautionary tale: infinite "capacity" does not guarantee poor performance.

Let's follow the time honoured tradition of understanding things by trying to break them, and see if we can come up with a classifier for which the bound is supposed to hold, but which violates the bound. We want the left hand side of Eq. (3) to be as large as possible, and the right hand side to be as small as possible. So we want a family of classifiers which gives the worst possible actual risk of 0.5, zero empirical risk up to some number of training observations, and whose VC dimension is easy to compute and is less than l (so that the bound is non-trivial). An example is the following, which I call the "notebook classifier." This classifier consists of a notebook with enough room to write down the classes of m training observations, where m ≤ l. For all subsequent patterns, the classifier simply says that all patterns have the same class. Suppose also that the data have as many positive (y = +1) as negative (y = −1) examples, and that the samples are chosen randomly. The notebook classifier will have zero empirical risk for up to m observations; 0.5 training error for all subsequent observations; 0.5 actual error, and VC dimension h = m.
Substituting these values in Eq. (3), the bound becomes:

m/(4l) ≤ ln(2l/m) + 1 − (1/m) ln(η/4) (8)

which is certainly met for all η if

f(z) = (z/2) exp(z/4 − 1) ≤ 1, z ≡ m/l, 0 ≤ z ≤ 1 (9)

which is true, since f(z) is monotonically increasing and f(z = 1) = 0.236.
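A few lines of arithmetic (my own check, not from the text) confirm the two claims behind Eq. (9): f(z) stays below 1 on [0, 1], is monotonically increasing there, and f(1) ≈ 0.236:

```python
import math

def f(z):
    # f(z) = (z/2) * exp(z/4 - 1), as in Eq. (9)
    return (z / 2) * math.exp(z / 4 - 1)

zs = [i / 1000 for i in range(1001)]                  # grid on [0, 1]
vals = [f(z) for z in zs]
assert max(vals) <= 1                                 # the bound of Eq. (9) holds
assert all(b >= a for a, b in zip(vals, vals[1:]))    # monotonic increasing
print(round(f(1.0), 3))                               # 0.236
```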
Figure 4. Nested subsets of functions, ordered by VC dimension (h1 < h2 < h3 < ···).

2.6. Structural Risk Minimization

We can now summarize the principle of structural risk minimization (SRM) (Vapnik, 1979). Note that the VC confidence term in Eq. (3) depends on the chosen class of functions, whereas the empirical risk and actual risk depend on the one particular function chosen by the training procedure. We would like to find that subset of the chosen set of functions, such that the risk bound for that subset is minimized. Clearly we cannot arrange things so that the VC dimension h varies smoothly, since it is an integer. Instead, introduce a "structure" by dividing the entire class of functions into nested subsets (Figure 4). For each subset, we must be able either to compute h, or to get a bound on h itself. SRM then consists of finding that subset of functions which minimizes the bound on the actual risk. This can be done by simply training a series of machines, one for each subset, where for a given subset the goal of training is simply to minimize the empirical risk. One then takes that trained machine in the series whose sum of empirical risk and VC confidence is minimal.

We have now laid the groundwork necessary to begin our exploration of support vector machines.

3. Linear Support Vector Machines

3.1. The Separable Case

We will start with the simplest case: linear machines trained on separable data (as we shall see, the analysis for the general case, nonlinear machines trained on non-separable data, results in a very similar quadratic programming problem). Again label the training data {xi, yi}, i = 1, ··· , l, yi ∈ {−1, 1}, xi ∈ R^d. Suppose we have some hyperplane which separates the positive from the negative examples (a "separating hyperplane").
The points x which lie on the hyperplane satisfy w · x + b = 0, where w is normal to the hyperplane, |b|/||w|| is the perpendicular distance from the hyperplane to the origin, and ||w|| is the Euclidean norm of w. Let d+ (d−) be the shortest distance from the separating hyperplane to the closest positive (negative) example. Define the "margin" of a separating hyperplane to be d+ + d−. For the linearly separable case, the support vector algorithm simply looks for the separating hyperplane with largest margin. This can be formulated as follows: suppose that all the training data satisfy the following constraints:
Figure 5. Linear separating hyperplanes for the separable case. The support vectors are circled.

xi · w + b ≥ +1 for yi = +1 (10)

xi · w + b ≤ −1 for yi = −1 (11)

These can be combined into one set of inequalities:

yi(xi · w + b) − 1 ≥ 0 ∀i (12)

Now consider the points for which the equality in Eq. (10) holds (requiring that there exists such a point is equivalent to choosing a scale for w and b). These points lie on the hyperplane H1 : xi · w + b = 1 with normal w and perpendicular distance from the origin |1 − b|/||w||. Similarly, the points for which the equality in Eq. (11) holds lie on the hyperplane H2 : xi · w + b = −1, with normal again w, and perpendicular distance from the origin |−1 − b|/||w||. Hence d+ = d− = 1/||w|| and the margin is simply 2/||w||. Note that H1 and H2 are parallel (they have the same normal) and that no training points fall between them. Thus we can find the pair of hyperplanes which gives the maximum margin by minimizing ||w||², subject to constraints (12).

Thus we expect the solution for a typical two dimensional case to have the form shown in Figure 5. Those training points for which the equality in Eq. (12) holds (i.e. those which wind up lying on one of the hyperplanes H1, H2), and whose removal would change the solution found, are called support vectors; they are indicated in Figure 5 by the extra circles.

We will now switch to a Lagrangian formulation of the problem. There are two reasons for doing this. The first is that the constraints (12) will be replaced by constraints on the Lagrange multipliers themselves, which will be much easier to handle. The second is that in this reformulation of the problem, the training data will only appear (in the actual training and test algorithms) in the form of dot products between vectors. This is a crucial property which will allow us to generalize the procedure to the nonlinear case (Section 4).
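The geometry above is easy to check numerically. The sketch below (my own toy data and hyperplane, not from the tutorial) verifies constraints (12) for a known separating hyperplane and confirms that the margin is 2/||w||:

```python
# Toy 2-D data: positives with first coordinate >= 1, negatives with
# first coordinate <= -1.  The hyperplane w.x + b = 0 with w = (1, 0),
# b = 0 separates them, with H1: w.x + b = 1 and H2: w.x + b = -1.
data = [((1.0, 0.5), +1), ((2.0, -1.0), +1),
        ((-1.0, 0.0), -1), ((-3.0, 2.0), -1)]
w, b = (1.0, 0.0), 0.0

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

# constraint (12): y_i (x_i . w + b) - 1 >= 0 for every training point
assert all(y * (dot(x, w) + b) - 1 >= 0 for x, y in data)

# margin = d+ + d- = 2 / ||w||
norm_w = dot(w, w) ** 0.5
margin = 2 / norm_w
print(margin)   # 2.0
```

The two points that satisfy Eq. (12) with equality, (1.0, 0.5) and (−1.0, 0.0), lie on H1 and H2 respectively; they play the role of the circled support vectors in Figure 5.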
Thus, we introduce positive Lagrange multipliers αi, i = 1, ··· , l, one for each of the inequality constraints (12). Recall that the rule is that for constraints of the form ci ≥ 0, the constraint equations are multiplied by positive Lagrange multipliers and subtracted from
the objective function, to form the Lagrangian. For equality constraints, the Lagrange multipliers are unconstrained. This gives the Lagrangian:

LP ≡ (1/2)||w||² − Σ_{i=1}^{l} αi yi(xi · w + b) + Σ_{i=1}^{l} αi (13)

We must now minimize LP with respect to w, b, and simultaneously require that the derivatives of LP with respect to all the αi vanish, all subject to the constraints αi ≥ 0 (let's call this particular set of constraints C1). Now this is a convex quadratic programming problem, since the objective function is itself convex, and those points which satisfy the constraints also form a convex set (any linear constraint defines a convex set, and a set of N simultaneous linear constraints defines the intersection of N convex sets, which is also a convex set). This means that we can equivalently solve the following "dual" problem: maximize LP, subject to the constraints that the gradient of LP with respect to w and b vanish, and subject also to the constraints that the αi ≥ 0 (let's call that particular set of constraints C2). This particular dual formulation of the problem is called the Wolfe dual (Fletcher, 1987). It has the property that the maximum of LP, subject to constraints C2, occurs at the same values of w, b and α as the minimum of LP, subject to constraints C1 (see footnote 8).

Requiring that the gradient of LP with respect to w and b vanish gives the conditions:

w = Σ_i αi yi xi (14)

Σ_i αi yi = 0. (15)

Since these are equality constraints in the dual formulation, we can substitute them into Eq. (13) to give

LD = Σ_i αi − (1/2) Σ_{i,j} αi αj yi yj xi · xj (16)

Note that we have now given the Lagrangian different labels (P for primal, D for dual) to emphasize that the two formulations are different: LP and LD arise from the same objective function but with different constraints; and the solution is found by minimizing LP or by maximizing LD.
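As a minimal worked instance of the dual (my own illustration, not the training procedure the tutorial describes later), take just two training points, x1 = (1, 0) with y1 = +1 and x2 = (−1, 0) with y2 = −1. Constraint (15) then forces α1 = α2 = a, and Eq. (16) collapses to LD(a) = 2a − 2a², which a few gradient-ascent steps maximize:

```python
# Two-point toy problem: x1 = (1, 0), y1 = +1; x2 = (-1, 0), y2 = -1.
# Constraint (15) gives alpha1 = alpha2 = a, so Eq. (16) reads
# L_D(a) = 2a - 2a^2, maximized at a = 1/2.
a = 0.0
for _ in range(200):
    a = max(0.0, a + 0.1 * (2 - 4 * a))   # ascend dL_D/da = 2 - 4a, keep a >= 0

# Eq. (14): w = sum_i alpha_i y_i x_i = a*(1, 0) - a*(-1, 0) = (2a, 0)
w = (2 * a, 0.0)
margin = 2 / (w[0] ** 2 + w[1] ** 2) ** 0.5
print(a, w, margin)   # a -> 0.5, w -> (1.0, 0.0), margin -> 2.0
```

Both multipliers end up strictly positive (a = 1/2), so both points are support vectors, and the resulting margin of 2 matches the geometry: the points are distance 2 apart, split by the hyperplane x · (1, 0) = 0.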
Note also that if we formulate the problem with b = 0, which amounts to requiring that all hyperplanes contain the origin, the constraint (15) does not appear. This is a mild restriction for high dimensional spaces, since it amounts to reducing the number of degrees of freedom by one. Support vector training (for the separable, linear case) therefore amounts to maximizing LD with respect to the αi, subject to constraints (15) and positivity of the αi, with solution given by (14). Notice that there is a Lagrange multiplier αi for every training point. In the solution, those points for which αi > 0 are called “support vectors”, and lie on one of the hyperplanes H1, H2. All other training points have αi = 0 and lie either on H1 or H2 (such that the equality in Eq. (12) holds), or on that side of H1 or H2 such that the