Preface Statistical Learning Theory now plays a more active role:after the general analysis of learning processes,the res earch in the area of synthesis of optimal algorithms was started.These studies,however,do not belong to history yet.They are a sulject of today's research activities. Vladimir Vapnik (1995) The Support Vector Machine has recently been introduced as a new technique for solving a variety of learning and function estimation problems.During a workshop at the annual Neur al Information Processing Systems (NIPS)conference,held in Breckenridge,Colorado in December 1997,a snapshot of the state of the art in Support Vector learning was recorded.A variety of people helped in this,among them our co-organizer Leon Bottou,the NIPS workshop chairs Steve Nowlan and Rich Zemel,and all the workshop speakers and attendees who contributed to lively discussions.After the workshop,we decided that it would be worthwhile to invest some time to have the snapshot printed. We invited all the speakers as well as other researchers to submit papers for this collection,and integr ated the results into the present book.We believe that it covers the full range of current Support Vector research at an early point in time.This is possible for two reasons.First,the field of SV learning is in its early (and thus exciting)days.Second,this book gathers expertise from all contributers,whom we wholeheartedly thank for all the work they have put into our joint effort.Any single person trying to accomplish this task would most likely have failed:either by writing a book which is less comprehensive,or by taking more time to complete the book. It is our hope that this outweighs the shortcomings of the book,most notably the fact that a collection of chapters can never be as homogeneous as a book conceived by a single person.We have tried to compensate for this by the selection and refereeing process of the submissions.In addition,we have written an introductory chapter describing the SV algorithm in some detail (chapter 1),and added a roadmap (chapter 2)which describes the actual contributions which are to follow in chapters 3 through 20. Bernhar d Scholkopf,Christopher J.C.Burges,Alexander J.Smola Berlin,Holmdel,July 1998 1998/08/251631
Preface Statistical Learning Theory now plays a more active role after the general analysis of learning processes the research in the area of synthesis of optimal algorithms was started These studies however do not belong to history yet They are a subject of todays research activities Vladimir Vapnik The Support Vector Machine has recently been introduced as a new technique for solving a variety of learning and function estimation problems During a workshop at the annual Neural Information Processing Systems NIPS conference held in Breckenridge Colorado in December a snapshot of the state of the art in Support Vector learning was recorded A variety of people helped in this among them our coorganizer Leon Bottou the NIPS workshop chairs Steve Nowlan and Rich Zemel and all the workshop speakers and attendees who contributed to lively discussions After the workshop we decided that it would be worthwhile to invest some time to have the snapshot printed We invited all the speakers as well as other researchers to submit papers for this collection and integrated the results into the present book We believe that it covers the full range of current Support Vector research at an early point in time This is possible for two reasons First the eld of SV learning is in its early and thus exciting days Second this book gathers expertise from all contributers whom we wholeheartedly thank for all the work they have put into our joint e ort Any single person trying to accomplish this task would most likely have failed either by writing a book which is less comprehensive or by taking more time to complete the book It is our hope that this outweighs the shortcomings of the book most notably the fact that a collection of chapters can never be as homogeneous as a book conceived by a single person We have tried to compensate for this by the selection and refereeing process of the submissions In addition we have written an introductory chapter describing the SV algorithm in some detail chapter and added a roadmap chapter which describes the actual contributions which are to follow in chapters through Bernhard Scholkopf Christopher JC Burges Alexander J Smola Berlin Holmdel July
Introduction to Support Vector Learning The goal of this chapter,which describes the centr al ideas of SV learning,is twofold. First,we want to provide an introduction for readers unfamiliar with this field. Second,this introduction serves as a source of the basic equations for the chapters of this book.For more exhaustive treatments,we refer the interested reader to Vapnik (1995);Scholkopf (1997);Burges (1998). 1.1 Learning Pattern Recognition from Examples Let us start with the problem of learning how to recognize patterns.Formally,we Training want to estimate a function f:RN{1}using input-output training data Data (x.,.).,(x,y)∈RN×{±1, 199/ such that f will correctly classify unseen examples (x,y),i.e.f(x)=y for examples (x,y)that were gener ated from the same underlying probability distribution P(x,y) as the training dat a.If we put no restriction on the class of functions that we choose our estimate f from,however,even a function which does well on the training data,e.g.by satisfying f(xi)=yi for all i=1,...,C,need not generalize well Test to unseen examples.To see this,note that for each function f and any test set Data (仅,立.),,(60∈RN×{±1},satisfying{区.,…,0n{x.,,x}={, there exists another function f*such that f*(xi)=f(xi)for all i=1,...,, yet f*()f()for all i 1,...,6.As we are only given the training data, we have no means of selecting which of the two functions (and hence which of the completely different sets of test outputs)is preferable.Hence,only minimizing the Empirical training error (or Onpilca lisk), Risk Rmlh=2∑fk)-M 1 190/ ). Risk does not imply a small test error (called Iisk),averaged over test examples dr awn from the underlying distribution P(x,y), Rlf]=f(x)-ul dP(x,). 192/ Statistical learning theory (Vapnik and Chervonenkis,1974;Vapnik,1979),or VC (Vapnik-Chervonenkis)theory,shows that it is imperative to restrict the class of 1998/08/251631
Introduction to Support Vector Learning The goal of this chapter which describes the central ideas of SV learning is twofold First we want to provide an introduction for readers unfamiliar with this eld Second this introduction serves as a source of the basic equations for the chapters of this book For more exhaustive treatments we refer the interested reader to Vapnik Scholkopf Burges Learning Pattern Recognition from Examples Let us start with the problem of learning how to recognize patterns Formally we want to estimate a function f R Training N fg using inputoutput training data Data x y x y RN fg such that f will correctly classify unseen examples x y ie f x y for examples x y that were generated from the same underlying probability distribution P x y as the training data If we put no restriction on the class of functions that we choose our estimate f from however even a function which does well on the training data eg by satisfying f xi yi for all i need not generalize well Test to unseen examples To see this note that for each function f and any test set Data x y x y RN fg satisfying fx x gfx xg fg there exists another function f such that f xi f xi for all i yet f xi f xi for all i As we are only given the training data we have no means of selecting which of the two functions and hence which of the completely di erent sets of test outputs is preferable Hence only minimizing the Empirical training error or empirical risk Risk Rempf X i jf xi yi j Risk does not imply a small test error called risk averaged over test examples drawn from the underlying distribution P x y Rf Z jf x yj dP x y Statistical learning theory Vapnik and Chervonenkis Vapnik or VC VapnikChervonenkis theory shows that it is imperative to restrict the class of
1 Introduction to Support Vector Learning frctiatfice togewhic ha acapacity atis suer teanat c aaaletarrgcetal tear poice bunds athetet EnaLThemImkatimo teebaurd,wh cererdabch teen pka nik ardthecaraat c terurctimncas,lead toteprapec structural risk minimization (k,1979)Thebetkro Iarecty crceItc tei te Vcmeria VC dimension,cireda thelagetniberh cf ians thatcanbeseraatedmn aIOsblewas urgfurctim o tesyengas (flGater4)lane an pec a bardi theraoirg<e te ameriao thecas ofuctia atthelcamrgimac irecanm penert tenfcra furctia c thatcas,wh apoallt caticati-7,tebard R(a)≤Remp(a)+ ’h,1Cg) (1.4) hOC wheetheconfidence term Cireda h1Cg+1)-1C/4) (1.5) Tig terbaurd canberainuatedinterns ccercactt,sGeannealed VC entropy r theGrowth function ITheeaellaly (3KCeledtobehacer toealate(clhoeer cafer9),bit te pa afurcanerfa ioemte cacpha jatc the (prik,1995)1Alterrativecaract carcept that canbeuedtof m uatebcurd Irclcethefat shattering dimension,(fKafer4 Theb aurd(14)ceere scherutere parator lenaks kiplcewewaked toicamna“g中acdG”wheP(xty)-Ps)·Pg),il1 whehelattemx (aain roigrinatimaat thelaey,wth ufai P)IGMEnatianrg san pec fixedske wecanthensuey (elpwIth alcarrgmarewhK aee zuotarirge (poicdwehaeroe an pe cahaccirgeag che lHoer mnacer toigricckcetelacan iaeg.marewl recesanly Iequealage CmerilThu te cicerce term (15). raaⅡg踟carKally w,wll belage ardh ebaurd(.4)wIl not sL正at IOslencre thatdletotesma tarrgena weshcude rectasm ai tet enThs mae turcerstarcablen (14)cahodircerercerto ast It aattheurcelryIrgds tbltimp(xy):Itawas hod (pokedthath<), bititcte rct aw as maearaja pecctim abardcanenclate beche vadif Its lagerthantem ai uh encriatelinacertogetrafi ja peccticfic,(14)herurctian Iacemutberetrctedsic tattecaraat egl cin enia i sma erag (ineatiantot eaaialeancutc catal 1.2 Hyperplane Classifiers Toceigncanirgagtms cetuareed toccerp acas ofuctia whce caraat canbe coptecl pk ard lerer (1963)ard prik ard 1.8/08/25131
Introduction to Support Vector Learning functions that f is chosen from to one which has a capacity that is suitable for the amount of available training data VC theory provides bounds on the test error The minimization of these bounds which depend on both the empirical risk and the capacity of the function class leads to the principle of structural risk minimization Vapnik The bestknown capacity concept of VC theory is the VC dimension VC dimension de ned as the largest number h of points that can be separated in all possible ways using functions of the given class cf chapter An example of a VC bound is the following if h is the VC dimension of the class of functions that the learning machine can implement then for all functions of that class with a probability of at least the bound R Remp h log holds where the con dence term is de ned as h log s h log h log Tighter bounds can be formulated in terms of other concepts such as the annealed VC entropy or the Growth function These are usually considered to be harder to evaluate cf however chapter but they play a fundamental role in the conceptual part of VC theory Vapnik Alternative capacity concepts that can be used to formulate bounds include the fat shattering dimension cf chapter The bound deserves some further explanatory remarks Suppose we wanted to learn a dependency where P x y P x P y ie where the pattern x contains no information about the label y with uniform P y Given a training sample of xed size we can then surely come up with a learning machine which achieves zero training error provided we have no examples contradicting each other However in order to reproduce the random labellings this machine will necessarily require a large VC dimension h Thus the con dence term increasing monotonically with h will be large and the bound will not support possible hopes that due to the small training error we should expect a small test error This makes it understandable how can hold independent of assumptions about the underlying distribution P x y it always holds provided that h but it does not always make a nontrivial prediction a bound on an error rate becomes void if it is larger than the maximum error rate In order to get nontrivial predictions from the function space must be restricted such that the capacity eg VC dimension is small enough in relation to the available amount of data Hyperplane Classiers To design learning algorithms one thus needs to come up with a class of functions whose capacity can be computed Vapnik and Lerner and Vapnik and
oa Hyperplane Classi/ers Chervonenkis(1964)considered the cass of hyperplanes (w:X)+b=0w∈Rveb∈Re (1.6) corresponding to decision functions f(x)=gn(w·x)+b)0 (1.7) and proposed a learning algorithm for separable problems,termed the Generalized Portrait,for constructing f from empirical data.It is based on two facts.First, among all hyperplanes separating the data,there exists a unique one yielding the Optimal maximum margin of separ ation between the classes, Hy perplane minfx-为l:x∈Rvc(w-xy+b=0ei=1..t0. (1.8) Second,the capacity decreases with increasing margin. {在(w.x)+b=+1} {x|(wx)+b=-1} Note: (w.x)+b=+1 X =+1 (w.x2+b=-1 => (w·(1-x2》=2 红 => (同动)=奇 {x|(wx)+b=0} Figure 1.1 A binary cassification toy problem:separate balls from diamonds.The optimal hyperplane is orthogonal to the shortest line connecting the convex hulls of the two classes (dotted),and intersects it half-way between the two casses.The problem being separ able,there exists a weight vector w and a threshold b such that i((w.x)+b)>0(i=1,...,e).Rescaling w and b such that the point(s)closest to the hyperplane satisfy (wxi)+b=1,we obt ain a canomical form(w,b)of the hy perplane,satisfying y((wx:)+b)>1.Note that in this case,the margin,measured perpendicularly to the hyperplane,equals 2/wll.This can be seen by considering two points x x9on opposite sides of the margin,i.e.(wx6=1,(w-x9+=-1, and projecting them onto the hyperplane normal vector w/wll. 10,1,91l
Hyperplane Classiers Chervonenkis considered the class of hyperplanes w x b w RN b R corresponding to decision functions f x sgnw x b and proposed a learning algorithm for separable problems termed the Generalized Portrait for constructing f from empirical data It is based on two facts First among all hyperplanes separating the data there exists a unique one yielding the Optimal maximum margin of separation between the classes Hyperplane max wb minfkx xik x RN w x b i g Second the capacity decreases with increasing margin . w {x | (w x) + . b = 0} {x | (w x) + . b = −1} {x | (w x) + . b = +1} x2 x1 Note: (w x1) + b = +1 (w x2) + b = −1 => (w (x1−x2)) = 2 => (x1−x2) = w (||w|| ) . . . . 2 ||w|| yi = −1 yi ❍ = +1 ❍ ❍ ❍ ❍ ◆ ◆ ◆ ◆ Figure A binary classi cation toy problem separate balls from diamonds The optimal hyperplane is orthogonal to the shortest line connecting the convex hulls of the two classes dotted and intersects it halfway between the two classes The problem being separable there exists a weight vector w and a threshold b such that yi w xi b i Rescaling w and b such that the points closest to the hyperplane satisfy jw xi bj we obtain a canonical form w b of the hyperplane satisfying yi wxib Note that in this case the margin measured perpendicularly to the hyperplane equals kwk This can be seen by considering two points x x on opposite sides of the margin ie wxb wxb and pro jecting them onto the hyperplane normal vector wkwk
4 Introduction to Support Vector Learning To construct this Optimal Hyperplane (cf.figure 1.1),one solves the following optimization problem: minimize r(w)-lwl2 (1.9) subject to y·(w·x)+b)≥1,i=1,,. (1.10) This constrained optimization problem is dealt with by introducing Lagrange Lagrangian multipliers ai>0 and a Lagrangian L(w,6a)=wP-∑a(kw)+)-1). (1.11) =1 The Lagrangian L has to be minimized with respect to the primal variables w and b and maximized with respect to the dual variables ai (i.e.a saddle point has to be found).Let us try to get some intuition for this.If a constraint (1.10)is violated,then yi.((w.xi)+b)-1 0,in which case L can be increased by increasing the corresponding ai.At the same time,w and b will have to change such that L decreases.To prevent -a;(y;((w.x)+b)-1)from becoming arbitrarily large,the change in w and b will ensure that,provided the problem is separable,the constr aint will eventually be satisfied.Similarly,one can understand that for all constr aints which are not precisely met as equalities,i.e. for which yi((wxi)+b)-1 >0,the corresponding a;must be 0:this is the value of oi that maximizes L.The latter is the statement of the Karush-Kuhn- KKT Tucker complementarity conditions of optimization theory (Karush,1939;Kuhn Conditions and Tucker,1951;Bertsekas,1995). The condition that at the saddle point,the derivatives of L with respect to the primal variables must vanish, 品,6o)=0是a=0 dw (1.12) leads to ∑a:=0 (1.13) =1 and w=∑aUx (1.14) The solution vector thus has an expansion in terms of a subset of the training Support Vector patterns,namely those patterns whose a;is non-zero,called Support Vectors.By the Karush-Kuhn-Tucker complementarity conditions a·:(x·w)+b)-1=0,i=1,,, (1.15) the Support Vectors lie on the margin (cf.figure 1.1).All remaining examples of the training set are irrelevant:their constraint (1.10)does not play a role in the optimization,and they do not appear in the expansion (1.14).This nicely 1998/08/251631
Introduction to Support Vector Learning To construct this Optimal Hyperplane cf gure one solves the following optimization problem minimize w kwk sub ject to yi w xi b i This constrained optimization problem is dealt with by introducing Lagrange Lagrangian multipliers i and a Lagrangian Lw b kwk X i i yi xi w b The Lagrangian L has to be minimized with respect to the primal variables w and b and maximized with respect to the dual variables i ie a saddle point has to be found Let us try to get some intuition for this If a constraint is violated then yi w xi b in which case L can be increased by increasing the corresponding i At the same time w and b will have to change such that L decreases To prevent i yi w xi b from becoming arbitrarily large the change in w and b will ensure that provided the problem is separable the constraint will eventually be satis ed Similarly one can understand that for all constraints which are not precisely met as equalities ie for which yi w xi b the corresponding i must be this is the value of i that maximizes L The latter is the statement of the KarushKuhn KKT Tucker complementarity conditions of optimization theory Karush Kuhn Conditions and Tucker Bertsekas The condition that at the saddle point the derivatives of L with respect to the primal variables must vanish bLw b w Lw b leads to X i iyi and w X i iyixi The solution vector thus has an expansion in terms of a subset of the training patterns namely those patterns whose i Support Vector is nonzero called Support Vectors By the KarushKuhnTucker complementarity conditions i yixi w b i the Support Vectors lie on the margin cf gure All remaining examples of the training set are irrelevant their constraint does not play a role in the optimization and they do not appear in the expansion This nicely
9 Feature Spaces and Kernels captures our intuition of the problem:as thehyperplane(cf.figure1.1)is complet ely determined by the patterns closest to it,the solution should not depend on the oth er examples. By substituting (1.13)and (1.14)into L,one eliminates the primal variables and arrives at the Wolfe dual of the optimiation problem (eg.Bertsckas,1995):find Dual multipliers ai whid Optimiz ation Problem maximt e W(a)= 1 aiailili(Xi·Xi) (1.16) subject to a:≥01i=11..11amd>a:=0. (1.17) 讹, The hyperplane decision function can thus be writt en as 4 f(x)=sgn :a·(x·x)+b (1.18 where b is computed using (1.15). The structure of the optimi ation problem cosely resembles those that typically arise in Logranges formulation of mecanics (eg.Goldstein,1986).Also there often only a subset of the constraints become active For instance f we keep a ball in a b ox,then it will typically roll into one of the corners.The constraints corresponding to the walls whic are not touded by the ball are irrelevant,the walls could just as well be removed. Seen in this light,it is not too surprising that it is possible to give a mecanical interpret ation of optimal margin hyperplanes(Burges and Sdolkopf,1997):If we assume that ead support vector xi exerts a perp endicular force of sie ai and sign yi on a solid plane sheet lying along the hyperplane,then the solution satisfies the requirements of medanical stability.The constraint(1.13)states that the forces on the sheet sum to zero;and(1.14)implies that the torques also sum to zero,via ∑:x:×yaw/lw‖=w×w/lw‖=0. There are several theoretical arguments supporting the good generaliz ation per- formance of the optimal hyp erplane (Vapnik and Chervonenkis (1974);Vapnik (1979),cf.chapters 3 and 4).In addition,it is computationally attractive,since it can be constructed by solving a quadratic programming problem.But how can this begeneralized to the case of decision functions whic,unlike (1.7),are nonlinear in the data? 1.3 Feature Spaces and Kernels To construct SVmadines,the optimal hyperplane algorithm had to be augment ed by a method for computing dot products in feature spaces nonlinearly related to input space (Aizerman et al.,1964;Boser et al.,1992).The b asic idea is to map the Feature space data into some other dot product space (called the feature space)F via a nonlinear .(0,1),9.-6
Feature Spaces and Kernels captures our intuition of the problem as the hyperplane cf gure is completely determined by the patterns closest to it the solution should not depend on the other examples By substituting and into L one eliminates the primal variables and arrives at the Wolfe dual of the optimization problem eg Bertsekas nd Dual multipliers i which Optimization Problem maximize W X i i X ij ij yiyj xi xj sub ject to i i and X i iyi The hyperplane decision function can thus be written as f x sgn X i yii x xi b where b is computed using The structure of the optimization problem closely resembles those that typically arise in Lagranges formulation of mechanics eg Goldstein Also there often only a subset of the constraints become active For instance if we keep a ball in a box then it will typically roll into one of the corners The constraints corresponding to the walls which are not touched by the ball are irrelevant the walls could just as well be removed Seen in this light it is not too surprising that it is possible to give a mechanical interpretation of optimal margin hyperplanes Burges and Scholkopf If we assume that each support vector xi exerts a perpendicular force of size i and sign yi on a solid plane sheet lying along the hyperplane then the solution satis es the requirements of mechanical stability The constraint states that the forces on the sheet sum to zero and implies that the torques also sum to zero via P i xi yii wkwk w wkwk There are several theoretical arguments supporting the good generalization per formance of the optimal hyperplane Vapnik and Chervonenkis Vapnik cf chapters and In addition it is computationally attractive since it can be constructed by solving a quadratic programming problem But how can this be generalized to the case of decision functions which unlike are nonlinear in the data Feature Spaces and Kernels To construct SV machines the optimal hyperplane algorithm had to be augmented by a method for computing dot products in feature spaces nonlinearly related to input space Aizerman et al Boser et al The basic idea is to map the Feature Space data into some other dot product space called the feature space F via a nonlinear
Introduction to Support Vector Learning map 重:RN,F (1.19× and per orm the above linear algorithm in Fi For instance2 suppose we are given patterns x e RN where most in ormation is contained in the dith or der products.monomials=ofentries zj ofxziixj.,.rj2 where j.e...id e {1e.,.N)I In that case2 we might prefer to extract these monomial eatures fir st2 and work in the eature space F o fall products ofd entriesi This approach2 however2 fails fr realistically sized problems:fr N/dimensional input patterns2there exist.N+d-14/.d!.N-14=different monomialsI Already 1602 we can take into account all product o for der up to d.iel including those o forder smaller than d- More gener ally2 the pllowing theorem of finctional analysis shows that kernels llofpositive integral operators give rise to maps such that.120-holds.Mercer2 1a0a:Aizerman et al2 1064:Boser et al121002= Theorem 1.1 (Mercer) If is a continuous symmetric kernel ofa positive integral operator T2 ie .Tf-y-= ‖.xeyf.x= (1.23× with 厂xty-f.x-f.y-w>0 (1.2-× or all f e L2.C=.C being a compact subset of RN=it can be expanded in a uniprmly convergent series.on C<C-in termsofT's eigen finctions j and positive .(0,1),9.-6
Introduction to Support Vector Learning map RN F and perform the above linear algorithm in F For instance suppose we are given patterns x RN where most information is contained in the dth order products monomials of entries xj of x ie xj xjd where jjd fNg In that case we might prefer to extract these monomial features rst and work in the feature space F of all products of d entries This approach however fails for realistically sized problems for Ndimensional input patterns there exist N d d N di erent monomials Already pixel input images eg in character recognition and a monomial degree d yield a dimensionality of This problem can be overcome by noticing that both the construction of the optimal hyperplane in F cf and the evaluation of the corresponding decision function only require the evaluation of dot products x y and never the mapped patterns x in explicit form This is crucial since in some cases the Mercer Kernel dot products can be evaluated by a simple kernel kx y x y For instance the polynomial kernel kx yx yd can be shown to correspond to a map into the space spanned by all products of exactly d dimensions of RN Poggio Boser et al Burges for a proof see chapter For d and x y R eg we have Vapnik x y x x p xxy y p yy x y de ning xx x p xx By using kx yx y cd with c we can take into account all product of order up to d ie including those of order smaller than d More generally the following theorem of functional analysis shows that kernels k of positive integral operators give rise to maps such that holds Mercer Aizerman et al Boser et al Theorem Mercer If k is a continuous symmetric kernel of a positive integral operator T ie T f y Z C kx yf x dx with Z CC kx yf xf y dx dy for all f LC C being a compact subset of RN it can be expanded in a uniformly convergent series on CC in terms of T s eigenfunctions j and positive
(1 Feature Spaces and Kernels 7 eigenvalues入y2 k.xiy== ∑入xy (1.25) j= where NF<oo is the number ofpositive eigenvalues Note that originally proven fr the case where C=[ab.a/bE R2 this theorem also holds true or gener al compact spaces.Dun ord and Schwartz2 1063-1 An equivalent way to characterize Mer cer kernels is that they give rise to positive matrices K:=k.xxj-fr all {1x}.Saitoh21088-One ofthe implications that need to be proven to show this equivalence pllows fom the fact that Kij is a Gram matrix:ra∈R2 we have.a·Ka==‖-12④.x:‖2≥0i Fom.125=2it is str aight prward to construct a map into a potentially infinite/ dimensional l2 space which satisfies.120-For instance2 we may use Φ.x==.Vh吻.x4VA22.xH999 (1.26) Rather than thinking of the eature space as an 12 space2 we can alternatively represent it as the Hilbert space H containing all linear combinations of the finctions f.s==k.xlg=.x;∈C=To ensure that the map重:C→H2 which in this case is defined as Φ.X==k.X19曰 (1.27) satisfies.1120-2 we need to endow H with a suitable dot product (1In view of the definition ofthis dot product needs to satis (k.xigk.yio)=k.xiy (1.28 Repro ducing which amounts to saying that k is a reprodu cing kerndl fr Hi For a Mer cer kernel Kernel .1125-2such a dot product does existi Since k is symmetric2 the i.i=119991 NF- can be chosen to be orthogonal with respect to the dot product in L2.C=2ie .jnL(c)=6jn2 using the Kronecker 6in1 Fom this2 we can construct (os) such that (V451VAnn)=dn9 (1.29) Substituting.1125=into.1128=then proves the desired equality.fr firther details2 see chapter 6 and Aronszajn.1050 Wahba.1a73 Girosi.1a08 Scholkopf.1007-1 Besides.1121-2SV practicioners use sigmoid kernels k.xy==tanh.k.x·y=+日= (1.30) 6r suitable values o fgain K and threshold cf chapter 7-2andradial basis finction kernels2as fr inst ance.Aizerman et al n1064;Boser et al 21002;Scholkopfet al n2 1a07b= kxy==ep-x-y2a.22_× (1.31) .(0,1),9.-6
Feature Spaces and Kernels eigenvalues j kx y X NF j jj xj y where NF is the number of positive eigenvalues Note that originally proven for the case where C a b ab R this theorem also holds true for general compact spaces Dunford and Schwartz An equivalent way to characterize Mercer kernels is that they give rise to positive matrices Kij kxi xj for all fx xg Saitoh One of the implications that need to be proven to show this equivalence follows from the fact that Kij is a Gram matrix for R we have K kP i ixik From it is straightforward to construct a map into a potentially in nite dimensional l space which satis es For instance we may use xp x p x Rather than thinking of the feature space as an l space we can alternatively represent it as the Hilbert space Hk containing all linear combinations of the functions f kxi xi C To ensure that the map C Hk which in this case is de ned as x kx satis es we need to endow Hk with a suitable dot product h i In view of the de nition of this dot product needs to satisfy hkx ky i kx y Reproducing which amounts to saying that k is a reproducing kernel for Hk For a Mercer kernel Kernel such a dot product does exist Since k is symmetric the i i NF can be chosen to be orthogonal with respect to the dot product in LC ie j nL C jn using the Kronecker jn From this we can construct h i such that hp jj p nni jn Substituting into then proves the desired equality for further details see chapter and Aronsza jn Wahba Girosi Scholkopf Besides SV practicioners use sigmoid kernels kx y tanhx y ! for suitable values of gain and threshold ! cf chapter and radial basis function kernels as for instance Aizerman et al Boser et al Scholkopf et al b kx y exp kx yk
3 Introduction to Support Vector Learning input space feature space O ◆ O Figure.The idea of SV machines:map the training data nonlinearly into a higher/dimensional eature space via/zand construct a separating hyperplane with maximum margin there This yields a nonlinear decision boundary in input spacer By the use of a kernel fnction.1120-2 it is possible to compute the separating hyperplane without explicitly carrying out the map into the eature spacer with o >OI Note that when using Gaussian kernels2 fr inst ancezthe fature space H thus contains all superpositions ofGaussians on C.plus limit points-2 whereas by definition of.1127-2only single bumps k.x,.=do have pre/images under 1.4 Support Vector Machines To construct SV machines2 one computes an optimal hyperplane in eature spacer To this end2 we substitute .x;=6r each training example xi The weight vector cf.1114-then becomes an expansion in eature space2 and will thus typically no more correspond to the image ofa single vector fom input space.ci Scholkopf et al1.1aa8c=fr a frmula how to compute the pre/image ifit exists=Since all patterns only occur in dot products2 one can substitute Mercer kernels k or the dot products Boser et al2 1002;Guyon et al121a03-2leading to decision finctions Decision ofthe more general 6rm.cf.1118- Function f.x==sgn ∑y4x.重.x=r重.x=+b =1 sgn yi4i Tk.x,xi=+b (1.32) and the 6llowing quadratic program cf.1116 m axim ize (1.33) subject to 4>=1…6m空4=0 (1.34) .(0,1),9-6
Introduction to Support Vector Learning input space feature space Φ ◆ ◆ ◆ ◆ ❍ ❍ ❍ ❍ ❍ ❍ Figure The idea of SV machines map the training data nonlinearly into a higherdimensional feature space via and construct a separating hyperplane with maximum margin there This yields a nonlinear decision boundary in input space By the use of a kernel function it is possible to compute the separating hyperplane without explicitly carrying out the map into the feature space with Note that when using Gaussian kernels for instance the feature space Hk thus contains all superpositions of Gaussians on C plus limit points whereas by de nition of only single bumps kx do have preimages under Support Vector Machines To construct SV machines one computes an optimal hyperplane in feature space To this end we substitute xi for each training example xi The weight vector cf then becomes an expansion in feature space and will thus typically no more correspond to the image of a single vector from input space cf Scholkopf et al c for a formula how to compute the preimage if it exists Since all patterns only occur in dot products one can substitute Mercer kernels k for the dot products Boser et al Guyon et al leading to decision functions Decision of the more general form cf Function f x sgn X i yii x xi b sgn X i yii kx xi b and the following quadratic program cf maximize W X i i X ij ij yiyjkxi xj sub ject to i i and X i iyi
9 Support Vector Machines 9 0 0 0 0 @ 0 Figure.(Example of a Support Vector dlassifier found by using a radial basis function kernel kx.y.1 expi-x-y4..Both coordinate axes range from-1 to +1. Circles and disks are two classes of training examples;the middle line is the decision surface;the outer lines precisely meet the constr aint(1.10).Note that the Support Vectors found by the algorithm(marked by extra circles)are not centers of dlusters, but examples which are critical for the given classification task.Grey values code the modulus of the argument-bof the decision function (1.32). (From Scholkopf et al.(1996a),see also Burges (1998).) In pr actice,a separating hy perplane may not exist,e.g.if a high noise level causes a Soft Margin large overlap of the classes.To allow for the possibility of examples violating (1.10), Hy perplane one introduces slack variables (Cortes and Vapnik,1995;Vapnik,1995) i≥0,i=1,,0, 1135) along with relaxed constr aints yi(w·xi)+b)≥1-i,i=1,,0. 136) A classifier which generalizes well is then found by controlling both the classifier capacity (via w)and the number of training errors,minimizing the objective function T(w,1)= 2w2+c∑i 137) subject to the constr aints (1.35)and(1.36),for some value of the constant C>0 determining the trade-off.Here and below,we use boldface greek letters as a shorthand for corresponding vectors 1 =(61,...,1.Incorporating kernels,and .//6,63
Support Vector Machines Figure Example of a Support Vector classi er found by using a radial basis function kernel kx y expkxyk Both coordinate axes range from to Circles and disks are two classes of training examples the middle line is the decision surface the outer lines precisely meet the constraint Note that the Support Vectors found by the algorithm marked by extra circles are not centers of clusters but examples which are critical for the given classi cation task Grey values code the modulus of the argument P i yii kx xi b of the decision function From Scholkopf et al a see also Burges In practice a separating hyperplane may not exist eg if a high noise level causes a Soft Margin large overlap of the classes To allow for the possibility of examples violating Hyperplane one introduces slack variables Cortes and Vapnik Vapnik i i along with relaxed constraints yi w xi b i i A classi er which generalizes well is then found by controlling both the classi er capacity via kwk and the number of training errors minimizing the ob jective function w kwk CX i i sub ject to the constraints and for some value of the constant C determining the tradeo Here and below we use boldface greek letters as a shorthand for corresponding vectors Incorporating kernels and