TRENDS & CONTROVERSIES

Support vector machines

By Marti A. Hearst
University of California, Berkeley
hearst@sims.berkeley.edu

My first exposure to Support Vector Machines came this spring when I heard Sue Dumais present impressive results on text categorization using this analysis technique. This issue's collection of essays should help familiarize our readers with this interesting new racehorse in the Machine Learning stable. Bernhard Schölkopf, in an introductory overview, points out that a particular advantage of SVMs over other learning algorithms is that they can be analyzed theoretically using concepts from computational learning theory, and at the same time can achieve good performance when applied to real problems. Examples of these real-world applications are provided by Sue Dumais, who describes the aforementioned text-categorization problem, yielding the best results to date on the Reuters collection, and Edgar Osuna, who presents strong results on application to face detection. Our fourth author, John Platt, gives us a practical guide and a new technique for implementing the algorithm efficiently.
–Marti Hearst
SVMs—a practical consequence of learning theory
Bernhard Schölkopf, GMD First

Is there anything worthwhile to learn about the new SVM algorithm, or does it fall into the category of "yet-another-algorithm," in which case readers should stop here and save their time for something more useful? In this short overview, I will try to argue that studying support-vector learning is very useful in two respects. First, it is quite satisfying from a theoretical point of view: SV learning is based on some beautifully simple ideas and provides a clear intuition of what learning from examples is about. Second, it can lead to high performance in practical applications.

In the following sense, the SV algorithm can be considered as lying at the intersection of learning theory and practice: for certain simple types of algorithms, statistical learning theory can identify rather precisely the factors that need to be taken into account to learn successfully. Real-world applications, however, often mandate the use of more complex models and algorithms—such as neural networks—that are much harder to analyze theoretically. The SV algorithm achieves both. It constructs models that are complex enough: it contains a large class of neural nets, radial basis function (RBF) nets, and polynomial classifiers as special cases. Yet it is simple enough to be analyzed mathematically, because it can be shown to correspond to a linear method in a high-dimensional feature space nonlinearly related to input space. Moreover, even though we can think of it as a linear algorithm in a high-dimensional space, in practice, it does not involve any computations in that high-dimensional space. By the use of kernels, all necessary computations are performed directly in input space. This is the characteristic twist of SV methods—we are dealing with complex algorithms for nonlinear pattern recognition,1 regression,2 or feature extraction,3 but for the sake of analysis and algorithmics, we can pretend that we are working with a simple linear algorithm.

I will explain the gist of SV methods by describing their roots in learning theory, the optimal hyperplane algorithm, the kernel trick, and SV function estimation. For details and further references, see Vladimir Vapnik's authoritative treatment,2 the collection my colleagues and I have put together,4 and the SV Web page at http://svm.first.gmd.de.

Learning pattern recognition from examples
For pattern recognition, we try to estimate a function f: RN → {±1} using training data—that is, N-dimensional patterns xi and class labels yi,

(x1, y1), …, (xℓ, yℓ) ∈ RN × {±1},   (1)

such that f will correctly classify new examples (x, y)—that is, f(x) = y for examples (x, y), which were generated from the same underlying probability distribution P(x, y) as the training data. If we put no restriction on the class of functions that we choose our estimate f from, however, even a function that does well on the training data—for example, by satisfying f(xi) = yi (here and below, the index i is understood to run over 1, …, ℓ)—need not generalize well to unseen examples. Suppose we know nothing additional about f (for example, about its smoothness). Then the values on the training patterns carry no information whatsoever about values on novel patterns. Hence learning is impossible, and minimizing the training error does not imply a small expected test error.
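To make the last point concrete, here is a small illustrative sketch of my own (not from the essay): a classifier that simply memorizes randomly labeled training data satisfies f(xi) = yi exactly, yet performs at chance level on new patterns. The data, the memorizing rule, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# N-dimensional patterns with labels in {-1, +1} that carry no structure at all.
N, n_train, n_test = 10, 100, 1000
X_train = rng.normal(size=(n_train, N))
y_train = rng.choice([-1, 1], size=n_train)
X_test = rng.normal(size=(n_test, N))
y_test = rng.choice([-1, 1], size=n_test)

def memorizer(x):
    """Return the stored label if x is a training pattern, otherwise guess +1."""
    for xi, yi in zip(X_train, y_train):
        if np.allclose(x, xi):
            return yi
    return 1

train_err = np.mean([memorizer(x) != y for x, y in zip(X_train, y_train)])
test_err = np.mean([memorizer(x) != y for x, y in zip(X_test, y_test)])
print(f"training error: {train_err:.2f}")            # 0.00 -- satisfies f(x_i) = y_i
print(f"estimated expected test error: {test_err:.2f}")  # roughly 0.50 -- chance level
```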
Statistical learning theory,2 or VC (Vapnik-Chervonenkis) theory, shows that it is crucial to restrict the class of functions that the learning machine can implement to one with a capacity that is suitable for the amount of available training data.

Hyperplane classifiers
To design learning algorithms, we thus must come up with a class of functions whose capacity can be computed. SV classifiers are based on the class of hyperplanes

(w⋅x) + b = 0, w ∈ RN, b ∈ R,   (2)

corresponding to decision functions

f(x) = sign((w⋅x) + b).   (3)

We can show that the optimal hyperplane, defined as the one with the maximal margin of separation between the two classes (see Figure 1), has the lowest capacity.
It can be uniquely constructed by solving a constrained quadratic optimization problem whose solution w has an expansion

w = Σi vi xi

in terms of a subset of training patterns that lie on the margin (see Figure 1). These training patterns, called support vectors, carry all relevant information about the classification problem.

Omitting the details of the calculations, there is just one crucial property of the algorithm that we need to emphasize: both the quadratic programming problem and the final decision function

f(x) = sign(Σi vi (x⋅xi) + b)

depend only on dot products between patterns. This is precisely what lets us generalize to the nonlinear case.

Feature spaces and kernels
Figure 2 shows the basic idea of SV machines, which is to map the data into some other dot product space (called the feature space) F via a nonlinear map

Φ: RN → F,   (4)

and perform the above linear algorithm in F. As I've noted, this only requires the evaluation of dot products,

k(x, y) := (Φ(x)⋅Φ(y)).   (5)

Clearly, if F is high-dimensional, the right-hand side of Equation 5 will be very expensive to compute. In some cases, however, there is a simple kernel k that can be evaluated efficiently. For instance, the polynomial kernel

k(x, y) = (x⋅y)^d   (6)

can be shown to correspond to a map Φ into the space spanned by all products of exactly d dimensions of RN. For d = 2 and x, y ∈ R2, for example, we have

(x⋅y)² = (x1y1 + x2y2)² = (x1², √2 x1x2, x2²)⋅(y1², √2 y1y2, y2²) = (Φ(x)⋅Φ(y)),   (7)

defining Φ(x) = (x1², √2 x1x2, x2²). More generally, we can prove that for every kernel that gives rise to a positive matrix (k(xi, xj))ij, we can construct a map Φ such that Equation 5 holds.

Besides Equation 6, SV practitioners use radial basis function (RBF) kernels such as

k(x, y) = exp(−||x − y||² / (2σ²)),   (8)

and sigmoid kernels (with gain κ and offset Θ)

k(x, y) = tanh(κ(x⋅y) + Θ).   (9)

SVMs
We now have all the tools to construct nonlinear classifiers (see Figure 2). To this end, we substitute Φ(xi) for each training example xi, and perform the optimal hyperplane algorithm in F. Because we are using kernels, we will thus end up with a nonlinear decision function of the form

f(x) = sign(Σi=1…ℓ vi k(x, xi) + b).   (10)

The parameters vi are computed as the solution of a quadratic programming problem.

In input space, the hyperplane corresponds to a nonlinear decision function whose form is determined by the kernel (see Figures 3 and 4).

The algorithm I've described thus far has a number of astonishing properties:

• It is based on statistical learning theory,
• It is practical (as it reduces to a quadratic programming problem with a unique solution), and
• It contains a number of more or less heuristic algorithms as special cases: by the choice of different kernel functions, we obtain different architectures (Figure 4), such as polynomial classifiers (Equation 6), RBF classifiers (Equation 8 and Figure 3), and three-layer neural nets (Equation 9).

The most important restriction up to now has been that we were only considering the case of classification. However, a generalization to regression estimation—that is, to y ∈ R, can be given. In this case, the algorithm tries to construct a linear function in the feature space such that the training points lie within a distance ε > 0. Similar to the pattern-recognition case, we can write this as a quadratic programming problem in terms of kernels. The nonlinear regression estimate takes the form

f(x) = Σi=1…ℓ vi k(x, xi) + b.   (11)

To apply the algorithm, we either specify ε a priori, or we specify an upper bound on the fraction of training points allowed to lie outside of a distance ε from the regression estimate (asymptotically, the number of SVs) and the corresponding ε is computed automatically.5

Figure 1. A separable classification toy problem: separate balls from diamonds. The optimal hyperplane is orthogonal to the shortest line connecting the convex hulls of the two classes (dotted), and intersects it halfway. There is a weight vector w and a threshold b such that yi((w⋅xi) + b) > 0. Rescaling w and b such that the point(s) closest to the hyperplane satisfy |(w⋅xi) + b| = 1, we obtain a form (w, b) of the hyperplane with yi((w⋅xi) + b) ≥ 1. Note that the margin, measured perpendicularly to the hyperplane, equals 2/||w||. To maximize the margin, we thus have to minimize ||w|| subject to yi((w⋅xi) + b) ≥ 1.

Figure 2. The idea of SV machines: map the training data nonlinearly into a higher-dimensional feature space via Φ, and construct a separating hyperplane with maximum margin there. This yields a nonlinear decision boundary in input space. By the use of a kernel function, it is possible to compute the separating hyperplane without explicitly carrying out the map into the feature space.
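The polynomial-kernel identity in Equation 7 is easy to check numerically. The short sketch below is my own illustration, not code from the article; it compares (x⋅y)² against the explicit feature map Φ(x) = (x1², √2 x1x2, x2²) for random two-dimensional points, and the two numbers agree, which is exactly why the feature space never has to be visited.

```python
import numpy as np

rng = np.random.default_rng(1)

def poly_kernel(x, y, d=2):
    """Polynomial kernel of Equation 6: k(x, y) = (x . y)^d."""
    return np.dot(x, y) ** d

def phi(x):
    """Explicit feature map of Equation 7 for d = 2 and x in R^2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

for _ in range(3):
    x, y = rng.normal(size=2), rng.normal(size=2)
    lhs = poly_kernel(x, y)          # computed entirely in input space
    rhs = np.dot(phi(x), phi(y))     # computed in the feature space
    print(f"(x.y)^2 = {lhs:.6f}   phi(x).phi(y) = {rhs:.6f}")
    assert np.isclose(lhs, rhs)
```

The same observation underlies Equation 10: the decision function only ever needs the kernel values k(x, xi), never Φ(x) itself.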
Current developments and open issues
Chances are that those readers who are still with me might be interested to hear how researchers have built on the above, applied the algorithm to real-world problems, and developed extensions. In this respect, several fields have emerged.

• Training methods for speeding up the quadratic program, such as the one described later in this installment of Trends & Controversies by John Platt.
• Speeding up the evaluation of the decision function is of interest in a variety of applications, such as optical-character recognition.6
• The choice of kernel functions, and hence of the feature space to work in, is of both theoretical and practical interest. It determines both the functional form of the estimate and, via the objective function of the quadratic program, the type of regularization that is used to constrain the estimate.7,8 However, even though different kernels lead to different types of learning machines, the choice of kernel seems to be less crucial than it may appear at first sight. In OCR applications, the kernels (Equations 6, 9, and 8) lead to very similar performance and to strongly overlapping sets of support vectors.
• Although the use of SV methods in applications has only recently begun, application developers have already reported state-of-the-art performances in a variety of applications in pattern recognition, regression estimation, and time series prediction. However, it is probably fair to say that we are still missing an application where SV methods significantly outperform any other available algorithm or solve a problem that has so far been impossible to tackle. For the latter, SV methods for solving inverse problems are a promising candidate.9 Sue Dumais and Edgar Osuna describe promising applications in this discussion.
• Using kernels for other algorithms emerges as an exciting opportunity for developing new learning techniques. The kernel method for computing dot products in feature spaces is not restricted to SV machines. We can use it to derive nonlinear generalizations of any algorithm that can be cast in terms of dot products. As a mere start, we decided to apply this idea to one of the most widely used algorithms for data analysis, principal component analysis. This leads to kernel PCA,3 an algorithm that performs nonlinear PCA by carrying out linear PCA in feature space. The method consists of solving a linear eigenvalue problem for a matrix whose elements are computed using the kernel function. The resulting feature extractors have the same architecture as SV machines (see Figure 4). A number of researchers have since started to "kernelize" various other linear algorithms (a brief illustrative sketch follows the figure captions below).

Figure 3. Example of an SV classifier found by using a radial basis function kernel (Equation 8). Circles and disks are two classes of training examples; the solid line is the decision surface; the support vectors found by the algorithm lie on, or between, the dashed lines. Colors code the modulus of the argument Σi vi k(x, xi) + b of the decision function in Equation 10.

Figure 4. Architecture of SV methods. The input x and the support vectors xi (in this example: digits) are nonlinearly mapped (by Φ) into a feature space F, where dot products are computed.
By the use of the kernel k, these two layers are in practice computed in one single step. The results are linearly combined by weights νi, found by solving a quadratic program (in pattern recognition, νi = yi αi; in regression estimation, νi = αi* − αi)2 or an eigenvalue problem (in kernel PCA3). The linear combination is fed into the function σ (in pattern recognition, σ(x) = sign(x + b); in regression estimation, σ(x) = x + b; in kernel PCA, σ(x) = x).
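As a rough illustration of the kernel PCA idea mentioned in the last bullet above, the sketch below uses the modern scikit-learn library, which is an assumption of mine (the article predates it); internally, KernelPCA eigendecomposes the kernel matrix, in the spirit of the eigenvalue problem described here. The toy data set, the RBF kernel parameter, and the separation measure are all illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric circles: the classes are not linearly separable in input space.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Linear PCA works in input space; kernel PCA performs linear PCA in the
# feature space induced by an RBF kernel (Equation 8), via the kernel matrix.
linear_pca = PCA(n_components=2).fit_transform(X)
kernel_pca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# Gap between class means along the first component, in units of its spread:
# near zero for linear PCA (by symmetry), clearly larger for kernel PCA here.
for name, Z in [("linear PCA", linear_pca), ("kernel PCA", kernel_pca)]:
    gap = abs(Z[y == 0, 0].mean() - Z[y == 1, 0].mean()) / Z[:, 0].std()
    print(f"{name}: class separation along first component = {gap:.2f}")
```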
References
1. B.E. Boser, I.M. Guyon, and V.N. Vapnik, "A Training Algorithm for Optimal Margin Classifiers," Proc. Fifth Ann. Workshop Computational Learning Theory, ACM Press, New York, 1992, pp. 144–152.
2. V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
3. B. Schölkopf, A. Smola, and K.-R. Müller, "Nonlinear Component Analysis as a Kernel Eigenvalue Problem," Neural Computation, Vol. 10, 1998, pp. 1299–1319.
4. B. Schölkopf, C.J.C. Burges, and A.J. Smola, Advances in Kernel Methods—Support Vector Learning, to appear, MIT Press, Cambridge, Mass., 1998.
5. B. Schölkopf et al., "Support Vector Regression with Automatic Accuracy Control," to be published in Proc. Eighth Int'l Conf. Artificial Neural Networks, Perspectives in Neural Computing, Springer-Verlag, Berlin, 1998.
6. C.J.C. Burges, "Simplified Support Vector Decision Rules," Proc. 13th Int'l Conf. Machine Learning, Morgan Kaufmann, San Francisco, 1996, pp. 71–77.
7. A. Smola and B. Schölkopf, "From Regularization Operators to Support Vector Kernels," Advances in Neural Information Processing Systems 10, M. Jordan, M. Kearns, and S. Solla, eds., MIT Press, 1998.
8. F. Girosi, An Equivalence between Sparse Approximation and Support Vector Machines, AI Memo No. 1606, MIT, Cambridge, Mass., 1997.
9. J. Weston et al., Density Estimation Using Support Vector Machines, Tech. Report CSD-TR-97-23, Royal Holloway, Univ. of London, 1997.

Using SVMs for text categorization
Susan Dumais, Decision Theory and Adaptive Systems Group, Microsoft Research

As the volume of electronic information increases, there is growing interest in developing tools to help people better find, filter, and manage these resources. Text categorization—the assignment of natural-language texts to one or more predefined categories based on their content—is an important component in many information organization and management tasks. Machine-learning methods, including SVMs, have tremendous potential for helping people more effectively organize electronic resources.

Today, most text categorization is done by people. We all save hundreds of files, e-mail messages, and URLs in folders every day. We are often asked to choose keywords from an approved set of indexing terms for describing our technical publications. On a much larger scale, trained specialists assign new items to categories in large taxonomies such as the Dewey Decimal or Library of Congress subject headings, Medical Subject Headings (MeSH), or Yahoo!'s Internet directory. Between these two extremes, people organize objects into categories to support a wide variety of information-management tasks, including information routing/filtering/push, identification of objectionable materials or junk mail, structured search and browsing, and topic identification for topic-specific processing operations.

Human categorization is very time-consuming and costly, thus limiting its applicability—especially for large or rapidly changing collections. Consequently, interest is growing in developing technologies for (semi)automatic text categorization.
Rule-based approaches similar to those employed in expert systems have been used, but they generally require manual construction of the rules, make rigid binary decisions about category membership, and are typically difficult to modify. Another strategy is to use inductive-learning techniques to automatically construct classifiers using labeled training data. Researchers have applied a growing number of learning techniques to text categorization, including multivariate regression, nearest-neighbor classifiers, probabilistic Bayesian models, decision trees, and neural networks.1,2 Recently, my colleagues and I, as well as others, have used SVMs for text categorization with very promising results.3,4 In this essay, I briefly describe the results of experiments in which we use SVMs to classify newswire stories from Reuters.4

Susan T. Dumais is a senior researcher in the Decision Theory and Adaptive Systems Group at Microsoft Research. Her research interests include algorithms and interfaces for improved information retrieval and classification, human-computer interaction, combining search and navigation, user modeling, individual differences, collaborative filtering, and organizational impacts of new technology. She received a BA in mathematics and psychology from Bates College and a PhD in cognitive psychology from Indiana University. She is a member of the ACM, the ASIS, the Human Factors and Ergonomic Society, and the Psychonomic Society. Contact her at Microsoft Research, 1 Microsoft Way, Redmond, WA 98052; sdumais@microsoft.com; http://research.microsoft.com/~sdumais.

Edgar Osuna has just returned to his native Venezuela after receiving his PhD in operations research from the Massachusetts Institute of Technology. His research interests include the study of different aspects and properties of Vapnik's SVM. He received his BS in computer engineering from the Universidad Simon Bolivar, Caracas, Venezuela. Contact him at IESA, POBA International #646, PO Box 02-5255, Miami, FL 33102-5255; eosuna@usb.ve.

John Platt is a senior researcher in the Signal Processing Group at Microsoft Research. Recently, he has concentrated on fast general-purpose machine-learning algorithms for use in processing signals. More generally, his research interests include neural networks, machine learning, computer vision, and signal processing. He received his PhD in computer science at Caltech in 1989, where he studied computer graphics and neural networks. He received his BS in chemistry from California State University at Long Beach. Contact him at Microsoft Research, 1 Microsoft Way, Redmond, WA 98005; jplatt@microsoft.com; http://research.microsoft.com/~jplatt.

Bernhard Schölkopf is a researcher at GMD First, Berlin. His research interests are in machine learning and perception, in particular using SVMs and Kernel PCA. He has an MSc in mathematics from the University of London, and a Diploma in physics and a PhD in computer science, both from the Max Planck Institute. Contact him at GMD-First, Rm. 208, Rudower Chaussee 5, D-12489 Berlin; bs@first.gmd.de; http://www.first.gmd.de/~bs.
Learning text categorizers
The goal of automatic text-categorization systems is to assign new items to one or more of a set of predefined categories on the basis of their textual content. Optimal categorization functions can be learned from labeled training examples.

Inductive learning of classifiers. A classifier is a function that maps an input attribute vector, x = (x1, x2, x3, …, xn), to the confidence that the input belongs to a class—that is, f(x) = confidence(class). In the case of text classification, the attributes are words in the document and the classes are the categories of interest (for example, Reuters categories include "interest," "earnings," and "grain").

Example classifiers for the Reuters category interest are

• if (interest AND rate) OR (quarterly), then confidence("interest" category) = 0.9
• confidence("interest" category) = 0.3*interest + 0.4*rate + 0.7*quarterly

The key idea behind SVMs and other inductive-learning approaches is to use a training set of labeled instances to learn the classification function automatically. SVM classifiers resemble the second example above—a vector of learned feature weights. The resulting classifiers are easy to construct and update, depend only on information that is easy for people to provide (that is, examples of items that are in or out of categories), and allow users to smoothly trade off precision and recall depending on their task.

Text representation and feature selection. Each document is represented as a vector of words, as is typically done in information retrieval.5 For most text-retrieval applications, the entries in the vector are weighted to reflect the frequency of terms in documents and the distribution of terms across the collection as a whole. For text classification, simpler binary feature values (a word either occurs or does not occur in a document) are often used instead.

Text collections containing millions of unique terms are quite common. Thus, for both efficiency and efficacy, feature selection is widely used when applying machine-learning methods to text categorization. To reduce the number of features, we first remove features based on overall frequency counts, and then select a small number of features based on their fit to categories. We use the mutual information between each feature and a category to further reduce the feature space. These much smaller document descriptions then serve as input to the SVM.

Learning SVMs. We used simple linear SVMs because they provide good generalization accuracy and are fast to learn. Thorsten Joachims has explored two classes of nonlinear SVMs—polynomial classifiers and radial basis functions—and observed only small benefits compared to linear models.3 We used John Platt's Sequential Minimal Optimization method6 (described in a later essay) to learn the vector of feature weights, w. Once the weights are learned, new items are classified by computing x⋅w, where w is the vector of learned weights and x is the binary vector representing a new document. We also learned two parameters of a sigmoid function to transform the output of the SVM to probabilities.

An example—Reuters
The Reuters collection is a popular one for text-categorization research and is publicly available at http://www.research.att.com/~lewis/reuters21578.html. We used the 12,902 Reuters stories that have been classified into 118 categories.
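The two example classifiers for the "interest" category given above map directly onto code. The toy sketch below is mine and purely illustrative; the weights and the 0.9 confidence value come from the text, while the sample document and all identifiers are made up. It contrasts a rule-based classifier with a linear weighted sum over binary word features, which is the form a linear SVM classifier ultimately takes.

```python
# Binary word features: 1 if the word occurs in the document, 0 otherwise.
doc = {"interest": 1, "rate": 1, "quarterly": 0}

def rule_classifier(x):
    """Rule-based classifier: if (interest AND rate) OR (quarterly)."""
    if (x.get("interest") and x.get("rate")) or x.get("quarterly"):
        return 0.9   # confidence("interest" category)
    return 0.0

def linear_classifier(x):
    """Linear classifier: a weighted sum of word features, like a linear SVM."""
    weights = {"interest": 0.3, "rate": 0.4, "quarterly": 0.7}
    return sum(w * x.get(word, 0) for word, w in weights.items())

print("rule-based confidence:  ", rule_classifier(doc))    # 0.9
print("weighted-sum confidence:", linear_classifier(doc))  # 0.3 + 0.4 = 0.7
```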
Following the ModApte split, we used 75% of the stories (9,603 stories) to build classifiers and the remaining 25% (3,299 stories) to test the accuracy of the resulting models in reproducing the manual category assignments. Stories can be assigned to more than one category.

Text files are automatically processed to produce a vector of words for each document. Eliminating words that appear in only one document and then selecting the 300 words with highest mutual information with each category reduces the number of features. These 300-element binary feature vectors serve as input to the SVM. A separate classifier (w) is learned for each category.

Table 1. Break-even performance for five learning algorithms (%).

Category      FindSim   Naive Bayes   BayesNets   Trees   LinearSVM
earn          92.9      95.9          95.8        97.8    98.0
acq           64.7      87.8          88.3        89.7    93.6
money-fx      46.7      56.6          58.8        66.2    74.5
grain         67.5      78.8          81.4        85.0    94.6
crude         70.1      79.5          79.6        85.0    88.9
trade         65.1      63.9          69.0        72.5    75.9
interest      63.4      64.9          71.3        67.1    77.7
ship          49.2      85.4          84.4        74.2    85.6
wheat         68.9      69.7          82.7        92.5    91.8
corn          48.2      65.3          76.4        91.8    90.3
Avg. top 10   64.6      81.5          85.0        88.4    92.0
Avg. all      61.7      75.2          80.0        N/A     87.0

Figure 5. ROC curve for the category "grain": precision versus recall for the Find Similar, Naive Bayes, Decision Tree, and Linear SVM classifiers.
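To make the pipeline described above concrete (binary word features, mutual-information feature selection down to a few hundred words, and one linear SVM per category), here is a minimal sketch using the modern scikit-learn library. The library, the toy documents, and all identifiers are my own assumptions; the original experiments used the Reuters data and Platt's SMO trainer, not this code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for Reuters stories and a single binary category ("interest").
docs = ["interest rates rise again", "quarterly interest rate report",
        "wheat and corn harvest news", "grain exports fall this year"]
labels = [1, 1, 0, 0]   # 1 = in the "interest" category, 0 = not

pipeline = Pipeline([
    # Binary features: a word either occurs in a document or it does not.
    ("binary_words", CountVectorizer(binary=True)),
    # Keep the k words with the highest mutual information with the category
    # (k = 300 in the experiments; 5 here because the toy vocabulary is tiny).
    ("select", SelectKBest(mutual_info_classif, k=5)),
    # One linear SVM per (binary) category; its scores are later thresholded.
    ("svm", LinearSVC()),
])

pipeline.fit(docs, labels)
print(pipeline.decision_function(["interest rate decision due"]))  # score to compare with a threshold
```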
Using SMO to train the linear SVM takes an average of 0.26 CPU seconds per category (averaged over 118 categories) on a 266-MHz Pentium II running Windows NT. Other learning methods are 20 to 50 times slower. New instances are classified by computing a score for each document (x⋅w) and comparing the score with a learned threshold. New documents exceeding the threshold are said to belong to the category.

The learned SVM classifiers are intuitively reasonable. The weight vector for the category "interest" includes the words prime (.70), rate (.67), interest (.63), rates (.60), and discount (.46), with large positive weights, and the words group (–.24), year (–.25), sees (–.33), world (–.35), and dlrs (–.71), with large negative weights.

As is typical in evaluating text categorization, we measure classification accuracy using the average of precision and recall (the so-called breakeven point). Precision is the proportion of items placed in the category that are really in the category, and recall is the proportion of items in the category that are actually placed in the category. Table 1 summarizes microaveraged breakeven performance for five learning algorithms that my colleagues and I explored for the 10 most frequent categories, as well as the overall score for all 118 categories.4

Linear SVMs were the most accurate method, averaging 91.3% for the 10 most frequent categories and 85.5% over all 118 categories. These results are consistent with Joachims' results in spite of substantial differences in text preprocessing, term weighting, and parameter selection, suggesting that the SVM approach is quite robust and generally applicable for text-categorization problems.3

Figure 5 shows a representative ROC curve for the category "grain." We generate this curve by varying the decision threshold to produce higher precision or higher recall, depending on the task. The advantages of the SVM can be seen over the entire recall-precision space.

Summary
In summary, inductive learning methods offer great potential to support flexible, dynamic, and personalized information access and management in a wide variety of tasks.

References
1. D.D. Lewis and P. Hayes, special issue of ACM Trans. Information Systems, Vol. 12, No. 1, July 1994.
2. Y. Yang, "An Evaluation of Statistical Approaches to Text Categorization," to be published in J. Information Retrieval, 1998.
3. T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," to be published in Proc. 10th European Conf. Machine Learning (ECML), Springer-Verlag, 1998; http://www-ai.cs.uni-dortmund.de/PERSONAL/joachims.html/Joachims_97b.ps.gz.
4. S. Dumais et al., "Inductive Learning Algorithms and Representations for Text Categorization," to be published in Proc. Conf. Information and Knowledge Management, 1998; http://research.microsoft.com/~sdumais/cikm98.doc.
5. G. Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw Hill, New York, 1983.
6. J. Platt, "Fast Training of SVMs Using Sequential Minimal Optimization," to be published in Advances in Kernel Methods—Support Vector Machine Learning, B. Schölkopf, C. Burges, and A. Smola, eds., MIT Press, Cambridge, Mass., 1998.
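The precision/recall trade-off described above comes from sweeping a single threshold over the per-document SVM scores. The short sketch below is purely illustrative (the scores and category labels are made up, not Reuters results) and shows how varying the threshold trades precision against recall.

```python
import numpy as np

# Hypothetical SVM scores (x . w) and true category membership for eight test documents.
scores = np.array([2.1, 1.4, 0.9, 0.3, -0.1, -0.4, -1.2, -2.0])
in_category = np.array([1, 1, 0, 1, 0, 0, 0, 0])

def precision_recall(threshold):
    """Precision and recall when documents scoring >= threshold are assigned to the category."""
    predicted = scores >= threshold
    tp = np.sum(predicted & (in_category == 1))
    precision = tp / max(np.sum(predicted), 1)
    recall = tp / np.sum(in_category == 1)
    return precision, recall

# Lowering the threshold raises recall and (usually) lowers precision.
for t in [1.0, 0.5, 0.0, -0.5]:
    p, r = precision_recall(t)
    print(f"threshold {t:+.1f}: precision {p:.2f}, recall {r:.2f}")
```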
Applying SVMs to face detection
Edgar Osuna, MIT Center for Biological and Computational Learning and Operations Research Center

This essay introduces an SVM application for detecting vertically oriented and unoccluded frontal views of human faces in gray-level images. This application handles faces over a wide range of scales and works under different lighting conditions, even with moderately strong shadows.

We can define the face-detection problem as follows. Given as input an arbitrary image, which could be a digitized video signal or a scanned photograph, determine whether there are any human faces in the image, and if there are, return an encoding of their location. The encoding in this system is to fit each face in a bounding box defined by the image coordinates of the corners.

Face detection as a computer-vision task has many applications. It has direct relevance to the face-recognition problem, because the first important step of a fully automatic human face recognizer is usually identifying and locating faces in an unknown image. Face detection also has potential application in human-computer interfaces, surveillance systems, and census systems, for example. For this discussion, face detection is also interesting as an example of a natural and challenging problem for demonstrating and testing the potential of SVMs. Many other real-world object classes and phenomena share similar characteristics—for example, tumor anomalies in MRI scans and structural defects in manufactured parts. A successful and general methodology for finding faces using SVMs should generalize well to other spatially well-defined pattern- and feature-detection problems.

Face detection, like most object-detection problems, is a difficult task because of the significant pattern variations that are hard to parameterize analytically. Some common sources of pattern variation are facial appearance, expression, presence or absence of common structural features such as glasses or a moustache, and light-source distribution.

This system works by testing candidate image locations for local patterns that appear like faces, using a classification procedure that determines whether a given local image pattern is a face. Therefore, our approach comes at the face-detection problem as a classification problem given by examples of two classes: faces and nonfaces.

Previous systems
Researchers have approached the face-detection problem with different techniques in the last few years, including neural networks,1,2 detection of face features and use of geometrical constraints,3 density estimation of the training data,4 labeled graphs,5 and clustering and distribution-based modeling.6,7 The results of Kah-Kay Sung and
Tomaso Poggio6,7 and Henry Rowley2 reflect systems with very high detection rates and low false-positive detection rates.

Sung and Poggio use clustering and distance metrics to model the distribution of the face and nonface manifold, and a neural network to classify a new pattern given the measurements. The key to the quality of their result is the clustering and the use of combined Mahalanobis and Euclidean metrics to measure the distance from a new pattern to the clusters. Other important features of their approach are the use of nonface clusters and a bootstrapping technique to collect important nonface patterns. However, this approach does not provide a principled way to choose some important free parameters, such as the number of clusters it uses.

Similarly, Rowley and his colleagues have used problem information in the design of a retinally connected neural network trained to classify face and nonface patterns. Their approach relies on training several neural networks emphasizing subsets of the training data to obtain different sets of weights. Then, their approach uses different schemes of arbitration between the networks to reach a final answer.

Our SVM approach to the face-detection system uses no prior information to obtain the decision surface, an interesting property that can be exploited in using the same approach for detecting other objects in digital images.

The SVM face-detection system
This system detects faces by exhaustively scanning an image for face-like patterns at many possible scales, by dividing the original image into overlapping subimages and classifying them using an SVM to determine the appropriate class—face or nonface. The system handles multiple scales by examining windows taken from scaled versions of the original image.

Clearly, the major use of SVMs is in the classification step, which is the most critical part of this work. Figure 6 gives a geometrical interpretation of the way SVMs work in the context of face detection.

More specifically, this system works as follows. We train on a database of face and nonface 19×19 pixel patterns, assigned to classes +1 and –1, respectively, using the support vector algorithm. This process uses a second-degree homogeneous polynomial kernel function and an upper bound C = 200 to obtain a perfect training error.

To compensate for certain sources of image variation, we perform some preprocessing of the data:

• Masking: A binary pixel mask removes some pixels close to the window-pattern boundary, allowing a reduction in the dimensionality of the input space from 19 × 19 = 361 to 283. This step reduces background patterns that introduce unnecessary noise into the training process.
• Illumination gradient correction: The process subtracts a best-fit brightness plane from the unmasked window pixel values, allowing reduction of light and heavy shadows.
• Histogram equalization: Our process performs a histogram equalization over the patterns to compensate for differences in illumination brightness, different cameras' response curves, and so on.

Figure 6. Geometrical interpretation of how the SVM separates the face and nonface classes. The patterns are real support vectors obtained after training the system. Notice the small number of total support vectors and the fact that a higher proportion of them correspond to nonfaces.

Figure 7. False detections obtained with the first version of the system. These false positives later served as nonface examples in the training process.
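The following Python sketch illustrates, under several simplifying assumptions, the preprocessing and training setup just described. It is not the original implementation: scikit-learn's SVC stands in for the authors' own SVM code, the border mask is a made-up stand-in for the exact 283-pixel mask, the histogram-equalization step is omitted, and the training windows are random arrays rather than real face and nonface patterns.

# Rough sketch of the preprocessing and training setup described above.
import numpy as np
from sklearn.svm import SVC

WINDOW = 19

def border_mask(width=2):
    """Hypothetical binary mask dropping a border of each 19x19 window
    (the real system keeps 283 of the 361 pixels)."""
    m = np.zeros((WINDOW, WINDOW), dtype=bool)
    m[width:-width, width:-width] = True
    return m

def correct_illumination(window, mask):
    """Subtract a best-fit brightness plane z = a*x + b*y + c, estimated
    by least squares over the unmasked pixels."""
    ys, xs = np.nonzero(mask)
    A = np.column_stack([xs, ys, np.ones_like(xs)])
    coeffs, *_ = np.linalg.lstsq(A, window[ys, xs], rcond=None)
    yy, xx = np.mgrid[0:WINDOW, 0:WINDOW]
    plane = coeffs[0] * xx + coeffs[1] * yy + coeffs[2]
    return window - plane

def preprocess(window, mask):
    corrected = correct_illumination(window.astype(float), mask)
    return corrected[mask]            # keep only the unmasked pixels as features

# Second-degree homogeneous polynomial kernel (coef0 = 0, gamma = 1 assumed)
# with upper bound C = 200, matching the parameters quoted in the text.
clf = SVC(kernel="poly", degree=2, coef0=0.0, gamma=1.0, C=200.0)

# Toy training data standing in for the face (+1) / nonface (-1) windows.
rng = np.random.default_rng(0)
mask = border_mask()
windows = rng.random((50, WINDOW, WINDOW))
labels = rng.choice([-1, 1], size=50)
X = np.array([preprocess(w, mask) for w in windows])
clf.fit(X, labels)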
Once the process obtains a decision surface through training, it uses the runtime system over images that do not contain faces, storing misclassifications so that they can be used as negative examples in subsequent training phases. Images of landscapes, trees, buildings, and rocks, for example, are good sources of false positives because of the many different textured patterns they contain. This bootstrapping step, which Sung and Poggio6 successfully used, is very important in the context of a face detector that learns from examples:

• Although negative examples are abundant, negative examples that are useful from a learning standpoint are very difficult to characterize and define.
• By approaching the problem of object detection, and in this case of face detection, using the paradigm of binary pattern classification, the two classes—object and nonobject—are not equally complex. The nonobject class is broader and richer, and therefore needs more examples to get an accurate definition that separates it from the object class.

Figure 7 shows an image used for bootstrapping with some misclassifications that later served as negative examples.

After training the SVM, using an implementation of the algorithm my colleagues and I describe elsewhere,8 we incorporate it as the classifier in a runtime system very similar to the one used by Sung and Poggio.5,6 It performs the following operations:

• Rescale the input image several times;
• Cut 19×19 window patterns out of the scaled image;
• Preprocess the window using masking, light correction, and histogram equalization;
• Classify the pattern using the SVM; and
• If the class corresponds to a face, draw a rectangle around the face in the output image.

Figure 8 reflects the system's architecture at runtime.

Experimental results on static images
To test the runtime system, we used two sets of images. Set A contained 313 high-quality images with the same number of faces. Set B contained 23 images of mixed quality, with a total of 155 faces. We tested both sets, first using our system and then the one by Sung and Poggio.5,6 To give true meaning to the number of false positives obtained, note that set A involved 4,669,960 pattern windows, while set B involved 5,383,682. Table 2 compares the two systems.

Figure 9 presents some output images of our system; these images were not used during the training phase.
Figure 8. System architecture at runtime. (Used with permission.2) The stages shown are: input image pyramid with subsampling; extracted 19×19-pixel window; preprocessing (light correction and histogram equalization); an SVM quick-discard stage (possible face/nonface); and, if a possible face remains, the complete SVM face/nonface classifier.

Figure 9. Results from our face-detection system.

Table 2. Performance of the SVM face-detection system.

                 TEST SET A                    TEST SET B
        DETECT RATE (%)  FALSE ALARMS  DETECT RATE (%)  FALSE ALARMS
SVM          97.1             4             74.2            20
Sung         94.6             2             74.2            11
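Before turning to the real-time extension, here is a rough Python sketch of the multiscale scanning loop described above. It is not the original code: scipy.ndimage.zoom stands in for whatever image-pyramid routine the system used, the preprocess callable is assumed to map a 19×19 window to a feature vector (for example, the function in the earlier sketch), and clf is assumed to be a trained classifier whose predict method returns +1 for faces.

# Rough sketch of the multiscale scanning loop described above.
import numpy as np
from scipy.ndimage import zoom

WINDOW = 19

def detect_faces(image, clf, preprocess, scales=(1.0, 0.8, 0.64, 0.5), step=2):
    """Scan `image` (2-D gray-level array) at several scales; return bounding
    boxes (x, y, size) in original-image coordinates for windows the SVM
    classifies as faces."""
    boxes = []
    for s in scales:
        scaled = zoom(image, s)                      # rescale the input image
        h, w = scaled.shape
        for y in range(0, h - WINDOW, step):
            for x in range(0, w - WINDOW, step):
                window = scaled[y:y + WINDOW, x:x + WINDOW]
                features = preprocess(window)        # masking, light correction, ...
                if clf.predict(features[None, :])[0] == 1:
                    # Map the window back to original-image coordinates.
                    boxes.append((int(x / s), int(y / s), int(WINDOW / s)))
    return boxes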
Extension to a real-time system
The system I've discussed so far spends approximately 6 seconds (on a SparcStation 20) on a 320×240-pixel gray-level image. Although this is faster than most previous systems, it is not fast enough for use as a runtime system. To build a runtime version of the system, we took the following steps:

• We ported the C code developed in the Sun environment to a Windows NT Pentium 200-MHz computer and added a Matrox RGB frame grabber and a Hitachi three-chip color camera. We used no special hardware to speed up the computational burden of the system.
• We collected several color images with faces, from which we extracted areas with skin and nonskin pixels. We collected a dataset of 6,000 examples.
• We trained an SVM classifier using the skin and nonskin data. The input variables were normalized green and red values—g/(r+g+b) and r/(r+g+b), respectively. Figure 10 presents an image captured by the system and its corresponding skin-detection output.
• We coded a very primitive motion detector based on thresholded frame differencing to identify areas of movement and use them as the focus of attention. Motion was not required for detection, because every so many frames (20 in the current implementation) we skipped this step and scanned the whole image.
• We put together a hierarchical system with the motion-detection module as the first step. We used the SVM skin-detection system as a second layer to identify candidate locations of faces. We then applied the face/nonface SVM classifier described earlier to the gray-level version of the candidate locations.

The whole system achieves rates of 4 to 5 frames per second. Figure 11 presents a couple of images captured by our PC-based Color Real-Time face-detection system.

References
1. G. Burel and D. Carel, “Detection and Localization of Faces on Digital Images,” Pattern Recognition Letters, Vol. 15, 1994, pp. 963–967.
2. H. Rowley, S. Baluja, and T. Kanade, Human Face Detection in Visual Scenes, Tech. Report 95-158, Computer Science Dept., Carnegie Mellon Univ., Pittsburgh, 1995.
3. G. Yang and T. Huang, “Human Face Detection in a Complex Background,” Pattern Recognition, Vol. 27, 1994, pp. 53–63.
4. B. Moghaddam and A. Pentland, Probabilistic Visual Learning for Object Detection, Tech. Report 326, MIT Media Laboratory, Cambridge, Mass., 1995.
5. N. Krüger, M. Pötzsch, and C. v.d. Malsburg, Determination of Face Position and Pose with Learned Representation Based on Labeled Graphs, Tech. Report 96-03, Ruhr-Universität, 1996.
6. K. Sung, Learning and Example Selection for Object and Pattern Detection, PhD thesis, MIT AI Lab and Center for Biological and Computational Learning, 1995.
7. K. Sung and T. Poggio, Example-Based Learning for View-Based Human Face Detection, A.I. Memo 1521, C.B.C.L. Paper 112, Dec. 1994.
8. E. Osuna, R. Freund, and F. Girosi, “An Improved Training Algorithm for Support Vector Machines,” Proc. IEEE Workshop on Neural Networks and Signal Processing, IEEE Press, Piscataway, N.J., 1997.

How to implement SVMs
John Platt, Microsoft Research

In the past few years, SVMs have proven to be very effective in real-world classification tasks.1 This installment of Trends & Controversies describes two of these tasks: face detection and text categorization. However, many people have found the numerical implementation of SVMs to be intimidating. In this essay, I will attempt to demystify the implementation of SVMs. As a first step, if you are interested in implementing an SVM, I recommend reading Chris Burges' tutorial on SVMs,2 available at http://svm.research.bell-labs.com/SVMdoc.html.

An SVM is a parameterized function whose functional form is defined before training. Training an SVM requires a labeled training set, because the SVM will fit the function from a set of examples. The training set consists of N examples. Each example consists of an input vector, xi, and a label, yi, which describes whether the input vector is in a predefined category. There are N free parameters in an SVM trained with N examples. These parameters are called αi.
To find these parameters, you must solve a quadratic programming (QP) problem:

    minimize    \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i Q_{ij} \alpha_j - \sum_{i=1}^{N} \alpha_i
    subject to  0 \le \alpha_i \le C  and  \sum_{i=1}^{N} y_i \alpha_i = 0,

where Q is an N×N matrix that depends on the training inputs xi, the labels yi, and the functional form of the SVM. We call this problem quadratic programming because the function to be minimized (called the objective function) depends on the αi quadratically, while αi only appears linearly in the constraints (see http://www-c.mcs.anl.gov/home/otc/Guide/OptWeb/continuous/constrained/qprog). Definitions and applications of xi, yi, αi, and Q appear in the tutorial by Burges.2

Conceptually, the SVM QP problem is to find a minimum of a bowl-shaped objective function. The search for the minimum is constrained to lie within a cube and on a plane. The search occurs in a high-dimensional space, so that the bowl is high dimensional, the cube is a hypercube, and the plane is a hyperplane.

Figure 10. An example of the skin-detection module implemented using SVMs.

Figure 11. Face detection on the PC-based Color Real-Time system.
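For intuition only, the following Python sketch builds the matrix Q for a tiny synthetic problem and hands the dual QP above to a generic constrained optimizer (SciPy's SLSQP, which is my choice here, not something the essay prescribes). It is exactly the brute-force route that the rest of this essay argues breaks down for large training sets, because it materializes the full N×N matrix.

# Illustrative only: solve the SVM dual QP for a tiny training set with a
# general-purpose constrained optimizer. Not practical for large N.
import numpy as np
from scipy.optimize import minimize

def linear_kernel(X):
    return X @ X.T

def solve_svm_dual(X, y, C=1.0):
    """Minimize 0.5*a^T Q a - sum(a) s.t. 0 <= a_i <= C and sum_i y_i a_i = 0,
    where Q_ij = y_i y_j K(x_i, x_j)."""
    n = len(y)
    Q = (y[:, None] * y[None, :]) * linear_kernel(X)
    objective = lambda a: 0.5 * a @ Q @ a - a.sum()
    gradient = lambda a: Q @ a - np.ones(n)
    result = minimize(
        objective,
        x0=np.zeros(n),
        jac=gradient,
        bounds=[(0.0, C)] * n,
        constraints=[{"type": "eq", "fun": lambda a: a @ y}],
        method="SLSQP",
    )
    return result.x   # the alphas; most end up (numerically) at zero

# Toy usage: two well-separated clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (10, 2)), rng.normal(2, 1, (10, 2))])
y = np.array([-1.0] * 10 + [1.0] * 10)
alphas = solve_svm_dual(X, y, C=10.0)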
For most typical SVM functional forms, the matrix Q has special properties, so that the objective function is either bowl-shaped (positive definite) or has flat-bottomed troughs (positive semidefinite), but is never saddle-shaped (indefinite). Thus, there is either a unique minimum or a connected set of equivalent minima. An SVM QP has a definite termination (or optimality) condition that describes these minima. We call these optimality conditions the Karush-Kuhn-Tucker (KKT) conditions, and they simply describe the set of αi that are constrained minima.3

The values of αi also have an intuitive explanation. There is one αi for each training example. Each αi determines how much that training example influences the SVM function. Most of the training examples do not affect the SVM function, so most of the αi are 0.

Because of its simple form, you might expect the solution to the SVM QP problem to be quite simple. Unfortunately, for real-world problems, the matrix Q can be enormous: it has a dimension equal to the number of training examples. A training set of 60,000 examples will yield a Q matrix with 3.6 billion elements, which cannot easily fit into the memory of a standard computer.

We have at least two different ways of solving such gigantic QP problems. First, there are QP methods that use sophisticated data structures.4 These QP methods do not require the storage of the entire Q matrix, because they do not need to access the rows or columns of Q that correspond to those αi that are at 0 or at C. Deep in the inner loop, these methods only perform dot products between rows or columns of Q and a vector, rather than performing an entire matrix-vector multiplication.

Decomposing the QP problem
The other method for attacking the large-scale SVM QP problem is to decompose the large QP problem into a series of smaller QP problems. Thus, the selection of submatrices of Q happens outside of the QP package, rather than inside. Consequently, the decomposition method is compatible with standard QP packages.

Vapnik first suggested the decomposition approach in a method that has since been known as chunking.1 The chunking algorithm exploits the fact that the value of the objective function is the same if you remove the rows and columns of the matrix Q that correspond to zero αi. Therefore, the large QP problem can break down into a series of smaller QP problems, whose ultimate goal is to identify all of the nonzero αi and discard all of the zero αi. At every step, chunking solves a QP problem that consists of the following αi: every nonzero αi from the last step, and the αi that correspond to the M worst violations of the KKT conditions, for some value of M (see Figure 12a). The size of the QP subproblem tends to grow with time. At the last step, the chunking approach has identified the entire set of nonzero αi; hence, the last step solves the overall QP problem.

Chunking reduces the Q matrix's dimension from the number of training examples to approximately the number of nonzero αi. However, chunking still might not handle large-scale training problems, because even this reduced matrix might not fit into memory. Of course, we can combine chunking with the sophisticated QP methods described above, which do not require full storage of a matrix.
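The following Python sketch illustrates only the bookkeeping side of chunking as described above: selecting the working set consisting of every currently nonzero αi plus the M worst KKT violators. The violation measure and tolerances are my own illustrative choices, and the inner QP solve over the selected rows and columns of Q is deliberately left out, since the essay notes it requires a numerical QP package.

# Sketch of the working-set bookkeeping in chunking. The inner QP solve on the
# selected rows/columns of Q (or, for toy problems, the SLSQP sketch above)
# is not shown here.
import numpy as np

def kkt_violations(alpha, margins, C, tol=1e-6):
    """Per-example KKT violation for the SVM dual.
    `margins[i]` is y_i * f(x_i) for the current alphas and bias."""
    alpha = np.asarray(alpha, dtype=float)
    margins = np.asarray(margins, dtype=float)
    v = np.zeros(len(alpha))
    at_zero = alpha <= tol
    at_C = alpha >= C - tol
    interior = ~(at_zero | at_C)
    v[at_zero] = np.maximum(0.0, 1.0 - margins[at_zero])    # should have margin >= 1
    v[interior] = np.abs(1.0 - margins[interior])           # should sit exactly on the margin
    v[at_C] = np.maximum(0.0, margins[at_C] - 1.0)          # should have margin <= 1
    return v

def chunking_working_set(alpha, margins, C, M=10, tol=1e-6):
    """Indices for the next chunk: all currently nonzero alphas plus the
    M worst KKT violators among the remaining examples."""
    alpha = np.asarray(alpha, dtype=float)
    nonzero = np.where(alpha > tol)[0]
    violations = kkt_violations(alpha, margins, C, tol)
    violations[nonzero] = -np.inf                  # already in the chunk
    worst = np.argsort(violations)[::-1][:M]
    worst = worst[violations[worst] > tol]         # keep only genuine violators
    return np.concatenate([nonzero, worst])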
In 1997, Edgar Osuna and his colleagues suggested a new strategy for solving the SVM QP problem.5 Osuna showed that the large QP problem can be broken down into a series of smaller QP subproblems. As long as at least one αi that violates the KKT conditions is added to the previous subproblem, each step reduces the objective function and maintains all of the constraints. Therefore, a sequence of QP subproblems that always adds at least one KKT violator will asymptotically converge.

Osuna suggests keeping a constant-size matrix for every QP subproblem, which implies adding and deleting the same number of examples at every step5 (see Figure 12b). Using a constant-size matrix allows the training of arbitrarily sized datasets. The algorithm in Osuna's paper suggests adding one example and deleting one example at every step. Such an algorithm converges, although it might not be the fastest possible algorithm. In practice, researchers add and delete multiple examples according to various unpublished heuristics. Typically, these heuristics add KKT violators at each step and delete those αi that are either 0 or C. Joachims has published an algorithm for adding and deleting examples from the QP steps, which rapidly decreases the objective function.6

All of these decomposition methods require a numerical QP package. Such packages might be expensive for commercial users (see the “Where to get the programs” section). Writing your own efficient QP package is difficult without a numerical-analysis background.

Sequential minimal optimization
Sequential minimal optimization (SMO) is an alternative method that can decompose the SVM QP problem without any extra matrix storage and without using numerical QP optimization steps.3,7 SMO decomposes the overall QP problem into QP subproblems, identically to Osuna's method. Unlike the previous decomposition heuristics, SMO chooses to solve the smallest possible optimization problem at every step. For the standard SVM QP problem, the smallest possible optimization problem involves two elements of αi, because the αi must obey one linear equality constraint. At every step, SMO chooses two αi to jointly optimize, finds the optimal values for these αi, and updates the SVM to reflect the new optimal values (see Figure 12c).

Figure 12. Three alternative methods for training SVMs: (a) chunking, (b) Osuna's algorithm, and (c) SMO. There are three steps for each method. The horizontal thin line at every step represents the training set, while the thick boxes represent the αi being optimized at that step. A given group of three lines corresponds to three training iterations, with the first iteration at the top.
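To make the two-variable step concrete, here is a Python sketch of the analytic SMO pair update as it is commonly presented; it is not Platt's published pseudocode. The heuristics for choosing which pair (i, j) to optimize, and the outer loop that repeats this step until the KKT conditions hold, are omitted; the kernel matrix K, the bias b, and the tolerance eps are assumed inputs.

# Minimal sketch of the two-variable analytic step at the heart of SMO.
import numpy as np

def smo_pair_step(i, j, alpha, y, K, b, C, eps=1e-8):
    """Jointly optimize alpha[i] and alpha[j] while keeping sum_k y_k alpha_k
    fixed and 0 <= alpha <= C. Modifies alpha in place and returns (alpha, b).
    K is the full kernel matrix; b is the current bias."""
    # Prediction errors E_k = f(x_k) - y_k for the current model.
    f = (alpha * y) @ K + b
    E_i, E_j = f[i] - y[i], f[j] - y[j]

    # Box constraints for alpha[j] implied by the equality constraint.
    if y[i] != y[j]:
        L = max(0.0, alpha[j] - alpha[i])
        H = min(C, C + alpha[j] - alpha[i])
    else:
        L = max(0.0, alpha[i] + alpha[j] - C)
        H = min(C, alpha[i] + alpha[j])
    if H - L < eps:
        return alpha, b                      # no room to move

    eta = 2.0 * K[i, j] - K[i, i] - K[j, j]  # curvature along the feasible line
    if eta >= 0:
        return alpha, b                      # degenerate direction; skip this pair

    old_i, old_j = alpha[i], alpha[j]
    alpha[j] = np.clip(old_j - y[j] * (E_i - E_j) / eta, L, H)
    alpha[i] = old_i + y[i] * y[j] * (old_j - alpha[j])

    # Keep the bias consistent with the KKT conditions for the updated pair.
    b1 = b - E_i - y[i] * (alpha[i] - old_i) * K[i, i] - y[j] * (alpha[j] - old_j) * K[i, j]
    b2 = b - E_j - y[i] * (alpha[i] - old_i) * K[i, j] - y[j] * (alpha[j] - old_j) * K[j, j]
    if 0 < alpha[i] < C:
        b = b1
    elif 0 < alpha[j] < C:
        b = b2
    else:
        b = 0.5 * (b1 + b2)
    return alpha, b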