Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines

John C. Platt
Microsoft Research
jplatt@microsoft.com

Technical Report MSR-TR-98-14
April 21, 1998
© 1998 John Platt

ABSTRACT

This paper proposes a new algorithm for training support vector machines: Sequential Minimal Optimization, or SMO. Training a support vector machine requires the solution of a very large quadratic programming (QP) optimization problem. SMO breaks this large QP problem into a series of smallest possible QP problems. These small QP problems are solved analytically, which avoids using a time-consuming numerical QP optimization as an inner loop. The amount of memory required for SMO is linear in the training set size, which allows SMO to handle very large training sets. Because matrix computation is avoided, SMO scales somewhere between linear and quadratic in the training set size for various test problems, while the standard chunking SVM algorithm scales somewhere between linear and cubic in the training set size. SMO's computation time is dominated by SVM evaluation, hence SMO is fastest for linear SVMs and sparse data sets. On real-world sparse data sets, SMO can be more than 1000 times faster than the chunking algorithm.

1. INTRODUCTION

In the last few years, there has been a surge of interest in Support Vector Machines (SVMs) [19] [20] [4]. SVMs have empirically been shown to give good generalization performance on a wide variety of problems such as handwritten character recognition [12], face detection [15], pedestrian detection [14], and text categorization [9].

However, the use of SVMs is still limited to a small group of researchers. One possible reason is that training algorithms for SVMs are slow, especially for large problems. Another explanation is that SVM training algorithms are complex, subtle, and difficult for an average engineer to implement.
This paper describes a new SVM learning algorithm that is conceptually simple, easy to implement, generally faster, and has better scaling properties for difficult SVM problems than the standard SVM training algorithm. The new SVM learning algorithm is called Sequential Minimal Optimization (or SMO). Instead of previous SVM learning algorithms that use numerical quadratic programming (QP) as an inner loop, SMO uses an analytic QP step. This paper first provides an overview of SVMs and a review of current SVM training algorithms. The SMO algorithm is then presented in detail, including the solution to the analytic QP step,
heuristics for choosing which variables to optimize in the inner loop, a description of how to set the threshold of the SVM, some optimizations for special cases, the pseudo-code of the algorithm, and the relationship of SMO to other algorithms.

SMO has been tested on two real-world data sets and two artificial data sets. This paper presents the results for timing SMO versus the standard "chunking" algorithm for these data sets and presents conclusions based on these timings. Finally, there is an appendix that describes the derivation of the analytic optimization.

1.1 Overview of Support Vector Machines

Vladimir Vapnik invented Support Vector Machines in 1979 [19]. In its simplest, linear form, an SVM is a hyperplane that separates a set of positive examples from a set of negative examples with maximum margin (see figure 1). In the linear case, the margin is defined by the distance of the hyperplane to the nearest of the positive and negative examples. The formula for the output of a linear SVM is

    u = \vec{w} \cdot \vec{x} - b,    (1)

where w is the normal vector to the hyperplane and x is the input vector. The separating hyperplane is the plane u = 0. The nearest points lie on the planes u = ±1. The margin m is thus

    m = \frac{1}{\|\vec{w}\|_2}.    (2)

Maximizing margin can be expressed via the following optimization problem [4]:

    \min_{\vec{w},b} \frac{1}{2}\|\vec{w}\|^2 \quad \text{subject to} \quad y_i(\vec{w} \cdot \vec{x}_i - b) \ge 1, \; \forall i,    (3)

Figure 1. A linear Support Vector Machine. (Figure: positive and negative examples in the space of possible inputs, separated by a hyperplane that maximizes the distances to the nearest points.)
where x_i is the ith training example, and y_i is the correct output of the SVM for the ith training example. The value y_i is +1 for the positive examples in a class and −1 for the negative examples.

Using a Lagrangian, this optimization problem can be converted into a dual form which is a QP problem where the objective function Ψ is solely dependent on a set of Lagrange multipliers α_i,

    \min_{\vec\alpha} \Psi(\vec\alpha) = \min_{\vec\alpha} \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} y_i y_j (\vec{x}_i \cdot \vec{x}_j)\,\alpha_i \alpha_j - \sum_{i=1}^{N} \alpha_i    (4)

(where N is the number of training examples), subject to the inequality constraints,

    \alpha_i \ge 0, \quad \forall i,    (5)

and one linear equality constraint,

    \sum_{i=1}^{N} y_i \alpha_i = 0.    (6)

There is a one-to-one relationship between each Lagrange multiplier and each training example. Once the Lagrange multipliers are determined, the normal vector w and the threshold b can be derived from the Lagrange multipliers:

    \vec{w} = \sum_{i=1}^{N} y_i \alpha_i \vec{x}_i, \qquad b = \vec{w} \cdot \vec{x}_k - y_k \;\text{ for some } \alpha_k > 0.    (7)

Because w can be computed via equation (7) from the training data before use, the amount of computation required to evaluate a linear SVM is constant in the number of non-zero support vectors.

Of course, not all data sets are linearly separable. There may be no hyperplane that splits the positive examples from the negative examples. In the formulation above, the non-separable case would correspond to an infinite solution. However, in 1995, Cortes & Vapnik [7] suggested a modification to the original optimization statement (3) which allows, but penalizes, the failure of an example to reach the correct margin. That modification is:

    \min_{\vec{w},b,\vec\xi} \frac{1}{2}\|\vec{w}\|^2 + C\sum_{i=1}^{N}\xi_i \quad \text{subject to} \quad y_i(\vec{w}\cdot\vec{x}_i - b) \ge 1 - \xi_i, \; \forall i,    (8)

where ξ_i are slack variables that permit margin failure and C is a parameter which trades off wide margin with a small number of margin failures.
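To make equations (1) and (7) concrete, here is a minimal sketch that recovers w and b from a hypothetical set of already-solved Lagrange multipliers and then evaluates the linear SVM. The toy data and the multiplier values are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical 2-D toy problem: one support vector per class, with
# multipliers alpha assumed already found by a QP solver.
X = np.array([[2.0, 2.0],   # positive support vector
              [0.0, 0.0]])  # negative support vector
y = np.array([1.0, -1.0])
alpha = np.array([0.25, 0.25])  # assumed dual solution

# Equation (7): w = sum_i y_i alpha_i x_i
w = (y * alpha) @ X

# Equation (7): b = w . x_k - y_k for any example with alpha_k > 0
k = int(np.argmax(alpha > 0))
b = w @ X[k] - y[k]

# Equation (1): output of the linear SVM
def svm_output(x):
    return w @ x - b
```

For this toy pair, both support vectors land exactly on the margins u = ±1, consistent with the KKT conditions discussed later.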
When this new optimization problem is transformed into the dual form, it simply changes the constraint (5) into a box constraint:

    0 \le \alpha_i \le C, \quad \forall i.    (9)

The variables ξ_i do not appear in the dual formulation at all.

SVMs can be even further generalized to non-linear classifiers [2]. The output of a non-linear SVM is explicitly computed from the Lagrange multipliers:
    u = \sum_{j=1}^{N} y_j \alpha_j K(\vec{x}_j, \vec{x}) - b,    (10)

where K is a kernel function that measures the similarity or distance between the input vector x and the stored training vector x_j. Examples of K include Gaussians, polynomials, and neural network non-linearities [4]. If K is linear, then the equation for the linear SVM (1) is recovered.

The Lagrange multipliers α_i are still computed via a quadratic program. The non-linearities alter the quadratic form, but the dual objective function Ψ is still quadratic in α:

    \min_{\vec\alpha} \Psi(\vec\alpha) = \min_{\vec\alpha} \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} y_i y_j K(\vec{x}_i, \vec{x}_j)\,\alpha_i \alpha_j - \sum_{i=1}^{N}\alpha_i,
    0 \le \alpha_i \le C, \quad \forall i,    (11)
    \sum_{i=1}^{N} y_i \alpha_i = 0.

The QP problem in equation (11), above, is the QP problem that the SMO algorithm will solve. In order to make the QP problem above be positive definite, the kernel function K must obey Mercer's conditions [4].

The Karush-Kuhn-Tucker (KKT) conditions are necessary and sufficient conditions for an optimal point of a positive definite QP problem. The KKT conditions for the QP problem (11) are particularly simple. The QP problem is solved when, for all i:

    \alpha_i = 0 \iff y_i u_i \ge 1,
    0 < \alpha_i < C \iff y_i u_i = 1,    (12)
    \alpha_i = C \iff y_i u_i \le 1,

where u_i is the output of the SVM for the ith training example. Notice that the KKT conditions can be evaluated on one example at a time, which will be useful in the construction of the SMO algorithm.

1.2 Previous Methods for Training Support Vector Machines

Due to its immense size, the QP problem (11) that arises from SVMs cannot be easily solved via standard QP techniques. The quadratic form in (11) involves a matrix that has a number of elements equal to the square of the number of training examples. This matrix cannot be fit into 128 Megabytes if there are more than 4000 training examples.
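The 128 Megabyte figure above is easy to check: a dense double-precision kernel matrix takes 8N² bytes, so N = 4096 fills 128 MB exactly. A quick sketch:

```python
# Size in MB of a dense N x N kernel matrix at 8 bytes per entry.
def kernel_matrix_megabytes(n, bytes_per_entry=8):
    return n * n * bytes_per_entry / 2**20

# At n = 4096 the matrix alone is exactly 128 MB, so somewhat more
# than 4000 examples exhausts a 128 MB machine, as the text observes.
```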
Vapnik [19] describes a method to solve the SVM QP, which has since been known as "chunking." The chunking algorithm uses the fact that the value of the quadratic form is the same if you remove the rows and columns of the matrix that correspond to zero Lagrange multipliers. Therefore, the large QP problem can be broken down into a series of smaller QP problems, whose ultimate goal is to identify all of the non-zero Lagrange multipliers and discard all of the zero Lagrange multipliers. At every step, chunking solves a QP problem that consists of the following examples: every non-zero Lagrange multiplier from the last step, and the M worst examples that violate the KKT conditions (12) [4], for some value of M (see figure 2). If there are fewer than M examples that violate the KKT conditions at a step, all of the violating examples are added in. Each QP sub-problem is initialized with the results of the previous sub-problem. At the last step
the entire set of non-zero Lagrange multipliers has been identified; hence the last step solves the large QP problem.

Chunking seriously reduces the size of the matrix from the number of training examples squared to approximately the number of non-zero Lagrange multipliers squared. However, chunking still cannot handle large-scale training problems, since even this reduced matrix cannot fit into memory.

In 1997, Osuna, et al. [16] proved a theorem which suggests a whole new set of QP algorithms for SVMs. The theorem proves that the large QP problem can be broken down into a series of smaller QP sub-problems. As long as at least one example that violates the KKT conditions is added to the examples for the previous sub-problem, each step will reduce the overall objective function and maintain a feasible point that obeys all of the constraints. Therefore, a sequence of QP sub-problems that always add at least one violator will be guaranteed to converge. Notice that the chunking algorithm obeys the conditions of the theorem, and hence will converge.

Osuna, et al. suggest keeping a constant size matrix for every QP sub-problem, which implies adding and deleting the same number of examples at every step [16] (see figure 2). Using a constant-size matrix will allow the training on arbitrarily sized data sets. The algorithm given in Osuna's paper [16] suggests adding one example and subtracting one example every step. Clearly this would be inefficient, because it would use an entire numerical QP optimization step to cause one training example to obey the KKT conditions. In practice, researchers add and subtract multiple examples according to unpublished heuristics [17]. In any event, a numerical QP solver is required for all of these methods. Numerical QP is notoriously tricky to get right; there are many numerical precision issues that need to be addressed.

Figure 2. Three alternative methods for training SVMs: Chunking, Osuna's algorithm, and SMO.
For each method, three steps are illustrated. The horizontal thin line at every step represents the training set, while the thick boxes represent the Lagrange multipliers being optimized at that step. For chunking, a fixed number of examples are added every step, while the zero Lagrange multipliers are discarded at every step. Thus, the number of examples trained per step tends to grow. For Osuna's algorithm, a fixed number of examples are optimized every step: the same number of examples is added to and discarded from the problem at every step. For SMO, only two examples are analytically optimized at every step, so that each step is very fast.
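The chunking working-set rule described above (all non-zero multipliers, plus the M worst KKT violators) can be sketched as follows. The function name and the per-example violation score are assumptions for illustration; a real implementation would derive the scores from the KKT conditions (12).

```python
import numpy as np

def select_chunk(alpha, kkt_violation, M):
    """Illustrative chunking working set: indices of all non-zero
    Lagrange multipliers, plus up to M of the worst KKT violators
    among the remaining examples.  `kkt_violation` is an assumed
    per-example violation score, with 0 meaning the KKT conditions
    hold for that example."""
    nonzero = set(np.flatnonzero(alpha > 0))
    # Remaining violators, worst first; fewer than M may exist.
    violators = [i for i in np.argsort(-kkt_violation)
                 if kkt_violation[i] > 0 and i not in nonzero]
    return sorted(nonzero | set(violators[:M]))
```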
2. SEQUENTIAL MINIMAL OPTIMIZATION

Sequential Minimal Optimization (SMO) is a simple algorithm that can quickly solve the SVM QP problem without any extra matrix storage and without using numerical QP optimization steps at all. SMO decomposes the overall QP problem into QP sub-problems, using Osuna's theorem to ensure convergence.

Unlike the previous methods, SMO chooses to solve the smallest possible optimization problem at every step. For the standard SVM QP problem, the smallest possible optimization problem involves two Lagrange multipliers, because the Lagrange multipliers must obey a linear equality constraint. At every step, SMO chooses two Lagrange multipliers to jointly optimize, finds the optimal values for these multipliers, and updates the SVM to reflect the new optimal values (see figure 2).

The advantage of SMO lies in the fact that solving for two Lagrange multipliers can be done analytically. Thus, numerical QP optimization is avoided entirely. The inner loop of the algorithm can be expressed in a short amount of C code, rather than invoking an entire QP library routine. Even though more optimization sub-problems are solved in the course of the algorithm, each sub-problem is so fast that the overall QP problem is solved quickly.

In addition, SMO requires no extra matrix storage at all. Thus, very large SVM training problems can fit inside of the memory of an ordinary personal computer or workstation. Because no matrix algorithms are used in SMO, it is less susceptible to numerical precision problems.

There are two components to SMO: an analytic method for solving for the two Lagrange multipliers, and a heuristic for choosing which multipliers to optimize.

Figure 3. The two Lagrange multipliers must fulfill all of the constraints of the full problem. The inequality constraints cause the Lagrange multipliers to lie in the box. The linear equality constraint causes them to lie on a diagonal line.
Therefore, one step of SMO must find an optimum of the objective function on a diagonal line segment. (Figure: the box 0 ≤ α_1, α_2 ≤ C, with the constraint line α_1 − α_2 = k when y_1 ≠ y_2, and α_1 + α_2 = k when y_1 = y_2.)
2.1 Solving for Two Lagrange Multipliers

In order to solve for the two Lagrange multipliers, SMO first computes the constraints on these multipliers and then solves for the constrained minimum. For convenience, all quantities that refer to the first multiplier will have a subscript 1, while all quantities that refer to the second multiplier will have a subscript 2. Because there are only two multipliers, the constraints can easily be displayed in two dimensions (see figure 3). The bound constraints (9) cause the Lagrange multipliers to lie within a box, while the linear equality constraint (6) causes the Lagrange multipliers to lie on a diagonal line. Thus, the constrained minimum of the objective function must lie on a diagonal line segment (as shown in figure 3). This constraint explains why two is the minimum number of Lagrange multipliers that can be optimized: if SMO optimized only one multiplier, it could not fulfill the linear equality constraint at every step.

The ends of the diagonal line segment can be expressed quite simply. Without loss of generality, the algorithm first computes the second Lagrange multiplier α_2 and computes the ends of the diagonal line segment in terms of α_2. If the target y_1 does not equal the target y_2, then the following bounds apply to α_2:

    L = \max(0, \alpha_2 - \alpha_1), \qquad H = \min(C, C + \alpha_2 - \alpha_1).    (13)

If the target y_1 equals the target y_2, then the following bounds apply to α_2:

    L = \max(0, \alpha_2 + \alpha_1 - C), \qquad H = \min(C, \alpha_2 + \alpha_1).    (14)

The second derivative of the objective function along the diagonal line can be expressed as:

    \eta = K(\vec{x}_1, \vec{x}_1) + K(\vec{x}_2, \vec{x}_2) - 2K(\vec{x}_1, \vec{x}_2).    (15)

Under normal circumstances, the objective function will be positive definite, there will be a minimum along the direction of the linear equality constraint, and η will be greater than zero.
In this case, SMO computes the minimum along the direction of the constraint:

    \alpha_2^{\mathrm{new}} = \alpha_2 + \frac{y_2 (E_1 - E_2)}{\eta},    (16)

where E_i = u_i − y_i is the error on the ith training example. As a next step, the constrained minimum is found by clipping the unconstrained minimum to the ends of the line segment:

    \alpha_2^{\mathrm{new,clipped}} = \begin{cases} H & \text{if } \alpha_2^{\mathrm{new}} \ge H; \\ \alpha_2^{\mathrm{new}} & \text{if } L < \alpha_2^{\mathrm{new}} < H; \\ L & \text{if } \alpha_2^{\mathrm{new}} \le L. \end{cases}    (17)

Now, let s = y_1 y_2. The value of α_1 is computed from the new, clipped, α_2:

    \alpha_1^{\mathrm{new}} = \alpha_1 + s(\alpha_2 - \alpha_2^{\mathrm{new,clipped}}).    (18)

Under unusual circumstances, η will not be positive. A negative η will occur if the kernel K does not obey Mercer's condition, which can cause the objective function to become indefinite. A zero η can occur even with a correct kernel, if more than one training example has the same input
vector x. In any event, SMO will work even when η is not positive, in which case the objective function Ψ should be evaluated at each end of the line segment:

    f_1 = y_1(E_1 + b) - \alpha_1 K(\vec{x}_1, \vec{x}_1) - s\alpha_2 K(\vec{x}_1, \vec{x}_2),
    f_2 = y_2(E_2 + b) - s\alpha_1 K(\vec{x}_1, \vec{x}_2) - \alpha_2 K(\vec{x}_2, \vec{x}_2),
    L_1 = \alpha_1 + s(\alpha_2 - L),    (19)
    H_1 = \alpha_1 + s(\alpha_2 - H),
    \Psi_L = L_1 f_1 + L f_2 + \tfrac{1}{2} L_1^2 K(\vec{x}_1, \vec{x}_1) + \tfrac{1}{2} L^2 K(\vec{x}_2, \vec{x}_2) + s L L_1 K(\vec{x}_1, \vec{x}_2),
    \Psi_H = H_1 f_1 + H f_2 + \tfrac{1}{2} H_1^2 K(\vec{x}_1, \vec{x}_1) + \tfrac{1}{2} H^2 K(\vec{x}_2, \vec{x}_2) + s H H_1 K(\vec{x}_1, \vec{x}_2).

SMO will move the Lagrange multipliers to the end point that has the lowest value of the objective function. If the objective function is the same at both ends (within a small ε for round-off error) and the kernel obeys Mercer's conditions, then the joint minimization cannot make progress. That scenario is described below.

2.2 Heuristics for Choosing Which Multipliers To Optimize

As long as SMO always optimizes and alters two Lagrange multipliers at every step and at least one of the Lagrange multipliers violated the KKT conditions before the step, then each step will decrease the objective function according to Osuna's theorem [16]. Convergence is thus guaranteed. In order to speed convergence, SMO uses heuristics to choose which two Lagrange multipliers to jointly optimize.
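The analytic step of Section 2.1 can be sketched as a single function, assuming the usual case η > 0. The argument names (kij for K(x_i, x_j), Ei for the errors u_i − y_i) are illustrative conventions, not names from the paper.

```python
def smo_pair_step(alpha1, alpha2, y1, y2, E1, E2, k11, k12, k22, C):
    """One analytic SMO step for a pair of multipliers, following
    the bounds, update, and clipping rules of Section 2.1, under
    the assumption eta > 0."""
    if y1 != y2:                        # bounds when targets differ
        L, H = max(0.0, alpha2 - alpha1), min(C, C + alpha2 - alpha1)
    else:                               # bounds when targets agree
        L, H = max(0.0, alpha2 + alpha1 - C), min(C, alpha2 + alpha1)
    eta = k11 + k22 - 2.0 * k12         # second derivative along the line
    a2 = alpha2 + y2 * (E1 - E2) / eta  # unconstrained minimum
    a2 = min(max(a2, L), H)             # clip to the feasible segment
    s = y1 * y2
    a1 = alpha1 + s * (alpha2 - a2)     # keep the equality constraint
    return a1, a2
```

Note that the update of α_1 preserves y_1 α_1 + y_2 α_2, so the linear equality constraint (6) remains satisfied after every step.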
There are two separate choice heuristics: one for the first Lagrange multiplier and one for the second. The first choice heuristic provides the outer loop of the SMO algorithm. The outer loop first iterates over the entire training set, determining whether each example violates the KKT conditions (12). If an example violates the KKT conditions, it is then eligible for optimization. After one pass through the entire training set, the outer loop iterates over all examples whose Lagrange multipliers are neither 0 nor C (the non-bound examples). Again, each example is checked against the KKT conditions and violating examples are eligible for optimization. The outer loop makes repeated passes over the non-bound examples until all of the non-bound examples obey the KKT conditions within ε. The outer loop then goes back and iterates over the entire training set. The outer loop keeps alternating between single passes over the entire training set and multiple passes over the non-bound subset until the entire training set obeys the KKT conditions within ε, whereupon the algorithm terminates.

The first choice heuristic concentrates the CPU time on the examples that are most likely to violate the KKT conditions: the non-bound subset. As the SMO algorithm progresses, examples that are at the bounds are likely to stay at the bounds, while examples that are not at the bounds will move as other examples are optimized. The SMO algorithm will thus iterate over the non-bound subset until that subset is self-consistent; then SMO will scan the entire data set to search for any bound examples that have come to violate the KKT conditions as a result of optimizing the non-bound subset.
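The alternation between full-set passes and non-bound passes described above can be sketched as a loop skeleton; `examine_example` and `is_non_bound` are hypothetical stand-ins for the per-example optimization attempt and the non-bound test:

```python
def outer_loop(n_examples, examine_example, is_non_bound):
    """Skeleton of the SMO outer loop: alternate single passes over the
    whole set with repeated passes over the non-bound subset, stopping
    when a full pass changes nothing. Returns the number of passes."""
    num_changed = 0
    examine_all = True
    passes = 0
    while num_changed > 0 or examine_all:
        num_changed = 0
        if examine_all:
            for i in range(n_examples):          # single pass over everything
                num_changed += examine_example(i)
        else:
            for i in range(n_examples):          # pass over the non-bound subset
                if is_non_bound(i):
                    num_changed += examine_example(i)
        if examine_all:
            examine_all = False                  # switch to non-bound passes
        elif num_changed == 0:
            examine_all = True                   # subset self-consistent: re-check all
        passes += 1
    return passes
```

Termination occurs when a full pass finds no KKT violator, i.e. the entire training set obeys the KKT conditions within ε.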
Notice that the KKT conditions are checked to within ε of fulfillment. Typically, ε is set to 10⁻³. Recognition systems typically do not need to have the KKT conditions fulfilled to high accuracy: it is acceptable for examples on the positive margin to have outputs between 0.999 and 1.001. The SMO algorithm (and other SVM algorithms) will not converge as quickly if it is required to produce very high accuracy output.
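The tolerance-based KKT check can be sketched as follows; this is an illustrative helper (hypothetical name and signature), using the fact that for y in {−1, +1} the margin residual y·u − 1 equals y·E where E = u − y:

```python
def violates_kkt(alpha, y, u, C, eps=1e-3):
    """Check whether an example violates the KKT conditions within eps.
    u is the SVM output on the example; E = u - y is its error."""
    r = y * (u - y)            # equals y*u - 1 for y in {-1, +1}
    # alpha < C requires y*u >= 1 (within eps); alpha > 0 requires y*u <= 1.
    return (r < -eps and alpha < C) or (r > eps and alpha > 0)
```

An example on the positive margin with output 1.0005 is thus treated as satisfying the conditions, consistent with the looseness described above.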
Once a first Lagrange multiplier is chosen, SMO chooses the second Lagrange multiplier to maximize the size of the step taken during joint optimization. Now, evaluating the kernel function K is time consuming, so SMO approximates the step size by the absolute value of the numerator in equation (16): |E1 − E2|. SMO keeps a cached error value E for every non-bound example in the training set and then chooses an error to approximately maximize the step size. If E1 is positive, SMO chooses an example with minimum error E2. If E1 is negative, SMO chooses an example with maximum error E2.

Under unusual circumstances, SMO cannot make positive progress using the second choice heuristic described above. For example, positive progress cannot be made if the first and second training examples share identical input vectors x, which causes the objective function to become semi-definite. In this case, SMO uses a hierarchy of second choice heuristics until it finds a pair of Lagrange multipliers that can make positive progress. Positive progress is indicated by a non-zero step size upon joint optimization of the two Lagrange multipliers. The hierarchy of second choice heuristics consists of the following. If the above heuristic does not make positive progress, then SMO starts iterating through the non-bound examples, searching for a second example that can make positive progress. If none of the non-bound examples makes positive progress, then SMO starts iterating through the entire training set until an example is found that makes positive progress. Both the iteration through the non-bound examples and the iteration through the entire training set are started at random locations, in order not to bias SMO towards the examples at the beginning of the training set. In extremely degenerate circumstances, none of the examples will make an adequate second example. When this happens, the first example is skipped and SMO continues with another chosen first example.
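A simplified sketch of the second-choice heuristic follows. For clarity it maximizes |E1 − E2| directly over the cached errors, rather than using the paper's equivalent shortcut of picking the minimum or maximum E2 depending on the sign of E1; the function name and the `error_cache` mapping (non-bound example index to cached error) are hypothetical:

```python
def choose_second(i1, E1, error_cache):
    """Pick the non-bound example whose cached error maximizes |E1 - E2|."""
    best_i2, best_gap = None, -1.0
    for i2, E2 in error_cache.items():
        if i2 == i1:
            continue                  # never pair an example with itself
        gap = abs(E1 - E2)            # proxy for the step size in equation (16)
        if gap > best_gap:
            best_gap, best_i2 = gap, i2
    return best_i2
```

If the chosen pair makes no progress, the hierarchy described above falls back to scanning the non-bound subset and then the whole training set from random starting positions.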
2.3 Computing the Threshold

The threshold b is re-computed after each step, so that the KKT conditions are fulfilled for both optimized examples. The following threshold b1 is valid when the new α1 is not at the bounds, because it forces the output of the SVM to be y1 when the input is x1:

    b1 = E1 + y1(α1_new − α1)K(x1, x1) + y2(α2_new,clipped − α2)K(x1, x2) + b.    (20)

The following threshold b2 is valid when the new α2 is not at the bounds, because it forces the output of the SVM to be y2 when the input is x2:

    b2 = E2 + y1(α1_new − α1)K(x1, x2) + y2(α2_new,clipped − α2)K(x2, x2) + b.    (21)

When both b1 and b2 are valid, they are equal. When both new Lagrange multipliers are at bound and L is not equal to H, then all thresholds in the interval between b1 and b2 are consistent with the KKT conditions. SMO chooses the threshold to be halfway between b1 and b2.

2.4 An Optimization for Linear SVMs

To compute a linear SVM, only a single weight vector w needs to be stored, rather than all of the training examples that correspond to non-zero Lagrange multipliers. If the joint optimization succeeds, the stored weight vector needs to be updated to reflect the new Lagrange multiplier values. The weight vector update is easy, due to the linearity of the SVM:

    w_new = w + y1(α1_new − α1)x1 + y2(α2_new,clipped − α2)x2.    (22)

2.5 Code Details

The pseudo-code below describes the entire SMO algorithm:
target = desired output vector
point = training point matrix

procedure takeStep(i1,i2)
  if (i1 == i2) return 0
  alph1 = Lagrange multiplier for i1
  y1 = target[i1]
  E1 = SVM output on point[i1] - y1 (check in error cache)
  s = y1*y2
  Compute L, H via equations (13) and (14)
  if (L == H)
    return 0
  k11 = kernel(point[i1],point[i1])
  k12 = kernel(point[i1],point[i2])
  k22 = kernel(point[i2],point[i2])
  eta = k11+k22-2*k12
  if (eta > 0)
  {
    a2 = alph2 + y2*(E1-E2)/eta
    if (a2 < L) a2 = L
    else if (a2 > H) a2 = H
  }
  else
  {
    Lobj = objective function at a2=L
    Hobj = objective function at a2=H
    if (Lobj < Hobj-eps)
      a2 = L
    else if (Lobj > Hobj+eps)
      a2 = H
    else
      a2 = alph2
  }
  if (|a2-alph2| < eps*(a2+alph2+eps))
    return 0
  a1 = alph1+s*(alph2-a2)
  Update threshold to reflect change in Lagrange multipliers
  Update weight vector to reflect change in a1 & a2, if SVM is linear
  Update error cache using new Lagrange multipliers
  Store a1 in the alpha array
  Store a2 in the alpha array
  return 1
endprocedure

procedure examineExample(i2)
  y2 = target[i2]
  alph2 = Lagrange multiplier for i2
  E2 = SVM output on point[i2] - y2 (check in error cache)
  r2 = E2*y2
  if ((r2 < -tol && alph2 < C) || (r2 > tol && alph2 > 0))
  {
    if (number of non-zero & non-C alpha > 1)
    {
      i1 = result of second choice heuristic (section 2.2)
      if takeStep(i1,i2)
        return 1
    }