Artificial Intelligence 215 (2014) 55–78
www.elsevier.com/locate/artint

Multicategory large margin classification methods: Hinge losses vs. coherence functions

Zhihua Zhang a,*, Cheng Chen a, Guang Dai a, Wu-Jun Li b, Dit-Yan Yeung c

a Key Laboratory of Shanghai Education Commission for Intelligence Interaction & Cognition Engineering, Department of Computer Science and Engineering, Shanghai Jiao Tong University, 800 Dong Chuan Road, Shanghai, 200240, China
b National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, Nanjing, 210023, China
c Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China

* Corresponding author.
E-mail addresses: zhzhang@gmail.com (Z. Zhang), jackchen1990@gmail.com (C. Chen), guang.gdai@gmail.com (G. Dai), liwujun@nju.edu.cn (W.-J. Li), dyyeung@cse.ust.hk (D.-Y. Yeung).
http://dx.doi.org/10.1016/j.artint.2014.06.002
0004-3702/© 2014 Elsevier B.V. All rights reserved.

Article history: Received 3 August 2013; Received in revised form 9 May 2014; Accepted 16 June 2014; Available online 20 June 2014.

Keywords: Multiclass margin classification; Fisher consistency; Multicategory hinge losses; Coherence losses; Multicategory boosting algorithm.

Abstract. Generalization of large margin classification methods from the binary classification setting to the more general multicategory setting is often found to be non-trivial. In this paper, we study large margin classification methods that can be seamlessly applied to both settings, with the binary setting simply as a special case. In particular, we explore the Fisher consistency properties of multicategory majorization losses and present a construction framework of majorization losses of the 0–1 loss. Under this framework, we conduct an in-depth analysis of three widely used multicategory hinge losses. Corresponding to the three hinge losses, we propose three multicategory majorization losses based on a coherence function. The limits of the three coherence losses as the temperature approaches zero are the corresponding hinge losses, and the limits of the minimizers of their expected errors are the minimizers of the expected errors of the corresponding hinge losses. Finally, we develop multicategory large margin classification methods by using a so-called multiclass C-loss.

1. Introduction

Large margin classification methods have become increasingly popular since the advent of the support vector machine (SVM) [4] and boosting [7,8]. Recent developments include the large-margin unified machine of Liu et al. [16] and the flexible assortment machine of Qiao and Zhang [19]. Typically, large margin classification methods approximately solve an otherwise intractable optimization problem defined with the 0–1 loss. These algorithms were originally designed for binary classification problems. Unfortunately, generalization of them to the multicategory setting is often found to be non-trivial. The goal of this paper is to solve multicategory classification problems using the same margin principle as that for binary problems.

The conventional SVM based on the hinge loss function possesses a support vector interpretation (or data sparsity) but does not provide uncertainty estimates (that is, the SVM does not directly estimate the conditional class probability). The non-differentiable hinge loss function also makes it non-trivial to extend the conventional SVM from binary classification
problems to multiclass classification problems under the same margin principle [25,3,26,12,5,13]. Thus, one seemingly natural approach to constructing a classifier for both binary and multiclass problems is to consider a smooth loss function. For example, regularized logistic regression models based on the negative multinomial log-likelihood function (also called the logit loss) [31,10] are competitive with SVMs. Moreover, it is natural to exploit the logit loss in the development of a multicategory boosting algorithm [9]. Recently, Zhang et al. [30] proposed a smooth loss function called the coherence function for developing binary large margin classification methods. The coherence function establishes a bridge between the hinge loss and the logit loss. In this paper, we study the application of the coherence function to the multiclass classification problem.

1.1. Multicategory margin classification methods

We are concerned with an m-class (m > 2) classification problem with a set of training data points $\{(x_i, c_i)\}_{i=1}^{n}$ where $x_i \in \mathcal{X} \subset \mathbb{R}^p$ is an input vector and $c_i \in \{1, 2, \ldots, m\}$ is its corresponding class label. We assume that each x belongs to one and only one class. Our goal is to find a classifier $\phi : \mathcal{X} \to \{1, \ldots, m\}$.

Let $P_c(x) = \Pr(C = c \mid X = x)$, $c = 1, \ldots, m$, be the class conditional probabilities given x. The expected error at x is then defined by

$$\sum_{c=1}^{m} I_{\{\phi(x) \neq c\}} P_c(x),$$

where $I_{\{\#\}}$ is 1 if $\#$ is true and 0 otherwise. The empirical error on the training data is thus given by

$$\varepsilon = \frac{1}{n} \sum_{i=1}^{n} I_{\{\phi(x_i) \neq c_i\}}.$$

Given that $\varepsilon$ is equal to its minimum value zero when all training data points are correctly classified, we wish to use $\varepsilon$ as a basis for devising classification methods.

Suppose the classifier is modeled using an m-vector $g(x) = (g_1(x), \ldots, g_m(x))^T$, where the induced classifier is obtained via maximization in a manner akin to discriminant analysis: $\phi(x) = \mathrm{argmax}_j \{g_j(x)\}$. For simplicity of our analysis, we assume that for a fixed x, each $g_j$ itself lies in a compact set. We also assume that the maximizing argument of $\max_j g_j(x)$ is unique. Of course this excludes the trivial case that $g_j = 0$ for all $j \in \{1, \ldots, m\}$. However, this assumption does not imply that the maximum value is unique; indeed, adding a constant to each component $g_j(x)$ does not change the maximizing argument. To remove this redundancy, it is convenient to impose a sum-to-zero constraint. Thus we define

$$\mathcal{G} = \Big\{ \big(g_1(x), \ldots, g_m(x)\big)^T \;\Big|\; \sum_{j=1}^{m} g_j(x) = 0 \Big\}$$

and assume $g \in \mathcal{G}$ in this paper unless otherwise specified. Zou et al. [33] referred to such a g as the margin vector. Liu and Shen [15] referred to $\max_j (g_j(x) - g_c(x))$ as the generalized margin of (x, c) with respect to (w.r.t.) g.

Since a margin vector g induces a classifier, we explore the minimization of $\varepsilon$ w.r.t. g. However, this minimization problem is intractable because $I_{\{\phi(x) \neq c\}}$ is the 0–1 function. A wide variety of margin-based classifiers can be understood as minimizers of a surrogate loss function $\psi_c(g(x))$ which upper bounds the 0–1 loss $I_{\{\phi(x) \neq c\}}$. That is, various tractable surrogate loss functions $\psi_c(g(x))$ are used to upper bound $I_{\{\phi(x) \neq c\}}$. The corresponding empirical risk function is given by

$$\hat{R}(g) = \frac{1}{n} \sum_{i=1}^{n} \psi_{c_i}\big(g(x_i)\big).$$

If $\alpha$ is a positive constant that does not depend on (x, c), $\mathrm{argmin}_{g(x) \in \mathcal{G}} \frac{1}{\alpha} \hat{R}(g)$ is equivalent to $\mathrm{argmin}_{g(x) \in \mathcal{G}} \hat{R}(g)$.
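As a concrete illustration of this setup, the following minimal Python sketch (ours, not part of the paper; the function names and toy numbers are hypothetical) evaluates the induced classifier $\phi(x) = \mathrm{argmax}_j g_j(x)$, the empirical error $\varepsilon$, and the empirical surrogate risk $\hat{R}(g)$ when the margin vectors $g(x_i)$ are stored as rows of a matrix.

```python
import numpy as np

def induced_classifier(G):
    """G: n x m matrix whose i-th row is the margin vector g(x_i).
    Returns phi(x_i) = argmax_j g_j(x_i), with 0-based class labels."""
    return np.argmax(G, axis=1)

def empirical_error(G, y):
    """Empirical 0-1 error: (1/n) * sum_i I{phi(x_i) != c_i}."""
    return np.mean(induced_classifier(G) != y)

def empirical_risk(G, y, surrogate):
    """Empirical surrogate risk: (1/n) * sum_i psi_{c_i}(g(x_i)),
    where surrogate(g, c) evaluates psi_c(g) for a single example."""
    return np.mean([surrogate(g, c) for g, c in zip(G, y)])

# Toy example with m = 3 classes; each margin vector sums to zero.
G = np.array([[ 2.0,  3.0, -5.0],
              [ 1.5, -0.5, -1.0],
              [-2.0,  0.5,  1.5]])
y = np.array([0, 0, 2])
print(empirical_error(G, y))   # only the first point is misclassified -> 1/3
```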
We thus present the following definition.

Definition 1. A surrogate loss $\psi_c(g(x))$ is said to be the majorization of $I_{\{\phi(x) \neq c\}}$ w.r.t. (x, c) if $\psi_c(g(x)) \geq \alpha I_{\{\phi(x) \neq c\}}$, where $\alpha$ is a positive constant that does not depend on (x, c).

In practice, convex majorization functions play an important role in the development of classification algorithms. On one hand, convexity makes the resulting optimization problems computationally tractable. On the other hand, the resulting classification methods usually have better statistical properties.

Given a majorization function $\psi_c(g(x))$, the classifier resulting from the minimization of $\hat{R}(g)$ w.r.t. the margin vector g is called a large margin classifier or a margin-based classification method. In the binary classification setting, a wide variety of classifiers can be understood as minimizers of a majorization loss function of the 0–1 loss. If such functions satisfy
other technical conditions, the resulting classifiers can be shown to be Bayes consistent [1]. It seems reasonable to pursue a similar development in the case of multicategory classification, and indeed such a proposal has been made by Zou et al. [33] (also see [24,23]).

Definition 2. A surrogate function $\psi_c(g(x))$ is said to be Fisher-consistent w.r.t. a margin vector $g(x) = (g_1(x), \ldots, g_m(x))^T$ at (x, c) if (i) the following risk minimization problem

$$\hat{g}(x) = \mathrm{argmin}_{g(x) \in \mathcal{G}} \sum_{c=1}^{m} \psi_c\big(g(x)\big) P_c(x) \qquad (1)$$

has a unique solution $\hat{g}(x) = (\hat{g}_1(x), \ldots, \hat{g}_m(x))^T$; and (ii)

$$\mathrm{argmax}_c \, \hat{g}_c(x) = \mathrm{argmax}_c \, P_c(x).$$

Zou et al. [33] assumed that $\psi_c(g(x))$ follows the independent and identical setting; that is, $\psi_c(g(x)) = \eta(g_c(x))$ where $\eta$ is some loss function. As we see, Definition 2 does not require that the function $\psi_c(g(x))$ depend only on $g_c(x)$. Thus, this definition refines the definition of Zou et al. [33]. The definition is related to the notion of infinite-sample consistency (ISC) of Zhang [28]. ISC says that an exact solution of Problem (1) leads to a Bayes rule. However, it does not require that the solution of Problem (1) be unique. Additionally, Zhang [28] especially discussed two other settings: the pairwise comparison $\psi_c(g(x)) = \sum_{j \neq c} \eta(g_c(x) - g_j(x))$ and the constrained comparison $\psi_c(g(x)) = \sum_{j \neq c} \eta(-g_j(x))$.

In this paper, we are concerned with multicategory classification methods in which binary and multicategory problems are solved following the same principle. One of the principled approaches is due to Lee et al. [13]. The authors proposed a multicategory SVM (MSVM) which treats the m-class problem simultaneously. Moreover, Lee et al. [13] proved that their MSVM satisfies a Fisher consistency condition. Unfortunately, this desirable property does not hold for many other multiclass SVMs (see, e.g., [25,3,26,12]). The multiclass SVM of [5] possesses this property only if there is a dominating class (that is, $\max_j P_j(x) > 1/2$).

Recently, Liu and Shen [15] proposed a so-called multicategory $\psi$-learning algorithm by using a multicategory $\psi$ loss, and Wu and Liu [27] devised robust truncated-hinge-loss SVMs. These two algorithms are parallel to the multiclass SVM of Crammer and Singer [5] and enjoy a generalized pairwise comparison setting.

Additionally, Zhu et al. [32] and Saberian and Vasconcelos [21] devised several multiclass boosting algorithms, which solve binary and multicategory problems under the same principle. Mukherjee and Schapire [17] created a general framework for studying multiclass boosting, which formalizes the interaction between the boosting algorithm and the weak learner. We note that Gao and Koller [11] applied the multiclass hinge loss of Crammer and Singer [5] to devise a multiclass boosting algorithm. However, this algorithm is cast under an output coding framework.

1.2. Contributions and outline

In this paper, we study the Fisher consistency properties of multicategory surrogate losses. First, assuming that the losses are twice differentiable, we present a Fisher consistency property under a more general setting, which includes the independent and identical, constrained comparison and generalized pairwise comparison settings. We next propose a framework for constructing a majorization function of the 0–1 loss. This framework provides us with a natural and intuitive perspective for the construction of three extant multicategory hinge losses.
Under this framework, we conduct an in-depth analysis of the Fisher consistency properties of these three extant multicategory hinge losses. In particular, we give a sufficient condition under which the multiclass hinge loss used by Vapnik [25], Bredensteiner and Bennett [3], Weston and Watkins [26], and Guermeur [12] satisfies Fisher consistency. Moreover, we constructively derive the minimizers of the expected errors of the multiclass hinge loss of Crammer and Singer [5].

The framework also inspires us to propose a class of multicategory majorization functions which are based on the coherence function [30]. The coherence function is a smooth and convex majorization of the hinge function. In particular, its limit as the temperature approaches zero gives the hinge loss. Moreover, its relationship with the logit loss is also shown. Zhang et al. [30] originally exploited the coherence function in binary classification problems. We investigate its application in the development of multicategory margin classification methods. Based on the coherence function, we present three multicategory coherence losses which correspond to the three extant multicategory hinge losses. These multicategory coherence losses are infinitely smooth and convex, and they satisfy the Fisher consistency condition. The coherence losses have the advantage over the hinge losses that they provide an estimate of the conditional class probability, and over the multicategory logit loss that their limiting versions at zero temperature are just the corresponding multicategory hinge loss functions. Thus they are very appropriate for use in the development of multicategory large margin classification methods, especially boosting algorithms. We propose in this paper a multiclass C learning algorithm and a multiclass GentleBoost algorithm, both based on our multicategory coherence loss functions.

The remainder of this paper is organized as follows. Section 2 gives a general result on Fisher consistency. In Section 3, we discuss the methodology for the construction of multicategory majorization losses and present two majorization losses
based on the coherence function. Section 4 develops another multicategory coherence loss that we call the multiclass C-loss. Based on the multiclass C-loss, a multiclass C learning algorithm and a multiclass GentleBoost algorithm are given in Section 5. We conduct an empirical analysis of the multicategory large margin algorithms in Section 6, and conclude our work in Section 7. All proofs are deferred to the appendix.

2. A general result on Fisher consistency

Using the notion and notation given in Section 1.1, we now consider a more general setting than the pairwise comparison. Let $g^c(x) = (g_1(x) - g_c(x), \ldots, g_{c-1}(x) - g_c(x), g_{c+1}(x) - g_c(x), \ldots, g_m(x) - g_c(x))^T$. We define $\psi_c(g(x))$ as a function of $g^c(x)$ (thereafter denoted $f(g^c(x))$). It is clear that the pairwise comparison $\psi_c(g(x)) = \sum_{j \neq c} \eta(g_c(x) - g_j(x))$, the multiclass hinge loss of Crammer and Singer [5], the multicategory $\psi$-loss of Liu and Shen [15], and the truncated hinge loss of Wu and Liu [27] follow this generalized definition. Moreover, for these cases, we note that $f(g^c)$ is symmetric (a symmetric function of p variables is one whose value at any p-tuple of arguments is the same as its value at any permutation of that p-tuple).

Furthermore, we present a unifying definition of $\psi(g) = (\psi_1(g), \ldots, \psi_m(g))^T : \mathbb{R}^m \to \mathbb{R}^m$, where we ignore the dependency of g on x. Let $\Psi$ be a set of mappings $\psi(g)$ satisfying the conditions: (i) for fixed $g_c$, $\psi_c(g)$ is symmetric w.r.t. the remaining arguments, and (ii) $\psi_c(g) = \psi_j(g^{jc})$, where $g^{jc}$ is obtained by only exchanging $g_c$ and $g_j$ of g. Obviously, the mapping $\psi(g)$ defined via the independent and identical setting, the constrained comparison, or the generalized pairwise comparison with symmetric f belongs to $\Psi$. With this notion, we give an important theorem of this paper as follows.

Theorem 3. Let $\psi(g) \in \Psi$ be a twice differentiable function from $\mathbb{R}^m$ to $\mathbb{R}^m$. Assume that the Hessian matrix of $\psi_c(g)$ w.r.t. g is conditionally positive definite for $c = 1, \ldots, m$. Then the minimizer $\hat{g} = (\hat{g}_1, \ldots, \hat{g}_m)^T$ of $\sum_c \psi_c(g(x)) P_c(x)$ in $\mathcal{G}$ exists and is unique. Furthermore, if $\frac{\partial \psi_c(g)}{\partial g_c} - \frac{\partial \psi_c(g)}{\partial g_j}$, where $j \neq c$, is negative for any $g \in \mathcal{G}$, then $P_l > P_k$ implies $\hat{g}_l > \hat{g}_k$.

The proof of the theorem is given in Appendix A.1. Note that an $m \times m$ real matrix A is said to be conditionally positive definite if $y^T A y > 0$ for any nonzero real vector $y = (y_1, \ldots, y_m)^T$ with $\sum_{j=1}^{m} y_j = 0$. The condition that $\frac{\partial \psi_c(g)}{\partial g_c} - \frac{\partial \psi_c(g)}{\partial g_j} < 0$ on $\mathcal{G}$ for $j \neq c$ is not necessary for Fisher consistency. For example, in the setting $\psi_c(g(x)) = \eta(g_c(x))$, Zou et al. [33] proved that if $\eta(z)$ is a twice differentiable function with $\eta'(0) < 0$ and $\eta''(z) > 0$ for all z, then $\psi_c(g(x))$ is Fisher-consistent.

In the setting $\psi_c(g(x)) = \sum_{j \neq c} \eta(-g_j)$, we note that $\sum_c \sum_{j \neq c} \eta(-g_j) P_c = \sum_{c=1}^{m} \eta(-g_c)(1 - P_c)$. Based on the proof of Zou et al. [33], we have that if $\eta(z)$ is a twice differentiable function with $\eta'(0) < 0$ and $\eta''(z) > 0$ for all z, then $\psi_c(g(x)) = \sum_{j \neq c} \eta(-g_j(x))$ is Fisher-consistent. That is, in these two cases, we can relax the condition that $\frac{\partial \psi_c(g)}{\partial g_c} - \frac{\partial \psi_c(g)}{\partial g_j} < 0$ for any $g \in \mathcal{G}$, and still conclude that the minimizer $\hat{g}$ exists and is unique and that $P_l > P_k$ implies $\hat{g}_l > \hat{g}_k$ (Corollary 4).

Theorem 3 or Corollary 4 shows that $\psi_c(g)$ admits the ISC of Zhang [28]. Thus, under the conditions in Theorem 3 or Corollary 4, we also have a relationship between approximate minimization of the risk based on $\psi_c$ and approximate minimization of the classification error. In particular, if

$$\mathbb{E}_X \Big[ \sum_{c=1}^{m} \psi_c\big(\hat{g}(X)\big) P_c(X) \Big] \leq \inf_{g \in \mathcal{G}} \mathbb{E}_X \Big[ \sum_{c=1}^{m} \psi_c\big(g(X)\big) P_c(X) \Big] + \epsilon_1$$

for some $\epsilon_1 > 0$, then there exists an $\epsilon_2 > 0$ such that

$$\mathbb{E}_X \Big[ \sum_{c=1, c \neq \hat{\phi}(X)}^{m} P_c(X) \Big] \leq \mathbb{E}_X \Big[ \sum_{c=1, c \neq \phi^*(X)}^{m} P_c(X) \Big] + \epsilon_2,$$

where $\hat{\phi}(X) = \mathrm{argmax}_j \{\hat{g}_j(X)\}$, $\phi^*(X) = \mathrm{argmax}_j \{P_j(X)\}$ and $\mathbb{E}_X[\sum_{c=1, c \neq \phi^*(X)}^{m} P_c(X)]$ is the optimal error. This result directly follows from Theorem 3 in Zhang [28].
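As a numerical illustration of the ordering property discussed above (our own sketch, not part of the paper), one can minimize $\sum_c \psi_c(g) P_c(x)$ over $\mathcal{G}$ for the constrained-comparison loss $\psi_c(g) = \sum_{j \neq c} \eta(-g_j)$ with the logistic base loss $\eta(z) = \log(1 + e^{-z})$, which is twice differentiable with $\eta'(0) < 0$ and $\eta''(z) > 0$, and check that the components of the minimizer are ordered like the class probabilities:

```python
import numpy as np
from scipy.optimize import minimize

def eta(z):
    # logistic base loss: twice differentiable, eta'(0) = -1/2 < 0, eta'' > 0
    return np.log1p(np.exp(-z))

def conditional_risk(free_g, P):
    # parameterize g in G by its first m-1 coordinates; g_m = -sum of the rest
    g = np.append(free_g, -np.sum(free_g))
    m = len(P)
    # constrained comparison: psi_c(g) = sum_{j != c} eta(-g_j)
    return sum(P[c] * sum(eta(-g[j]) for j in range(m) if j != c)
               for c in range(m))

P = np.array([0.5, 0.3, 0.2])                       # class probabilities at x
res = minimize(conditional_risk, np.zeros(len(P) - 1), args=(P,))
g_hat = np.append(res.x, -np.sum(res.x))
print(g_hat)                                        # g_1 > g_2 > g_3, sum = 0
print(np.array_equal(np.argsort(-g_hat), np.argsort(-P)))  # True: same ordering as P
```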
3. Multicategory majorization losses

Given x and its label c, we let g(x) be a margin vector at x and the induced classifier be $\phi(x) = \mathrm{argmax}_j g_j(x)$. In the binary case, it is clear that $\phi(x) = c$ if and only if $g_c(x) > 0$, and that $g_c(x) \leq 0$ is a necessary and sufficient condition of
$\phi(x) \neq c$. Thus, we always have $I_{\{g_c(x) \leq 0\}} = I_{\{\phi(x) \neq c\}}$. Furthermore, let $g_1(x) = -g_2(x) = \frac{1}{2} f(x)$ and encode $y = 1$ if $c = 1$ and $y = -1$ if $c = 2$. Then the empirical error is

$$\varepsilon = \frac{1}{n} \sum_{i=1}^{n} I_{\{\phi(x_i) \neq c_i\}} = \frac{1}{n} \sum_{i=1}^{n} I_{\{g_{c_i}(x_i) \leq 0\}} = \frac{1}{n} \sum_{i=1}^{n} I_{\{y_i f(x_i) \leq 0\}}.$$

In the multicategory case, $\phi(x) = c$ implies $g_c(x) > 0$, but $\phi(x) \neq c$ does not imply $g_c(x) \leq 0$. We shall see that $g_c(x) \leq 0$ is a sufficient but not necessary condition of $\phi(x) \neq c$. In general, we only have $I_{\{g_c(x) \leq 0\}} \leq I_{\{\phi(x) \neq c\}}$. Although $g_j(x) - g_c(x) > 0$ for some $j \neq c$ is a necessary and sufficient condition of $\phi(x) \neq c$, it in most cases yields an optimization problem which is not easily solved. This is an important reason why it is not trivial to develop multicategory AdaBoost and SVMs with the same principle as binary AdaBoost and SVMs.

3.1. Methodology

Recall that $\sum_{j=1}^{m} g_j(x) = 0$ and there is at least one $j \in \{1, \ldots, m\}$ such that $g_j(x) \neq 0$. If $g_c(x) \leq 0$, then there exists one $l \in \{1, \ldots, m\}$ such that $l \neq c$ and $g_l(x) > 0$. As a result, we have $\phi(x) \neq c$. Therefore, $g_c(x) \leq 0$ implies $\phi(x) \neq c$. Unfortunately, if $\phi(x) \neq c$, $g_c(x) \leq 0$ does not necessarily hold. For example, consider the case that $m = 3$ and $c = 1$. Assume that $g(x) = (2, 3, -5)$. Then we have $\phi(x) = 2 \neq 1$ and $g_1(x) = 2 > 0$. In addition, it is clear that $\phi(x) = c$ implies $g_c(x) > 0$. However, $g_c(x) > 0$ does not imply $\phi(x) = c$.

On the other hand, it is obvious that $\phi(x) = c$ is equivalent to $\phi(x) \neq j$ for all $j \neq c$. In terms of the above discussion, a condition making $\phi(x) = c$ is $g_j(x) \leq 0$ for $j \neq c$. To summarize, we immediately have the following proposition.

Proposition 5. For (x, c), let g(x) be a margin vector at x and the induced classifier be $\phi(x) = \mathrm{argmax}_j g_j(x)$. Then

(a) $I_{\{g_c(x) \leq 0\}} \leq I_{\{\phi(x) \neq c\}} = I_{\{\bigcup_{j \neq c} \{g_j(x) - g_c(x) > 0\}\}} \leq I_{\{\bigcup_{j \neq c} \{g_j(x) > 0\}\}}$;
(b) $I_{\{\bigcap_{j \neq c} \{g_j(x) \leq 0\}\}} \leq I_{\{\bigcap_{j \neq c} \{g_j(x) - g_c(x) \leq 0\}\}} = I_{\{\phi(x) = c\}} \leq I_{\{g_c(x) > 0\}}$;
(c) $I_{\{\bigcup_{j \neq c} \{g_j(x) > 0\}\}} \leq \sum_{j \neq c} I_{\{g_j(x) > 0\}}$;
(d) $I_{\{\bigcup_{j \neq c} \{g_j(x) - g_c(x) > 0\}\}} \leq \sum_{j \neq c} I_{\{g_j(x) - g_c(x) > 0\}}$.

Proposition 5 shows that $g_c(x) \leq 0$ is a sufficient condition of $\phi(x) \neq c$, while $g_j(x) > 0$ for some $j \neq c$ is a necessary condition. The following proposition shows that they become sufficient and necessary when g has one and only one positive element.

Proposition 6. Under the conditions in Proposition 5, the relationship

$$I_{\{g_c(x) \leq 0\}} = I_{\{\phi(x) \neq c\}} = I_{\{\bigcup_{j \neq c} \{g_j(x) - g_c(x) > 0\}\}} = I_{\{\bigcup_{j \neq c} \{g_j(x) > 0\}\}} = \sum_{j \neq c} I_{\{g_j(x) > 0\}}$$

holds if and only if the margin vector g(x) has one and only one positive element.

In the binary case, this relationship always holds because $g_1(x) = -g_2(x)$. Recently, Zou et al. [33] derived multicategory boosting algorithms using $\exp(-g_c(x))$. In their discrete boosting algorithm, the margin vector g(x) is modeled as an m-vector function with one and only one positive element. In this case, $I_{\{g_c(x) \leq 0\}}$ is equal to $I_{\{\phi(x) \neq c\}}$. Consequently, $\exp(-g_c(x))$ is a majorization of $I_{\{\phi(x) \neq c\}}$ because $\exp(-g_c(x))$ is an upper bound of $I_{\{g_c(x) \leq 0\}}$. Therefore, this discrete AdaBoost algorithm still approximates the original empirical 0–1 loss function. In the general case, however, Proposition 6 implies that $\exp(-g_c(x))$ is not a majorization of $I_{\{\phi(x) \neq c\}}$.
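The relations in Proposition 5 are easy to check numerically. The sketch below (ours, illustrative only) evaluates the indicators in part (a) for the example used above, $g(x) = (2, 3, -5)$ with $m = 3$ and $c = 1$ (index 0 in the code):

```python
import numpy as np

def proposition5_part_a(g, c):
    """Indicator values appearing in Proposition 5(a) for a zero-sum
    margin vector g and true class c (0-based)."""
    g = np.asarray(g, dtype=float)
    others = np.delete(np.arange(len(g)), c)
    return (int(g[c] <= 0),                        # I{g_c(x) <= 0}
            int(np.argmax(g) != c),                # I{phi(x) != c}
            int(np.any(g[others] - g[c] > 0)),     # I{union_{j!=c} {g_j - g_c > 0}}
            int(np.any(g[others] > 0)))            # I{union_{j!=c} {g_j > 0}}

g, c = np.array([2.0, 3.0, -5.0]), 0
print(proposition5_part_a(g, c))   # (0, 1, 1, 1): the chain 0 <= 1 = 1 <= 1 of part (a)

# Part (c), the union bound: I{union_{j!=c} {g_j > 0}} <= sum_{j!=c} I{g_j > 0}
others = np.delete(np.arange(len(g)), c)
print(int(np.any(g[others] > 0)) <= int(np.sum(g[others] > 0)))   # True
```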
3.2. Approaches

Proposition 5 provides us with approaches for constructing majorization functions of the 0–1 loss $I_{\{\phi(x) \neq c\}}$. Clearly, $\sum_{j \neq c} I_{\{g_j(x) > 0\}}$ and $\sum_{j \neq c} I_{\{g_j(x) - g_c(x) > 0\}}$ are separable, so they are more tractable than $I_{\{\bigcup_{j \neq c} \{g_j(x) > 0\}\}}$ and $I_{\{\bigcup_{j \neq c} \{g_j(x) - g_c(x) > 0\}\}}$, respectively. Thus, $\sum_{j \neq c} I_{\{g_j(x) > 0\}}$ and $\sum_{j \neq c} I_{\{g_j(x) - g_c(x) > 0\}}$ are popularly employed in practical applications.

In particular, suppose $\eta(g_j(x))$ upper bounds $I_{\{g_j(x) \leq 0\}}$; that is, $\eta(g_j(x)) \geq I_{\{g_j(x) \leq 0\}}$. Note that $\eta(g_j(x)) \geq I_{\{g_j(x) \leq 0\}}$ if and only if $\eta(-g_j(x)) \geq I_{\{g_j(x) \geq 0\}}$. Thus $\eta(-g_j(x))$ upper bounds $I_{\{g_j(x) \geq 0\}}$, and hence $\eta(g_j(x) - g_l(x))$ upper bounds $I_{\{g_l(x) - g_j(x) > 0\}}$. It then follows from Proposition 5 that $\sum_{j \neq c} \eta(-g_j(x))$ and $\sum_{j \neq c} \eta(g_c(x) - g_j(x))$ are majorizations of $I_{\{\phi(x) \neq c\}}$. Consequently, we can define two classes of majorizations for $I_{\{\phi(x) \neq c\}}$. The first one is

$$\psi_c\big(g(x)\big) = \sum_{l \neq c} \eta\big(g_c(x) - g_l(x)\big), \qquad (2)$$

while the second one is

$$\psi_c\big(g(x)\big) = \sum_{l \neq c} \eta\big(-g_l(x)\big). \qquad (3)$$
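A minimal sketch (ours, not from the paper) of the two constructions (2) and (3), using the hinge-type base loss $\eta(z) = (1 - z)_+$ as one admissible upper bound of $I_{\{z \leq 0\}}$; any other such $\eta$ could be substituted:

```python
import numpy as np

def eta(z):
    # base loss upper-bounding I{z <= 0}; here the hinge-type (1 - z)_+
    return np.maximum(1.0 - z, 0.0)

def pairwise_comparison(g, c):
    # Eq. (2): psi_c(g) = sum_{l != c} eta(g_c - g_l)
    others = np.delete(np.arange(len(g)), c)
    return np.sum(eta(g[c] - g[others]))

def constrained_comparison(g, c):
    # Eq. (3): psi_c(g) = sum_{l != c} eta(-g_l)
    others = np.delete(np.arange(len(g)), c)
    return np.sum(eta(-g[others]))

g = np.array([2.0, 3.0, -5.0])                 # zero-sum margin vector, phi(x) = 2
for c in range(3):
    zero_one = float(np.argmax(g) != c)        # I{phi(x) != c}
    assert pairwise_comparison(g, c) >= zero_one
    assert constrained_comparison(g, c) >= zero_one
```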
This leads us to two approaches for constructing majorizations $\psi_c(g(x))$ of $I_{\{\phi(x) \neq c\}}$. Zhang [28] referred to them as pairwise comparison and constrained comparison. A theoretical analysis of these two classes of majorization functions has also been presented by Zhang [28]. His analysis mainly focused on consistency of empirical risk minimization and the ISC property of surrogate losses. Our results in Section 3.1 show a direct and intuitive connection of these two approaches with the original 0–1 loss.

3.3. Multicategory hinge losses

Using distinct $\eta(g_j(x))$ ($\geq I_{\{g_j(x) \leq 0\}}$) in the two approaches, we can construct different multicategory losses for large margin classifiers. For example, let $\eta(g_j(x)) = (1 - g_j(x))_+$, which upper bounds $I_{\{g_j(x) \leq 0\}}$. Then $\sum_{j \neq c} (1 - g_c(x) + g_j(x))_+$ and $\sum_{j \neq c} (1 + g_j(x))_+$ are candidate majorizations for $I_{\{\phi(x) \neq c\}}$, which yield two multiclass SVM methods.

In the multicategory SVM (MSVM), Lee et al. [13] employed $\sum_{j \neq c} (1 + g_j(x))_+$ as a multicategory hinge loss. Moreover, Lee et al. [13] proved that this multicategory hinge loss is Fisher-consistent. In particular, the minimizer of $\sum_{c=1}^{m} \sum_{j \neq c} (1 + g_j(x))_+ P_c(x)$ w.r.t. $g \in \mathcal{G}$ is $\hat{g}_l(x) = m - 1$ if $l = \mathrm{argmax}_j P_j(x)$ and $\hat{g}_l(x) = -1$ otherwise.

The pairwise comparison $\sum_{j \neq c} (1 - g_c(x) + g_j(x))_+$ was used by Vapnik [25], Weston and Watkins [26], Bredensteiner and Bennett [3], and Guermeur [12]. Unfortunately, Lee et al. [13], Zhang [28] and Liu [14] showed that solutions of the corresponding optimization problem do not always implement the Bayes decision rule. However, we find that it is still Fisher-consistent under certain conditions. In particular, we have the following theorem (the proof is given in Appendix B.1).

Theorem 7. Let $P_j(x) > 0$ for $j = 1, \ldots, m$, $P_l(x) = \max_j P_j(x)$ and $P_k(x) = \max_{j \neq l} P_j(x)$, and let

$$\hat{g}(x) = \mathrm{argmin}_{g(x) \in \mathcal{G}} \sum_{c=1}^{m} P_c(x) \sum_{j \neq c} \big(1 - g_c(x) + g_j(x)\big)_+.$$

If $P_l(x) > 1/2$ or $P_k(x) < 1/m$, then $\hat{g}_l(x) > \hat{g}_j(x)$ for $j \neq l$, so the majorization function $\sum_{j \neq c} (1 - g_c(x) + g_j(x))_+$ is Fisher-consistent when $P_l(x) > 1/2$ or $P_k(x) < 1/m$.

It is known that this loss can fail to be Fisher-consistent when neither condition holds; a standard counterexample takes $m = 3$ with $1/2 \geq P_l > P_k \geq 1/3$. Theorem 7 shows that for any $m \geq 3$ the consistency is also satisfied whenever $P_k(x) < 1/m$.

In addition, $I_{\{\bigcup_{j \neq c} \{g_j(x) - g_c(x) > 0\}\}}$ can also be used as a starting point to construct a majorization of $I_{\{\phi(x) \neq c\}}$. Since $I_{\{\bigcup_{j \neq c} \{g_j(x) - g_c(x) > 0\}\}} = I_{\{\max_{j \neq c} g_j(x) - g_c(x) > 0\}}$, we call this construction approach the maximum pairwise comparison. In fact, this approach was employed by Crammer and Singer [5], Liu and Shen [15] and Wu and Liu [27]. Especially, Crammer and Singer [5] used the surrogate

$$\xi_c\big(g(x)\big) = \max_j \big\{ g_j(x) + 1 - I_{\{j=c\}} \big\} - g_c(x). \qquad (4)$$

It is easily seen that

$$I_{\{\bigcup_{j \neq c} \{g_j(x) - g_c(x) > 0\}\}} \leq \max_j \big\{ g_j(x) + 1 - I_{\{j=c\}} \big\} - g_c(x) \leq \sum_{j \neq c} \big(1 + g_j(x) - g_c(x)\big)_+,$$

which implies that $\xi_c(g(x))$ is a tighter upper bound of $I_{\{\phi(x) \neq c\}}$ than $\sum_{j \neq c} (1 - g_c(x) + g_j(x))_+$.
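For reference, here is a small Python sketch (ours, illustrative only) of the three multicategory hinge losses discussed in this subsection, together with the ordering $I_{\{\phi(x) \neq c\}} \leq \xi_c(g(x)) \leq \sum_{j \neq c}(1 - g_c(x) + g_j(x))_+$ noted above:

```python
import numpy as np

def pos(z):
    return np.maximum(z, 0.0)

def hinge_lee(g, c):
    # Lee et al. [13]: sum_{j != c} (1 + g_j)_+
    others = np.delete(np.arange(len(g)), c)
    return np.sum(pos(1.0 + g[others]))

def hinge_pairwise(g, c):
    # Vapnik / Weston-Watkins style: sum_{j != c} (1 - g_c + g_j)_+
    others = np.delete(np.arange(len(g)), c)
    return np.sum(pos(1.0 - g[c] + g[others]))

def hinge_crammer_singer(g, c):
    # Eq. (4): max_j {g_j + 1 - I{j = c}} - g_c
    bonus = np.ones_like(g)
    bonus[c] = 0.0
    return np.max(g + bonus) - g[c]

g = np.array([0.5, 1.0, -1.5])       # zero-sum margin vector
for c in range(3):
    zero_one = float(np.argmax(g) != c)
    cs = hinge_crammer_singer(g, c)
    # ordering from Section 3.3: 0-1 loss <= CS hinge <= pairwise hinge
    assert zero_one <= cs <= hinge_pairwise(g, c) + 1e-12
    assert zero_one <= hinge_lee(g, c)
```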
Note that Crammer and Singer [5] did not assume $g \in \mathcal{G}$, but Liu and Shen [15] argued that this assumption is also necessary. Zhang [28] showed that $\xi_c(g(x))$ is Fisher-consistent only when $P_l(x) > 1/2$. However, an explicit expression of the minimizer of the expected error in question has not been given in the literature. Here we present the constructive solution of the corresponding minimization problem in the following theorem (the proof is given in Appendix B.2).

Theorem 8. Consider the following optimization problem:

$$\hat{g}(x) = \mathrm{argmin}_{g(x) \in \mathcal{G}} \sum_{c=1}^{m} \Big( \max_j \big\{ g_j(x) + 1 - I_{\{j=c\}} \big\} - g_c(x) \Big) P_c(x). \qquad (5)$$

Assume that $P_l(x) = \max_j P_j(x)$.

(1) If $P_l(x) > 1/2$, then $\hat{g}_j(x) = I_{\{j=l\}} - \frac{1}{m}$ for $j = 1, \ldots, m$;
(2) If $P_l(x) = 1/2$, then $0 \leq \hat{g}_l(x) - \hat{g}_j(x) \leq 1$ and $\hat{g}_j(x) = \hat{g}_c(x)$ for $c, j \neq l$;
(3) If $P_l(x) < 1/2$, then $\hat{g}_j(x) = 0$ for $j = 1, \ldots, m$.

Theorem 8 shows that $\xi_c(g(x))$ is Fisher-consistent only when $P_l(x) > 1/2$. Otherwise, the solution of (5) degenerates to the trivial point. As we have seen from Theorems 7 and 8, $P_l(x) > 1/2$ is a sufficient condition for both $\max_j \{g_j(x) + 1 - I_{\{j=c\}}\} - g_c(x)$ and $\sum_{j \neq c} (1 - g_c(x) + g_j(x))_+$ to be Fisher-consistent. Moreover, they satisfy the condition $\hat{g}_l(x) = 1 + \hat{g}_k(x)$, where $k = \mathrm{argmax}_{j \neq l} P_j(x)$. However, as shown in Theorem 7, $\sum_{j \neq c} (1 - g_c(x) + g_j(x))_+$ still yields the Fisher-consistency property when $P_k(x) < 1/m$.

We now turn to majorizations of the 0–1 loss built from the coherence function of Zhang et al. [30], which is defined by

$$\eta_T(z) \triangleq T \log\Big[1 + \exp\Big(\frac{1 - z}{T}\Big)\Big], \quad T > 0, \qquad (6)$$

where T is called the temperature parameter. Clearly, $\eta_T(z) \geq (1 - z)_+ \geq I_{\{z \leq 0\}}$. Moreover, $\lim_{T \to 0} \eta_T(z) = (1 - z)_+$. Thus, we directly have two majorizations of $I_{\{\phi(x) \neq c\}}$ based on the constrained comparison method and the pairwise comparison method.

Using the constrained comparison, we give a smooth approximation to $\sum_{j \neq c} (1 + g_j(x))_+$ for the MSVM of Lee et al. [13]. That is,

$$L_T\big(g(x), c\big) \triangleq T \sum_{j \neq c} \log\Big[1 + \exp\Big(\frac{1 + g_j(x)}{T}\Big)\Big].$$

It is immediate that $L_T(g(x), c) \geq \sum_{j \neq c} (1 + g_j(x))_+$ and $\lim_{T \to 0} L_T(g(x), c) = \sum_{j \neq c} (1 + g_j(x))_+$. Furthermore, we have the following theorem (the proof is given in Appendix B.3).

Theorem 9. Assume that $P_c(x) > 0$ for $c = 1, \ldots, m$. Consider the optimization problem

$$\min_{g(x) \in \mathcal{G}} \sum_{c=1}^{m} L_T\big(g(x), c\big) P_c(x) \qquad (7)$$

for a fixed $T > 0$, and let $\hat{g}(x) = (\hat{g}_1(x), \ldots, \hat{g}_m(x))^T$ be its solution. Then $\hat{g}(x)$ is unique. Moreover, if $P_i(x) < P_j(x)$, then $\hat{g}_i(x) < \hat{g}_j(x)$; hence $L_T(g(x), c)$ is Fisher-consistent. Furthermore, as $T \to 0$, $\hat{g}(x)$ converges to the minimizer of $\sum_{c=1}^{m} \sum_{j \neq c} (1 + g_j(x))_+ P_c(x)$ given in Section 3.3.

In addition, the first-order optimality conditions of (7) yield an explicit expression, (8), of the class conditional probabilities $P_c(x)$ in terms of $\hat{g}(x)$ (see Remark 2 below).

Using the pairwise comparison, we analogously obtain a smooth approximation to $\sum_{j \neq c} (1 - g_c(x) + g_j(x))_+$:

$$G_T\big(g(x), c\big) \triangleq T \sum_{j \neq c} \log\Big[1 + \exp\Big(\frac{1 + g_j(x) - g_c(x)}{T}\Big)\Big], \qquad (9)$$

which satisfies $G_T(g(x), c) \geq \sum_{j \neq c} (1 - g_c(x) + g_j(x))_+$ and $\lim_{T \to 0} G_T(g(x), c) = \sum_{j \neq c} (1 - g_c(x) + g_j(x))_+$. For comparison, the multicategory logit loss is

$$\gamma_c\big(g(x)\big) = \log\Big[1 + \sum_{j \neq c} \exp\big(g_j(x) - g_c(x)\big)\Big]. \qquad (10)$$
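The coherence function and the two coherence losses above can be written compactly with a log-sum-exp. The sketch below (ours, not part of the paper) also checks numerically that $\eta_T(z) \geq (1 - z)_+$ and that the bound tightens as $T \to 0$:

```python
import numpy as np

def eta_T(z, T):
    # coherence function, Eq. (6): eta_T(z) = T * log(1 + exp((1 - z) / T))
    return T * np.logaddexp(0.0, (1.0 - z) / T)

def L_T(g, c, T):
    # constrained-comparison coherence loss: sum_{j != c} eta_T(-g_j)
    others = np.delete(np.arange(len(g)), c)
    return np.sum(eta_T(-g[others], T))

def G_T(g, c, T):
    # pairwise-comparison coherence loss, Eq. (9): sum_{j != c} eta_T(g_c - g_j)
    others = np.delete(np.arange(len(g)), c)
    return np.sum(eta_T(g[c] - g[others], T))

z = np.linspace(-3.0, 3.0, 13)
for T in (1.0, 0.1, 0.01):
    assert np.all(eta_T(z, T) >= np.maximum(1.0 - z, 0.0))     # eta_T >= (1 - z)_+
print(np.max(np.abs(eta_T(z, 1e-3) - np.maximum(1.0 - z, 0.0))))  # ~0 as T -> 0

g, c = np.array([0.5, 1.0, -1.5]), 0
print(L_T(g, c, 1.0), G_T(g, c, 1.0))   # smooth upper bounds of the two hinge losses
```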
Similar to Theorem 9, we have the following result for $G_T(g(x), c)$.

Theorem 10. Assume that $P_c(x) > 0$ for $c = 1, \ldots, m$. Let $P_l(x) = \max_j P_j(x)$ and $P_k(x) = \max_{j \neq l} P_j(x)$. Consider the optimization problem

$$\min_{g(x) \in \mathcal{G}} \sum_{c=1}^{m} G_T\big(g(x), c\big) P_c(x)$$

for a fixed $T > 0$, and let $\hat{g}(x) = (\hat{g}_1(x), \ldots, \hat{g}_m(x))^T$ be its solution. Then $\hat{g}(x)$ is unique. Moreover, if $P_i(x) < P_j(x)$, then $\hat{g}_i(x) < \hat{g}_j(x)$; hence $G_T(g(x), c)$ is Fisher-consistent. Furthermore, if $P_l(x) > 1/2$ or $P_k(x) < 1/m$, then, as $T \to 0$, $\hat{g}(x)$ converges to a minimizer of $\sum_{c=1}^{m} \sum_{j \neq c} (1 - g_c(x) + g_j(x))_+ P_c(x)$, in agreement with Theorem 7.

4. Multiclass C-loss

In this section we scale the 0–1 loss by a constant $u > 0$; that is, we work with $u I_{\{\phi(x) \neq c\}}$ for $u > 0$. This setting will reveal an important connection between the hinge loss and the logit loss. The empirical error on the training data is then

$$\varepsilon = \frac{u}{n} \sum_{i=1}^{n} I_{\{\phi(x_i) \neq c_i\}}.$$

In this setting, we can extend the multiclass hinge loss $\xi_c(g(x))$ as

$$H_u\big(g(x), c\big) = \max_j \big\{ g_j(x) + u - u I_{\{j=c\}} \big\} - g_c(x). \qquad (11)$$

It is clear that $H_u(g(x), c) \geq u I_{\{\phi(x) \neq c\}}$. To establish the connection among the multiclass C-loss, the multiclass hinge loss and the logit loss, we employ this setting to present the definition of the multiclass C-loss.

We now express $\max_j \{g_j(x) + u - u I_{\{j=c\}}\}$ as $\sum_{j=1}^{m} \omega_j^c(x) [g_j(x) + u - u I_{\{j=c\}}]$, where

$$\omega_j^c(x) = \begin{cases} 1 & j = \mathrm{argmax}_l \{g_l(x) + u - u I_{\{l=c\}}\}, \\ 0 & \text{otherwise}. \end{cases}$$

Motivated by the idea behind deterministic annealing [20], we relax this hard assignment $\omega_j^c(x)$, retaining only $\omega_j^c(x) \geq 0$ and $\sum_{j=1}^{m} \omega_j^c(x) = 1$. With such soft $\omega_j^c(x)$, we maximize $\sum_{j=1}^{m} \omega_j^c(x) [g_j(x) + u - u I_{\{j=c\}}] - g_c(x)$ under entropy penalization; namely,

$$\max_{\{\omega_j^c(x)\}} \; F \triangleq \sum_{j=1}^{m} \omega_j^c(x) \big[g_j(x) + u - u I_{\{j=c\}}\big] - g_c(x) - T \sum_{j=1}^{m} \omega_j^c(x) \log \omega_j^c(x), \qquad (12)$$

where $T > 0$ is also referred to as the temperature. The maximization of F w.r.t. $\omega_j^c(x)$ is straightforward, and it gives rise to the following distribution

$$\omega_j^c(x) = \frac{\exp\big[\frac{g_j(x) + u - u I_{\{j=c\}}}{T}\big]}{\sum_l \exp\big[\frac{g_l(x) + u - u I_{\{l=c\}}}{T}\big]} \qquad (13)$$

based on the Karush–Kuhn–Tucker conditions. The corresponding maximum of F is obtained by plugging (13) back into (12):

$$C_{T,u}\big(g(x), c\big) \triangleq T \log\Big[1 + \sum_{j \neq c} \exp\Big(\frac{u + g_j(x) - g_c(x)}{T}\Big)\Big], \quad T > 0, \; u > 0. \qquad (14)$$
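A minimal numerical sketch (ours) of the multiclass C-loss (14), the extended hinge loss (11) and the soft assignment (13), using a log-sum-exp implementation for stability; it checks that $C_{T,u}(g(x), c) \geq H_u(g(x), c) \geq u I_{\{\phi(x) \neq c\}}$, the inequalities established next:

```python
import numpy as np
from scipy.special import logsumexp

def H_u(g, c, u):
    # Eq. (11): extended multiclass hinge, max_j {g_j + u - u*I{j=c}} - g_c
    bonus = np.full_like(g, u)
    bonus[c] = 0.0
    return np.max(g + bonus) - g[c]

def C_Tu(g, c, T, u):
    # Eq. (14): multiclass C-loss; the j = c term contributes the "1 +" inside the log
    bonus = np.full_like(g, u)
    bonus[c] = 0.0
    return T * logsumexp((g + bonus - g[c]) / T)

def omega(g, c, T, u):
    # Eq. (13): soft assignment weights from the deterministic-annealing relaxation
    bonus = np.full_like(g, u)
    bonus[c] = 0.0
    a = (g + bonus) / T
    return np.exp(a - logsumexp(a))

g, c, u = np.array([0.5, 1.0, -1.5]), 0, 1.0
for T in (1.0, 0.1, 0.01):
    assert C_Tu(g, c, T, u) >= H_u(g, c, u) >= u * float(np.argmax(g) != c)
print(C_Tu(g, c, 0.01, u), H_u(g, c, u))   # the C-loss approaches the hinge as T -> 0
print(omega(g, c, 1.0, u))                 # nonnegative weights summing to one
```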
Note that for $T > 0$ we have

$$T \log\Big[1 + \sum_{j \neq c} \exp\Big(\frac{u + g_j(x) - g_c(x)}{T}\Big)\Big] = T \log\Big[\sum_j \exp\Big(\frac{g_j(x) + u - u I_{\{j=c\}}}{T}\Big)\Big] - g_c(x) \geq \max_j \big\{ g_j(x) + u - u I_{\{j=c\}} \big\} - g_c(x) \geq u I_{\{\phi(x) \neq c\}}.$$

We thus call $C_{T,u}(g(x), c)$ the multiclass C-loss. Clearly, $C_{T,u}(g(x), c)$ is infinitely smooth and convex in g(x) (see Appendix C for the proof). Moreover, the Hessian matrix of $C_{T,u}(g(x), c)$ w.r.t. g(x) is conditionally positive definite.

4.1. Properties

We now investigate the relationships between the multiclass C-loss $C_{T,u}(g(x), c)$ and the multiclass hinge loss $H_u(g(x), c)$, and between $C_{T,u}(g(x), c)$ and the multiclass coherence loss $G_T(g(x), c)$. In particular, we have the following proposition.

Proposition 11. Let $G_T(g(x), c)$, $H_u(g(x), c)$ and $C_{T,u}(g(x), c)$ be defined by (9), (11) and (14), respectively. Then,

(i) $I_{\{\phi(x) \neq c\}} \leq C_{T,1}(g(x), c) \leq G_T(g(x), c)$;
(ii) $H_u(g(x), c) \leq C_{T,u}(g(x), c) \leq H_u(g(x), c) + T \log m$.

Proposition 12. For any fixed $u > 0$, we have

(i) $\lim_{T \to \infty} \big[ C_{T,u}(g(x), c) - T \log m \big] = \frac{1}{m} \sum_{j \neq c} \big(u + g_j(x) - g_c(x)\big)$ and $\lim_{T \to \infty} \omega_j^c(x) = \frac{1}{m}$ for $j = 1, \ldots, m$.

(ii) $\lim_{T \to 0} C_{T,u}(g(x), c) = H_u(g(x), c)$ and

$$\lim_{T \to 0} \omega_j^c(x) = \begin{cases} 1 & j = \mathrm{argmax}_l \{g_l(x) + u - u I_{\{l=c\}}\}, \\ 0 & \text{otherwise}. \end{cases}$$

(iii) $C_{T,u}(g(x), c)$ is increasing in T.

The proof is given in Appendix C.2. It is worth noting that Proposition 12-(ii) shows that at $T = 0$, $C_{T,1}(g(x), c)$ reduces to the multiclass hinge loss $\xi_c(g(x))$ of Crammer and Singer [5]. Additionally, when $u = 0$, we have

$$C_{T,0}\big(g(x), c\big) = T \log\Big[1 + \sum_{j \neq c} \exp\Big(\frac{g_j(x) - g_c(x)}{T}\Big)\Big],$$

which was proposed by Zhang et al. [29]. When $T = 1$, it is the logit loss $\gamma_c(g(x))$ in (10). Thus, $C_{1,1}(g(x), c)$ bridges the hinge loss $\xi_c(g(x))$ and the logit loss $\gamma_c(g(x))$. Consider that

$$\lim_{T \to 0} C_{T,0}\big(g(x), c\big) = \max_j \big\{ g_j(x) \big\} - g_c(x).$$

This shows that $C_{T,0}(g(x), c)$ does not converge to a majorization of $I_{\{\phi(x) \neq c\}}$ as $T \to 0$. However, in the special case $u = 1$, we have $\lim_{T \to 0} C_{T,1}(g(x), c) = \xi_c(g(x)) \geq I_{\{\phi(x) \neq c\}}$; that is, $C_{T,1}(g(x), c)$ converges to a majorization of $I_{\{\phi(x) \neq c\}}$. In fact, Proposition 12-(ii) implies that for an arbitrary $u > 0$, the limit of $C_{T,u}(g(x), c)$ at $T = 0$ is still a majorization of $I_{\{\phi(x) \neq c\}}$. We thus see an essential difference between $C_{T,1}(g(x), c)$ and $C_{T,0}(g(x), c)$, which are respectively generalizations of the C-loss and the logit loss.

For notational simplicity, here and later we denote $C_{T,1}(g(x), c)$ by $C_T(g(x), c)$. Throughout our analysis in this section, we assume that the maximizing argument $l = \mathrm{argmax}_j g_j(x)$ is unique. This implies that $g_l(x) > g_j(x)$ for $j \neq l$.
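The limits in Proposition 12 can be checked numerically; the sketch below (ours, with toy values) evaluates $C_{T,u}$ at small and large temperatures and confirms the monotonicity in T:

```python
import numpy as np
from scipy.special import logsumexp

def C_Tu(g, c, T, u):
    # Eq. (14)
    bonus = np.full_like(g, u); bonus[c] = 0.0
    return T * logsumexp((g + bonus - g[c]) / T)

def H_u(g, c, u):
    # Eq. (11)
    bonus = np.full_like(g, u); bonus[c] = 0.0
    return np.max(g + bonus) - g[c]

g, c, u, m = np.array([0.5, 1.0, -1.5]), 0, 1.0, 3

# Proposition 12(ii): C_{T,u} -> H_u as T -> 0
print(C_Tu(g, c, 1e-4, u) - H_u(g, c, u))          # ~0

# Proposition 12(i): C_{T,u} - T*log(m) -> (1/m) * sum_{j!=c} (u + g_j - g_c)
T = 1e4
others = np.delete(np.arange(m), c)
print(C_Tu(g, c, T, u) - T * np.log(m),
      np.sum(u + g[others] - g[c]) / m)            # nearly equal

# Proposition 12(iii): increasing in T
Ts = np.linspace(0.05, 5.0, 50)
vals = np.array([C_Tu(g, c, T, u) for T in Ts])
print(np.all(np.diff(vals) > 0))                   # True
```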
Table 1
Summary of the multicategory loss functions w.r.t. (x, c). Here "I", "II" and "III" represent the constrained comparison, pairwise comparison and maximum pairwise comparison settings, respectively.

Hinge:
  I    $\sum_{j \neq c} (1 + g_j(x))_+$   [13]
  II   $\sum_{j \neq c} (1 - g_c(x) + g_j(x))_+$   [25]
  III  $\xi_c(g(x)) = \max_j \{g_j(x) + 1 - I_{\{j=c\}}\} - g_c(x)$   [5]

Coherence:
  I    $L_T(g(x), c) = T \sum_{j \neq c} \log[1 + \exp(\frac{1 + g_j(x)}{T})]$
  II   $G_T(g(x), c) = T \sum_{j \neq c} \log[1 + \exp(\frac{1 + g_j(x) - g_c(x)}{T})]$   [see Eq. (9)]
  III  $C_{T,u}(g(x), c) = T \log[1 + \sum_{j \neq c} \exp(\frac{u + g_j(x) - g_c(x)}{T})]$   [see Eq. (14)]

Logit:
  $\gamma_c(g(x)) = \log[1 + \sum_{j \neq c} \exp(g_j(x) - g_c(x))]$   [see Eq. (10)]

The following theorem shows that the C-loss is Fisher-consistent.

Theorem 13. Assume that $P_c(x) > 0$ for $c = 1, \ldots, m$. Consider the optimization problem

$$\min_{g(x) \in \mathcal{G}} \sum_{c=1}^{m} C_{T,u}\big(g(x), c\big) P_c(x) \qquad (15)$$

for fixed $T > 0$ and $u \geq 0$. Let $\hat{g}(x) = (\hat{g}_1(x), \ldots, \hat{g}_m(x))^T$ be the solution. Then $\hat{g}(x)$ is unique. Moreover, if $P_i(x) < P_j(x)$, then $\hat{g}_i(x) < \hat{g}_j(x)$; hence $C_{T,u}(g(x), c)$ is Fisher-consistent.

In addition, the optimality conditions of (15) yield an explicit expression, (16), of the class conditional probabilities $P_c(x)$ in terms of $\hat{g}(x)$ (see Remark 2 below).

Theorem 14. Assume that $P_c(x) > 0$ for $c = 1, \ldots, m$, and let $P_l(x) = \max_c P_c(x)$.

(1) If $P_l(x) > 1/2$, then

$$\lim_{T \to 0} \hat{g}_c(x) = \begin{cases} u(m-1)/m & \text{if } c = l, \\ -u/m & \text{otherwise}. \end{cases}$$

(2) If $P_l(x) < 1/2$, then $\lim_{T \to 0} \hat{g}_c(x) = 0$ for $c = 1, \ldots, m$.

The proofs of Theorems 13 and 14 are given in Appendix B.5. Theorem 14 shows a very important asymptotic property of the solution $\hat{g}_c(x)$. In particular, when $u = 1$, $\hat{g}_c(x)$ as $T \to 0$ converges to the solution of Problem (5), which is based on the multiclass hinge loss $\xi_c(g(x))$ of Crammer and Singer [5] (see Theorem 8).

Remark 1. We present three multicategory coherence functions $L_T(g(x), c)$, $G_T(g(x), c)$ and $C_T(g(x), c)$. They are respectively upper bounds of the three multicategory hinge losses studied in Section 3.3, so they are majorizations of the 0–1 loss $I_{\{\phi(x) \neq c\}}$. When $m = 2$, these three losses become identical. Our theoretical analysis shows that their limits as the temperature approaches zero are the corresponding hinge losses, and the limits of the minimizers of their expected errors are the minimizers of the expected errors of the corresponding hinge losses (see Theorems 9, 10 and 14). We summarize the multicategory loss functions discussed in the paper in Table 1.

Remark 2. The coherence losses $L_T(g(x), c)$ and $C_T(g(x), c)$ can result in explicit expressions for the class conditional probabilities (see (8) and (16)). Thus, this provides us with an approach to conditional class probability estimation in the multicategory SVMs of Lee et al. [13] and of Crammer and Singer [5]. Roughly speaking, one replaces the solutions of classification models based on the multicategory coherence losses with those of the corresponding multiclass SVMs in (8) and (16), respectively. Based on $G_T(g(x), c)$, however, there does not exist an explicit expression for the class probability similar to (8) or (16). In this case, the above approach for class probability estimation does not apply to the multiclass SVM model of Vapnik [25], Bredensteiner and Bennett [3], Weston and Watkins [26], and Guermeur [12].

Remark 3. An advantage of $C_T(g(x), c)$ over $L_T(g(x), c)$ is that the condition $g(x) \in \mathcal{G}$ is automatically satisfied when developing a classification method. Moreover, we see that the multiclass C-loss $C_T(g(x), c)$ bridges the hinge loss and the