Coherence Functions for Multicategory Margin-based Classification Methods

Zhihua Zhang
College of Comp. Sci. and Tech., Zhejiang University, Zhejiang 310027, China
zhzhang@cs.zju.edu.cn

Michael I. Jordan
Depts. of EECS and Statistics, University of California, Berkeley, Berkeley, CA 94720, USA
jordan@cs.berkeley.edu

Wu-Jun Li and Dit-Yan Yeung
Dept. of Comp. Sci. and Eng., Hong Kong Univ. of Sci. and Tech., Hong Kong, China
{liwujun, dyyeung}@cse.ust.hk

Appearing in Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS) 2009, Clearwater Beach, Florida, USA. Volume 5 of JMLR: W&CP 5. Copyright 2009 by the authors.

Abstract

Margin-based classification methods are typically devised based on a majorization-minimization procedure, which approximately solves an otherwise intractable minimization problem defined with the 0-1 loss. The extension of such methods from the binary classification setting to the more general multicategory setting turns out to be non-trivial. In this paper, our focus is to devise margin-based classification methods that can be seamlessly applied to both settings, with the binary setting simply as a special case. In particular, we propose a new majorization loss function that we call the coherence function, and then devise a new multicategory margin-based boosting algorithm based on the coherence function. Analogous to deterministic annealing, the coherence function is characterized by a temperature factor. It is closely related to the multinomial log-likelihood function and its limit at zero temperature corresponds to a multicategory hinge loss function.

1 Introduction

Margin-based classification methods have become increasingly popular since the advent of the support vector machine (SVM) (Cortes and Vapnik, 1995) and boosting (Freund, 1995; Freund and Schapire, 1997). These algorithms were originally designed for binary classification problems. Unfortunately, extension of them to the multicategory setting has been found to be non-trivial.

A variety of ad hoc extensions of the binary SVM and boosting to multicategory classification problems have been studied. These include one-versus-rest, one-versus-one, error-correcting codes, and pairwise coupling (Allwein et al., 2000). Among these methods, one-versus-rest has been the dominant approach. The basic idea is to train m binary classifiers for an m-class (m ≥ 2) problem so that each classifier learns to discriminate one class from the rest. However, optimality achieved for each of the m independent binary problems does not readily guarantee optimality for the original m-class problem.

The goal of this paper is to solve multicategory classification problems using the same margin principle as that for binary problems. Of crucial concern are the statistical properties (Bartlett et al., 2006; Tewari and Bartlett, 2007; Zhang, 2004) of a majorization function for the original 0-1 loss function. In particular, we analyze the Fisher-consistency properties (Zou et al., 2008) of extant majorization functions, which are built on the exponential, logit and hinge functions. This analysis inspires us to propose a new majorization function, which we call the coherence function.

The coherence function is attractive because it is a Fisher-consistent majorization of the 0-1 loss. Also, one limiting version of it is just the multicategory hinge loss function of Crammer and Singer (2001), and its relationship with the multinomial log-likelihood function is very clear. Moreover, this function is differentiable and convex. Thus it is very appropriate for use in the development of multicategory margin-based classification methods, especially boosting algorithms. Friedman et al. (2000) and Zou et al. (2008) proposed the multicategory LogitBoost and GentleBoost algorithms based on the multinomial log-likelihood function and the exponential loss function, respectively. We propose in this paper a new multicategory GentleBoost algorithm based on our coherence function.

The rest of this paper is organized as follows. Section 2 presents theoretical discussions of extant loss
functions for multicategory margin-based classification methods. Section 3 proposes the coherence function and discusses its statistical properties. Section 4 devises a multicategory margin-based boosting algorithm using the coherence function. An experimental analysis is presented in Section 5 and concluding remarks are given in Section 6. Some proofs of the theoretical results are left to the appendices.

2 Problem Formulation

We are given an m-class (m ≥ 2) classification problem with a set of training data $\{(x_i, c_i)\}_{i=1}^{n}$, where $x_i \in \mathcal{X} \subset \mathbb{R}^d$ is an input vector and $c_i \in \{1, 2, \ldots, m\}$ is its corresponding class label. We assume that each x belongs to one and only one class. Our goal is to find a classifier $\phi(x): x \to c \in \{1, \ldots, m\}$.

Let $P_c(x) = P(C = c \mid X = x)$, $c = 1, \ldots, m$, be the class probabilities given x. The expected error at x is then defined by $\sum_{c=1}^{m} I_{[\phi(x)\neq c]} P_c(x)$, where $I_{[\#]} = 1$ if $\#$ is true and 0 otherwise. The empirical error on the training data is given by

$$\epsilon = \frac{1}{n} \sum_{i=1}^{n} I_{[\phi(x_i)\neq c_i]}.$$

Since $\epsilon$ is equal to its minimum value of zero when all the training data points are correctly classified, we wish to use $\epsilon$ as a basis for devising multicategory classification algorithms.

2.1 Multicategory Margins

Suppose the classifier is modeled using an m-vector $g(x) = (g_1(x), \ldots, g_m(x))^T$, where the induced classifier is obtained via maximization in a manner akin to discriminant analysis: $\phi(x) = \operatorname{argmax}_j \{g_j(x)\}$. For simplicity of analysis, we assume that the maximizing argument of $\max_j g_j(x)$ is unique. Of course this does not imply that the maximum value is unique; indeed, adding a constant to each component $g_j(x)$ does not change the maximizing argument. To remove this redundancy, it is convenient to impose a sum-to-zero constraint. Thus we define

$$\mathcal{G} = \Big\{ (g_1(x), \ldots, g_m(x))^T \;\Big|\; \sum_{j=1}^{m} g_j(x) = 0 \Big\}$$

and assume $g(x) \in \mathcal{G}$. Zou et al. (2008) referred to the vectors in $\mathcal{G}$ as multicategory margin vectors.

Since a margin vector g(x) induces a classifier, we explore the minimization of $\epsilon$ with respect to (w.r.t.) g(x). However, this minimization problem is intractable because $I_{[\phi(x)\neq c]}$ is the 0-1 function. Various tractable surrogate loss functions $\zeta(g(x), c)$ are thus used to approximate $I_{[\phi(x)\neq c]}$. The corresponding population and empirical risk functions are given by

$$R(P, g) = E_X \Big[ \sum_{c=1}^{m} \zeta(g(x), c) P_c(x) \Big], \qquad \hat{R}(g) = \frac{1}{n} \sum_{i=1}^{n} \zeta(g(x_i), c_i),$$

where $E_X(\cdot)$ is the expectation taken w.r.t. the distribution of X.

If $\alpha$ is a positive constant that does not depend on (x, c), $\operatorname{argmin}_{g(x)\in\mathcal{G}} \frac{1}{\alpha}\hat{R}(g)$ is equivalent to $\operatorname{argmin}_{g(x)\in\mathcal{G}} \hat{R}(g)$. We thus present the following definition.

Definition 1 A surrogate loss $\zeta(g(x), c)$ is said to be a majorization of $I_{[\phi(x)\neq c]}$ w.r.t. (x, c) if $\zeta(g(x), c) \geq \alpha I_{[\phi(x)\neq c]}$, where $\alpha$ is a positive constant that does not depend on (x, c).

Given a majorization function $\zeta(g(x), c)$, the classifier resulting from the minimization of $\hat{R}(g)$ w.r.t. the margin vector g(x) is called a margin-based classifier or a margin-based classification method. Therefore, a margin-based classifier corresponds to a so-called majorization-minimization procedure. In the binary classification setting, a wide variety of classifiers can be understood as minimizers of a majorization loss function of the 0-1 loss. If such functions satisfy other technical conditions, the resulting classifiers can be shown to be Bayes consistent (Bartlett et al., 2006).
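To make the preceding setup concrete, the following short Python sketch (our own illustration, not from the paper; all function names are ours) evaluates the induced classifier $\phi(x) = \operatorname{argmax}_j g_j(x)$, the empirical 0-1 error $\epsilon$, and an empirical surrogate risk $\hat{R}(g)$ on toy margin vectors projected onto the sum-to-zero set $\mathcal{G}$. For illustration, the surrogate is the hinge-type loss of Crammer and Singer discussed later in the paper.

```python
import numpy as np

def induced_classifier(G):
    """phi(x) = argmax_j g_j(x); G is an (n, m) array of margin vectors."""
    return np.argmax(G, axis=1)

def empirical_error(G, y):
    """Empirical 0-1 error (1/n) sum_i I[phi(x_i) != c_i]; y holds labels in {0, ..., m-1}."""
    return np.mean(induced_classifier(G) != y)

def to_margin_vectors(G):
    """Project each row onto the sum-to-zero constraint defining the margin-vector set G."""
    return G - G.mean(axis=1, keepdims=True)

def empirical_surrogate_risk(G, y, loss):
    """hat R(g) = (1/n) sum_i zeta(g(x_i), c_i) for a surrogate loss zeta."""
    return np.mean([loss(G[i], y[i]) for i in range(len(y))])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m = 200, 4
    G = to_margin_vectors(rng.normal(size=(n, m)))   # toy margin vectors in G
    y = rng.integers(0, m, size=n)                   # toy class labels
    # Illustrative surrogate: the hinge-type loss max_j {g_j + 1 - I[j=c]} - g_c.
    hinge = lambda g, c: max(g[j] + 1.0 - (j == c) for j in range(m)) - g[c]
    print("empirical 0-1 error:", empirical_error(G, y))
    print("empirical hinge-type risk:", empirical_surrogate_risk(G, y, hinge))
```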
It seems reasonable to pursue a similar development in the case of multicategory classification, and indeed such a proposal has been made by Zou et al. (2008) (see also Tewari and Bartlett (2007); Zhang (2004)). The following definition refines the definition of Zou et al. (2008). (Specifically, we do not require that the function $\zeta(g(x), c)$ depends only on $g_c(x)$.)

Definition 2 A surrogate function $\zeta(g(x), c)$ is said to be Fisher consistent w.r.t. a margin vector $g(x) = (g_1(x), \ldots, g_m(x))^T$ at x if (i) the following risk minimization problem

$$\hat{g}(x) = \operatorname*{argmin}_{g(x)\in\mathcal{G}} \sum_{c=1}^{m} \zeta(g(x), c) P_c(x) \qquad (1)$$

has a unique solution $\hat{g}(x) = (\hat{g}_1(x), \ldots, \hat{g}_m(x))^T$; and (ii)

$$\operatorname*{argmax}_{c} \hat{g}_c(x) = \operatorname*{argmax}_{c} P_c(x).$$

2.2 Multicategory Losses

Zou et al. (2008) derived multicategory boosting algorithms by using $\zeta(g(x), c) = \exp(-g_c(x))$. In their
discrete boosting algorithms, the margin vector g(x) is modeled as an m-vector function with one and only one positive element. In this case, $I_{[g_c(x)\leq 0]}$ is equal to $I_{[\phi(x)\neq c]}$. Consequently, $\exp(-g_c(x))$ is a majorization of $I_{[\phi(x)\neq c]}$. Therefore, the discrete AdaBoost algorithms of Zou et al. (2008) still approximate the original empirical 0-1 loss function. However, in the general case, $\exp(-g_c(x))$ is not a majorization of $I_{[\phi(x)\neq c]}$. Thus the multicategory GentleBoost algorithm of Zou et al. (2008) is not a margin-based method.

Friedman et al. (2000) proposed a multicategory LogitBoost algorithm by using the negative multinomial log-likelihood function, which is given by

$$L(g(x), c) = \log \sum_{j=1}^{m} \exp\big(g_j(x) - g_c(x)\big) = \log\Big[1+\sum_{j\neq c} \exp\big(g_j(x)-g_c(x)\big)\Big] \qquad (2)$$

at (x, c). Although $\log[1+\exp(-g_c(x))]$ is an upper bound of $\log(2)\, I_{[g_c(x)\leq 0]}$, it is not a majorization of $I_{[\phi(x)\neq c]}$. However, $L(g(x), c)$ is a majorization of $I_{[\phi(x)\neq c]}$ because of $L(g(x), c) \geq \log(2)\, I_{[\phi(x)\neq c]}$. Thus, the multicategory LogitBoost algorithm (Friedman et al., 2000) is also a margin-based method. It is worth noting that $\log[1+\exp(-g_c(x))]$ is a majorization of $I_{[\phi(x)\neq c]}$ if the margin vector g(x) has only one positive element. Unfortunately, when this majorization as well as $L(g(x), c)$ are used to derive multicategory discrete boosting algorithms, a closed-form solution no longer exists.

In the case of the multicategory SVM algorithm, Crammer and Singer (2001) used the surrogate:

$$H(g(x), c) = \max_j\{g_j(x) + 1 - I_{[j=c]}\} - g_c(x). \qquad (3)$$

It is easily seen that

$$I_{[\phi(x)\neq c]} = I_{[\exists j\neq c,\; g_j(x)-g_c(x)>0]} \leq \max_j\big\{g_j(x) + 1 - I_{[j=c]}\big\} - g_c(x).$$

This shows that $H(g(x), c)$ is a majorization of $I_{[\phi(x)\neq c]}$, but it is Fisher consistent only when $\max_l P_l(x) > 1/2$ (Zhang, 2004).

3 Coherence Functions

Since hinge-type loss functions are not smooth, existing multicategory SVMs do not directly estimate the class probability $P_c(x)$. Moreover, it is rare to devise a boosting algorithm with them. However, logistic regression extends naturally from binary classification to multicategory classification by simply using the multinomial likelihood in place of the binomial likelihood. In this section, we present a smooth and Fisher-consistent majorization loss, which bridges hinge-type losses and the negative multinomial log-likelihood. Thus, it is applicable to the construction of multicategory margin-based classifiers.

3.1 Definition

In order to obtain a majorization function of $I_{[\phi(x)\neq c]}$, we express $\max_j\{g_j(x) + 1 - I_{[j=c]}\}$ as $\sum_{j=1}^{m} \beta_j^c(x)\big[1 + g_j(x) - I_{[j=c]}\big]$, where

$$\beta_j^c(x) = \begin{cases} 1 & j = \operatorname{argmax}_l\{g_l(x)+1-I_{[l=c]}\}, \\ 0 & \text{otherwise}. \end{cases}$$

Motivated by the idea behind deterministic annealing (Rose et al., 1990), we relax this hard function $\beta_j^c(x)$, retaining only $\beta_j^c(x) > 0$ and $\sum_{j=1}^{m} \beta_j^c(x) = 1$. With respect to a soft $\beta_j^c(x)$ respecting these constraints, we maximize $\sum_{j=1}^{m} \beta_j^c(x)\big[1 + g_j(x) - I_{[j=c]}\big]$ under an entropy constraint, namely,

$$\max_{\{\beta_j^c(x)\}} \Big\{ F = \sum_{j=1}^{m} \beta_j^c(x)\big[1 + g_j(x) - I_{[j=c]}\big] - T \sum_{j=1}^{m} \beta_j^c(x) \log \beta_j^c(x) \Big\}, \qquad (4)$$

where we refer to T > 0 as a temperature.

The maximization of F w.r.t. $\beta_j^c(x)$ is straightforward, and it gives rise to the following distribution

$$\beta_j^c(x) = \frac{\exp\frac{1+g_j(x)-I_{[j=c]}}{T}}{\sum_{l} \exp\frac{1+g_l(x)-I_{[l=c]}}{T}}. \qquad (5)$$

The corresponding maximum of F is obtained by plugging (5) back into (4):

$$F^* = T \log \sum_{j} \exp\Big[\frac{1 + g_j(x) - I_{[j=c]}}{T}\Big].$$
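As a numerical sanity check on this derivation (our own illustration, not part of the paper), the sketch below evaluates the objective F in (4) at the closed-form weights (5) and at random points of the probability simplex; the softmax weights attain the value F*, and random weights never exceed it.

```python
import numpy as np

def F(beta, g, c, T):
    """Objective (4): sum_j beta_j (1 + g_j - I[j=c]) - T sum_j beta_j log beta_j."""
    a = 1.0 + g - (np.arange(len(g)) == c)
    return float(beta @ a - T * np.sum(beta * np.log(beta)))

def beta_soft(g, c, T):
    """Closed-form maximizer (5): softmax of (1 + g_j - I[j=c]) / T."""
    a = (1.0 + g - (np.arange(len(g)) == c)) / T
    e = np.exp(a - a.max())            # subtract the max for numerical stability
    return e / e.sum()

def F_star(g, c, T):
    """Maximum value: F* = T log sum_j exp((1 + g_j - I[j=c]) / T)."""
    a = (1.0 + g - (np.arange(len(g)) == c)) / T
    return float(T * (np.log(np.sum(np.exp(a - a.max()))) + a.max()))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    g, c, T = rng.normal(size=5), 2, 0.7
    print("F at softmax weights:", F(beta_soft(g, c, T), g, c, T), " F*:", F_star(g, c, T))
    for _ in range(5):                 # random simplex points never do better
        r = rng.dirichlet(np.ones(5))
        print("F at random beta:", F(r, g, c, T), "<=", F_star(g, c, T))
```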
Note that for T > 0 we have

$$T \log\Big[1+\sum_{j\neq c} \exp\frac{1+g_j(x)-g_c(x)}{T}\Big] = T \log \sum_{j} \exp\Big(\frac{1 + g_j(x) - I_{[j=c]}}{T}\Big) - g_c(x) \geq \max_j\big\{g_j(x)+1-I_{[j=c]}\big\} - g_c(x) \geq I_{[\phi(x)\neq c]}.$$

This thus leads us to the following family of majorization functions of $I_{[\phi(x)\neq c]}$:

$$C(g(x), c) = T \log\Big[1+\sum_{j\neq c} \exp\frac{1+g_j(x)-g_c(x)}{T}\Big], \quad T > 0. \qquad (6)$$
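The inequality chain above is easy to verify numerically. The sketch below (an illustrative implementation of ours, with log-sum-exp stabilization that is not part of the paper) computes $C(g(x), c)$ from (6), the hinge surrogate $H(g(x), c)$ from (3), and the 0-1 loss, and checks $C \geq H \geq I_{[\phi(x)\neq c]}$ on random margin vectors.

```python
import numpy as np

def coherence_loss(g, c, T):
    """C(g, c) = T log(1 + sum_{j != c} exp((1 + g_j - g_c) / T)), eq. (6)."""
    z = np.array([(1.0 + g[j] - g[c]) / T for j in range(len(g)) if j != c])
    s = max(z.max(), 0.0)                         # log(1 + sum exp(z)) computed stably
    return float(T * (s + np.log(np.exp(-s) + np.sum(np.exp(z - s)))))

def hinge_loss(g, c):
    """H(g, c) = max_j {g_j + 1 - I[j=c]} - g_c, eq. (3)."""
    a = g + 1.0 - (np.arange(len(g)) == c)
    return float(a.max() - g[c])

def zero_one(g, c):
    """I[phi(x) != c] with phi(x) = argmax_j g_j."""
    return float(np.argmax(g) != c)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    for _ in range(1000):
        m = int(rng.integers(2, 6))
        g = rng.normal(size=m); g -= g.mean()     # a margin vector in G
        c, T = int(rng.integers(m)), float(rng.uniform(0.05, 2.0))
        C, H, E = coherence_loss(g, c, T), hinge_loss(g, c), zero_one(g, c)
        assert C >= H - 1e-9 and H >= E - 1e-9, (C, H, E)   # the chain shown above (6)
    print("C(g, c) >= H(g, c) >= I[phi(x) != c] held on all random draws")
```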
We refer to these functions as coherence functions due to their statistical mechanical properties, which are similar to those of deterministic annealing (Rose et al., 1990). Note that the coherence function is also a majorization of the multicategory hinge loss $H(g(x), c)$ in (3). When T = 1, we have

$$C(g(x), c) = \log\Big[1+\sum_{j\neq c} \exp\big(1+g_j(x)-g_c(x)\big)\Big],$$

which is just an upper bound of the negative multinomial log-likelihood function $L(g(x), c)$ in (2).

In the binary case, i.e. m = 2, we let $g_1(x) = -g_2(x) = \frac{1}{2} f(x)$ and encode y = 1 if c = 1 and y = −1 if c = 2. We can thus express the coherence function as

$$C(yf(x)) = T \log\Big[1+ \exp\frac{1-yf(x)}{T}\Big], \quad T > 0. \qquad (7)$$

Figure 1 depicts the coherence function (T = 1) and other common loss functions for m = 2.

[Figure 1: A variety of loss functions, regarded as functions of yf(x). Here T = 1 in the coherence loss. Logit loss: $\frac{1}{\log 2}\log[1+\exp(-yf(x))]$; Exponential loss: $\exp(-yf(x))$; Hinge loss: $[1-yf(x)]_+$, where $[u]_+ = u$ if $u \geq 0$ and $[u]_+ = 0$ otherwise.]

3.2 Properties

The following theorem shows that the coherence function is Fisher consistent.

Theorem 1 Assume $P_c(x) > 0$ for $c = 1, \ldots, m$. Consider the optimization problem

$$\operatorname*{argmin}_{g(x)\in\mathcal{G}} \sum_{c=1}^{m} T \log\Big[1+\sum_{j\neq c} \exp\frac{1+g_j(x)-g_c(x)}{T}\Big] P_c(x)$$

for a fixed T > 0 and let $\hat{g}(x) = (\hat{g}_1(x), \ldots, \hat{g}_m(x))^T$ be its solution. Then $\hat{g}(x)$ is unique. Moreover, if $P_i(x) < P_j(x)$, we have $\hat{g}_i(x) < \hat{g}_j(x)$. Furthermore, after having obtained $\hat{g}(x)$, $P_c(x)$ is given by

$$P_c(x) = \frac{\sum_{l=1}^{m} \exp\frac{1+\hat{g}_l(x)+\hat{g}_c(x)-I_{[l=c]}}{T}}{\sum_{j=1}^{m}\sum_{l=1}^{m} \exp\frac{1+\hat{g}_l(x)+\hat{g}_j(x)-I_{[l=j]}}{T}}. \qquad (8)$$

Moreover, we have the following properties.

Theorem 2 Let $H(g(x), c)$, $\beta_j^c(x)$ and $C(g(x), c)$ be defined by (3), (5) and (6), respectively. Then,

$$S(g(x), c) \leq C(g(x), c) - T\log m \leq H(g(x), c),$$

where

$$S(g(x), c) = \frac{1}{m}\sum_{j\neq c}\big[1 + g_j(x) - g_c(x)\big].$$

Importantly, when treating g(x) as fixed and considering $\beta_j^c(x)$ and $C(g(x), c)$ as functions of T, we have

Theorem 3 Under the conditions in Theorem 2, for a fixed g(x) we have

(i) $\lim_{T\to\infty} C(g(x), c) - T\log m = S(g(x), c)$ and $\lim_{T\to\infty} \beta_j^c(x) = \frac{1}{m}$ for $j = 1, \ldots, m$;

(ii) $\lim_{T\to 0} C(g(x), c) = H(g(x), c)$ and $\lim_{T\to 0}\beta_j^c(x) = 1$ if $j = \operatorname{argmax}_l\{g_l(x)+1-I_{[l=c]}\}$ and 0 otherwise.

It is worth noting that Theorem 3-(ii) shows that at T = 0, $C(g(x), c)$ reduces to the multicategory hinge loss $H(g(x), c)$, which is used by Crammer and Singer (2001).

As an immediate corollary of Theorems 2 and 3 in the binary case (m = 2), we have

Corollary 1 Let $C(yf(x))$ be defined by (7). Then

(i) $[1-yf(x)]_+ \leq C(yf(x)) \leq T\log 2 + [1-yf(x)]_+$;

(ii) $\lim_{T\to 0} C(yf(x)) = [1-yf(x)]_+$;

(iii) $\frac{1}{2}(1-yf(x)) \leq C(yf(x)) - T\log 2$;

(iv) $\lim_{T\to\infty} C(yf(x)) - T\log 2 = \frac{1}{2}(1-yf(x))$.

Graphs of $C(yf(x))$ with different values of T are shown in Figure 2. We can see that $C(yf(x))$ with T = 0.01 is almost the same as the hinge loss $[1-yf(x)]_+$.
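Corollary 1 can likewise be illustrated numerically. The following small sketch (ours, not the authors' code) evaluates the binary coherence loss (7) on a grid of margins yf(x) and checks that it collapses to the hinge loss as T → 0 and that $C(yf(x)) - T\log 2$ approaches $\frac{1}{2}(1-yf(x))$ for large T.

```python
import numpy as np

def C_binary(v, T):
    """Binary coherence loss, eq. (7): C(y f(x)) = T log(1 + exp((1 - v)/T)), with v = y f(x)."""
    return T * np.logaddexp(0.0, (1.0 - v) / T)     # stable log(1 + exp(.))

def hinge(v):
    return np.maximum(1.0 - v, 0.0)

if __name__ == "__main__":
    v = np.linspace(-2.0, 2.0, 9)                   # a grid of margins y f(x)
    for T in (1.0, 0.1, 0.01):
        gap = np.max(np.abs(C_binary(v, T) - hinge(v)))
        print(f"T={T:5.2f}  max |C - hinge| = {gap:.4f}")        # -> 0 as T -> 0, Corollary 1 (ii)
    T = 1e4
    gap = np.max(np.abs(C_binary(v, T) - T * np.log(2.0) - 0.5 * (1.0 - v)))
    print(f"T={T:.0f}  max |C - T log 2 - (1 - v)/2| = {gap:.6f}")  # -> 0 as T grows, Corollary 1 (iv)
```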
Wang et al. (2005) derived an annealed discriminant analysis algorithm in which the loss function is

$$A(g(x), c) = T \log\Big[1+\sum_{j\neq c} \exp\frac{g_j(x)-g_c(x)}{T}\Big], \quad T > 0.$$

Thus, the negative multinomial log-likelihood function $L(g(x), c)$ and the conventional logistic regression are respectively the special cases of $A(g(x), c)$ and the annealed discriminant analysis algorithm with T = 1 (also refer to Zhang and Oles (2001) for the binary case). However, since

$$\lim_{T\to 0} A(g(x), c) = \max_j\big(g_j(x) - g_c(x)\big),$$

it is no longer guaranteed that $A(g(x), c)$ is always a majorization of $I_{[\phi(x)\neq c]}$ for any T > 0.

[Figure 2: Coherence functions $C(yf(x))$ with T = 1, T = 0.1 and T = 0.01, plotted against yf(x); the hinge loss is shown for comparison.]

4 The GentleBoost Algorithm

In this section we apply the coherence function to the development of multicategory margin-based boosting algorithms. Like the negative multinomial log-likelihood function, when the coherence function is used to devise multicategory discrete boosting algorithms, a closed-form solution no longer exists. We instead use the coherence function to devise a genuine multicategory margin-based boosting algorithm. With a derivation similar to that in Friedman et al. (2000); Zou et al. (2008), our GentleBoost algorithm is shown in Algorithm 1.

5 Experimental Evaluation

We compare our algorithm (called GentleBoost.C) with some representative multicategory boosting algorithms, including AdaBoost.MH (Schapire and Singer, 1999), multicategory LogitBoost (MulLogitBoost) (Friedman et al., 2000) and multicategory GentleBoost (GentleBoost.E) (Zou et al., 2008), on six publicly available datasets (Vowel, Waveform, Image Segmentation, Optdigits, Pendigits and Satimage) from the UCI Machine Learning Repository. Following the settings in Friedman et al. (2000); Zou et al. (2008), we use predefined training samples and test samples for these six datasets. Summary information for the datasets is given in Table 1.

Based on the experimental strategy in Zou et al. (2008), eight-node regression trees are used as weak learners for all the boosting algorithms with the exception of AdaBoost.MH, which is based on eight-node classification trees. In the experiments, we observe that the performance of all the methods becomes stable after about 50 boosting steps. Hence, the number of boosting steps for all the methods is set to 100 (H = 100) in all the experiments. The test error rates (in %) of all the boosting algorithms are shown in Table 2, from which we can see that all the boosting methods achieve much better results than CART, and our method slightly outperforms the other boosting algorithms. Among all the datasets tested, Vowel and Waveform are the most difficult for classification. The notably better performance of our method on these two datasets reveals its promising properties. Figure 3 depicts the test error curves of MulLogitBoost, GentleBoost.E and GentleBoost.C on these two datasets.

Theorem 3 shows that as T → 0, $C(g(x), c)$ approaches $\max_j\{g_j(x) + 1 - I_{[j=c]}\} - g_c(x)$. This encourages us to try to decrease T gradually over the boosting steps. However, when T gets very small, it can lead to numerical problems and often makes the algorithm unstable. The experiments show that when T takes a value in [0.1, 2], our algorithm is always able to obtain promising performance. Here our reported results are based on the setting of T = 1. Recall that $L(g(x), c)$ is the special case of $A(g(x), c)$ with T = 1, so the comparison of GentleBoost.C with MulLogitBoost is fair based on T = 1.

As we established in Section 2.2, GentleBoost.E does not implement a margin-based decision because the loss function used in this algorithm is not a majorization of the 0-1 loss. Our experiments show that MulLogitBoost and GentleBoost.C are competitive, and outperform GentleBoost.E.
Algorithm 1 GentleBoost.C($\{(x_i, c_i)\}_{i=1}^{n} \subset \mathbb{R}^p \times \{1, \ldots, m\}$, T, H)

1: Start with uniform weights $w_{ij} = 1/n$ for $i = 1, \ldots, n$ and $j = 1, \ldots, m$, and $\beta_j(x) = 1/m$ and $g_j(x) = 0$ for $j = 1, \ldots, m$.
2: Repeat for h = 1 to H:
   (a) Repeat for j = 1, ..., m:
       (i) Compute working responses and weights in the jth class,
           $$z_{ij} = \frac{I_{[j=c_i]} - \beta_j(x_i)}{\beta_j(x_i)\big(1-\beta_j(x_i)\big)}, \qquad w_{ij} = \beta_j(x_i)\big(1-\beta_j(x_i)\big).$$
       (ii) Fit the regression function $g_j^{(h)}(x)$ by a weighted least-squares fit of the working response $z_{ij}$ to $x_i$ with weights $w_{ij}$ on the training data.
       (iii) Set $g_j(x) \leftarrow g_j(x) + g_j^{(h)}(x)$.
   (b) Set $g_j(x) \leftarrow \frac{m-1}{m}\big[g_j(x) - \frac{1}{m}\sum_{l=1}^{m} g_l(x)\big]$ for $j = 1, \ldots, m$.
   (c) Compute $\beta_j(x_i)$ for $j = 1, \ldots, m$ as
       $$\beta_j(x_i) = \begin{cases} \dfrac{\exp\frac{1+g_j(x_i)-g_{c_i}(x_i)}{T}}{1+\sum_{l\neq c_i} \exp\frac{1+g_l(x_i)-g_{c_i}(x_i)}{T}} & \text{if } j \neq c_i, \\[2ex] \dfrac{1}{1+\sum_{l\neq c_i} \exp\frac{1+g_l(x_i)-g_{c_i}(x_i)}{T}} & \text{if } j = c_i. \end{cases}$$
3: Output $\phi(x) = \operatorname{argmax}_j g_j(x)$.

Table 1: Summary of benchmark datasets.

Dataset        # Train   # Test   # Features   # Classes
Vowel              528      462           10          11
Waveform           300     4700           21           3
Segmentation       210     2100           19           7
Optdigits         3823     1797           64          10
Pendigits         7494     3498           16          10
Satimage          4435     2000           36           6

Table 2: Test error rates of our method and related methods (in %). The best result for each dataset is marked with an asterisk.

Dataset        CART    AdaBoost.MH   MulLogitBoost   GentleBoost.E   GentleBoost.C
Vowel          54.10         50.87           49.13           50.43          47.62*
Waveform       31.60         18.22           17.23           17.62          16.53*
Segmentation    9.80          5.29            4.10            4.52           4.05*
Optdigits      16.60          5.18            3.28            5.12           3.17*
Pendigits       8.32          5.86            3.12*           3.95           3.14
Satimage       14.80         10.00            9.25           12.00           8.75*

6 Conclusion

In this paper, we have proposed a novel majorization function and a multicategory boosting algorithm based on this function. The majorization function is Fisher consistent, differentiable and convex. Thus, it is appropriate for the design of margin-based boosting algorithms. While our main focus has been theoretical, we have also shown experimentally that our boosting algorithm is effective, although it will be necessary to investigate its empirical performance more extensively.

Owing to the relationship of our majorization function with the hinge loss and the negative multinomial log-likelihood function, it is also natural to use the coherence function to devise a multicategory margin-based classifier as an alternative to existing multicategory SVMs and multinomial logistic regression models.
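Before the appendices, and for readers who want to experiment with Algorithm 1 above, the following Python sketch is one possible reimplementation using scikit-learn regression trees as the weak learners. It is our own reconstruction under stated assumptions, not the authors' code: we approximate the paper's eight-node trees with `max_leaf_nodes=8`, clip β away from 0 and 1 for numerical stability, and assume labels are encoded as 0, ..., m−1.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class GentleBoostC:
    """Sketch of Algorithm 1 (GentleBoost.C) with regression-tree weak learners."""

    def __init__(self, T=1.0, H=100, max_leaf_nodes=8, eps=1e-6):
        self.T, self.H, self.max_leaf_nodes, self.eps = T, H, max_leaf_nodes, eps

    def _betas(self, G, y):
        # Step 2(c): beta_j(x_i); equivalently a softmax over (1 + g_j - g_{c_i} - I[j=c_i]) / T.
        n = G.shape[0]
        A = (1.0 + G - G[np.arange(n), y][:, None]) / self.T
        A[np.arange(n), y] = 0.0                    # the j = c_i entry has numerator 1 = exp(0)
        A -= A.max(axis=1, keepdims=True)           # log-sum-exp shift for stability
        E = np.exp(A)
        return E / E.sum(axis=1, keepdims=True)

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        n, self.m = X.shape[0], int(y.max()) + 1
        G = np.zeros((n, self.m))
        B = np.full((n, self.m), 1.0 / self.m)      # step 1: beta_j(x) = 1/m, g_j(x) = 0
        self.trees_ = []
        for _ in range(self.H):
            stage = []
            for j in range(self.m):
                b = np.clip(B[:, j], self.eps, 1.0 - self.eps)
                w = b * (1.0 - b)                                   # step 2(a)(i): weights
                z = ((y == j).astype(float) - b) / w                # working responses
                tree = DecisionTreeRegressor(max_leaf_nodes=self.max_leaf_nodes)
                tree.fit(X, z, sample_weight=w)                     # step 2(a)(ii)
                G[:, j] += tree.predict(X)                          # step 2(a)(iii)
                stage.append(tree)
            G = (self.m - 1.0) / self.m * (G - G.mean(axis=1, keepdims=True))  # step 2(b)
            B = self._betas(G, y)                                   # step 2(c)
            self.trees_.append(stage)
        return self

    def decision_function(self, X):
        X = np.asarray(X)
        G = np.zeros((X.shape[0], self.m))
        for stage in self.trees_:                   # replay the same accumulation as in fit
            for j, tree in enumerate(stage):
                G[:, j] += tree.predict(X)
            G = (self.m - 1.0) / self.m * (G - G.mean(axis=1, keepdims=True))
        return G

    def predict(self, X):
        return np.argmax(self.decision_function(X), axis=1)         # step 3
```

A call such as `GentleBoostC(T=1.0, H=100).fit(X_train, y_train).predict(X_test)` would mimic the experimental setting reported above, although exact error rates will depend on the tree learner and on these implementation choices.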
[Figure 3: Test error rates versus boosting steps for GentleBoost.E, MulLogitBoost and GentleBoost.C. (a) Vowel; (b) Waveform.]

A Proof of Theorem 1

A map $L: \Omega \to \mathbb{R}$, where $\Omega$ is a normal function space defined over $\mathbb{R}^d$, is said to be Gateaux differentiable at $g(x) \in \Omega$ if for every fixed $h \in \Omega$ there exists

$$L'(g(x)) = \lim_{t\to 0} \frac{L(g(x) + t h) - L(g(x))}{t}.$$

In our derivation, for notational simplicity, we omit x in the functions and denote $L'(g_j(x))$ by $\frac{\partial L}{\partial g_j}$. Without loss of generality, we let T = 1 in the following derivation. Consider the following Lagrangian

$$L = \sum_{j=1}^{m} \log\Big[1 + \sum_{l\neq j} \exp(1 + g_l - g_j)\Big] P_j + \lambda \sum_{j=1}^{m} g_j$$

and calculate the first and second derivatives of L w.r.t. the $g_c$ as

$$\begin{aligned}
\frac{\partial L}{\partial g_c} &= -\frac{\sum_{l\neq c} \exp(1+g_l-g_c)}{1+\sum_{l\neq c} \exp(1+g_l-g_c)} P_c + \sum_{j\neq c} \frac{\exp(1+g_c-g_j)}{1+\sum_{l\neq j} \exp(1+g_l-g_j)} P_j + \lambda \\
&= -\frac{\sum_{l=1}^{m} \exp(1+g_l-g_c)}{1+\sum_{l\neq c} \exp(1+g_l-g_c)} P_c + \sum_{j=1}^{m} \frac{\exp(1+g_c-g_j)}{1+\sum_{l\neq j} \exp(1+g_l-g_j)} P_j + \lambda \\
&= -\sum_{l=1}^{m} \beta_{cl} P_c + \sum_{j=1}^{m} \beta_{jc} P_j + \lambda,
\end{aligned}$$

$$\frac{\partial^2 L}{\partial g_c \partial g_k} = -\beta_{ck} P_c + \sum_{l\neq c} \beta_{cl}\beta_{ck} P_c - \beta_{kc} P_k + \beta_{kc}\sum_{l\neq k} \beta_{kl} P_k - \sum_{j\neq c,k} \beta_{jc}\beta_{jk} P_j = -\sum_{j=1}^{m} \beta_{jc}\beta_{jk} P_j$$

for $k \neq c$, and

$$\frac{\partial^2 L}{\partial g_c \partial g_c} = \sum_{j=1}^{m} \beta_{jc}(1-\beta_{jc}) P_j,$$

where

$$\beta_{cc} = \frac{1}{1 + \sum_{l\neq c} \exp(1 + g_l - g_c)}, \qquad \beta_{cj} = \frac{\exp(1 + g_j - g_c)}{1 + \sum_{l\neq c} \exp(1 + g_l - g_c)}.$$

We denote $\Delta_j = \mathrm{diag}(\beta_{j1}, \beta_{j2}, \ldots, \beta_{jm})$ and $\boldsymbol{\beta}_j = (\beta_{j1}, \beta_{j2}, \ldots, \beta_{jm})^T$. The Hessian matrix is

$$H \triangleq \frac{\partial^2 L}{\partial \mathbf{g}^T \partial \mathbf{g}} = \sum_{j=1}^{m} P_j \big(\Delta_j - \boldsymbol{\beta}_j \boldsymbol{\beta}_j^T\big).$$

For any nonzero $\mathbf{u} \in \mathbb{R}^m$ subject to $\sum_{j=1}^{m} u_j = 0$, it is easily seen that

$$\mathbf{u}^T H \mathbf{u} = \sum_{j=1}^{m} P_j \Big[\sum_{c=1}^{m} \beta_{jc} u_c^2 - \Big(\sum_{c=1}^{m} \beta_{jc} u_c\Big)^2\Big] \geq 0.$$

Here we use the fact that $u^2$ is convex. Moreover, the above inequality is strictly satisfied for any nonzero
Coherence Functions for Multicategory Margin-based Classification Methods u with Pm j=1 uj = 0. This shows that the optimization problem has a strictly local minimum point gˆ. Again, we note that the Hessian matrix is positive semi-definite, so Pm j=1 log h 1+P l6=j exp(1+gl−gj ) i Pj is convex. Thus, gˆ is also the global minimum point. Now we prove that if Pc > Pk, then ˆgc > gˆk. Since gˆ is the solution of equations ∂L ∂gc = 0, it immediately follows that λ = 0 by using Pm c=1 ∂L ∂gc = 0. Hence, Pc Pm l=1 exp(1+ˆgl−gˆc) 1+P l6=c exp(1+ˆgl−gˆc) = Xm j=1 Pj × exp(1+ˆgc−gˆj ) 1+P l6=j exp(1+ˆgl−gˆj ) , from which we get Pc Pk = exp(ˆgc) exp(ˆgk) exp(ˆgc)+P l6=c exp(1+ˆgl) exp(ˆgk) + P l6=k exp(1+ˆgl) (9) = exp(2ˆgc)− exp(1+2ˆgc)+ exp(ˆgc) Pm l=1 exp(1+ˆgl) exp(2ˆgk)− exp(1+2ˆgk)+ exp(ˆgk) Pm l=1 exp(1+ˆgl) > 1. Consequently, 0 > exp(2ˆgc) − exp(2ˆgk) 1 − exp(1) + exp(ˆgc) − exp(ˆgk) Xm l=1 exp(1+ˆgl) = (exp(ˆgc) − exp(ˆgk))h exp(ˆgc) + exp(ˆgk) + X l6=c,k exp(1 + ˆgl) i . Thus we obtain ˆgc > gˆk. From (9), we get (8). B Proof of Theorem 2 First, consider that T log m + S(g(x), c) − C(g(x), c) = T log m exp 1 m P j6=c 1+gj (x)−gc(x) T 1+P j6=c exp 1+gj (x)−gc(x) T ≤ T log 1+P j6=c exp 1+gj (x)−gc(x) T 1+P j6=c exp 1+gj (x)−gc(x) T = 0. Here we use the fact that exp(·) is convex. Second, assume that l = argmaxj{gj (x) + 1 − I[j=c]}. Then T log m + H(g(x), c) − C(g(x), c) = T log m exp 1+gl(x)−gc(x)−I[l=c] T 1+P j6=c exp 1+gj (x)−gc(x) T ≥ 0. Acknowledgements Zhihua Zhang is supported in part by program for Changjiang Scholars and Innovative Research Team in University (IRT0652, PCSIRT), China. Zhang, Li and Yeung are supported by General Research Fund 621407 from the Research Grants Council of the Hong Kong Special Administrative Region, China. Michael Jordan acknowledges support from NSF Grant 0509559 and grants from Google and Microsoft Research. References Allwein, E. L., R. E. Schapire, and Y. Singer (2000). Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research 1, 113–141. Bartlett, P. L., M. I. Jordan, and J. D. McAuliffe (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association 101 (473), 138–156. Cortes, C. and V. Vapnik (1995). Support-vector networks. Machine Learning 20, 273–297. Crammer, K. and Y. Singer (2001). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research 2, 265–292. Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation 21, 256–285. Freund, Y. and R. E. Schapire (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55 (1), 119–139. Friedman, J. H., T. Hastie, and R. Tibshirani (2000). Additive logistic regression: A statistical view of boosting. The Annals of Statistics 28 (2), 337–374. Rose, K., E. Gurewitz, and G. C. Fox (1990). Statistical mechanics and phase transitions in clustering. Physics Review Letters 65, 945–948. Schapire, R. E. and Y. Singer (1999). Improved boosting algorithms using confidence-rated predictions. Machine Learning 37, 297–336. Tewari, A. and P. L. Bartlett (2007). On the consistency of multiclass classification methods. Journal of Machine Learning Research 8, 1007–1025. Wang, G., Z. Zhang, and F. H. Lochovsky (2005). Annealed discriminant analysis. In ECML. Zhang, T. (2004). Statistical analysis of some multicategory large margin classification methods. 
Journal of Machine Learning Research 5, 1225–1251. Zhang, T. and F. Oles (2001). Text categorization based on regularized linear classification methods. Information Retrieval 4, 5–31. Zou, H., J. Zhu, and T. Hastie (2008). New multicategory boosting algorithms based on multicategory Fisherconsistent losses. Annals of Applied Statistics 2 (4), 1290–1306.