Z. Zhang et al. / Artificial Intelligence 215 (2014) 55–78
for a fixed $T > 0$ and let $\hat{g}(x) = (\hat{g}_1(x), \ldots, \hat{g}_m(x))^T$ be its solution. Then $\hat{g}(x)$ is unique. Moreover, if $P_i(x) < P_j(x)$, we have $\hat{g}_i(x) < \hat{g}_j(x)$. Additionally, if $P_l(x) > 1/2$ or $P_k(x) < 1/m$, then
$$\lim_{T \to 0} \hat{g}_l(x) = 1 + \lim_{T \to 0} \hat{g}_k(x) \geq 1 + \lim_{T \to 0} \hat{g}_j(x) \quad \text{for } j \neq l, k,$$
whenever the limits exist.

The proof of Theorem 10 is given in Appendix B.4. We see that the limit of $\hat{g}_l(x)$ at $T = 0$ agrees with that shown in Theorem 7. Unfortunately, based on $G_T(g(x), c)$, it is hard to obtain an explicit expression of the class conditional probabilities $P_c(x)$ via the $\hat{g}_c(x)$.

4. Multiclass C-losses

In this section, we present a smooth and Fisher-consistent majorization of the multiclass hinge loss $\xi_c(g(x))$ in (4) using the idea behind the coherence function. We call this new majorization the multiclass C-loss. We will see that this multiclass C-loss bridges the multiclass hinge loss $\xi_c(g(x))$ and the negative multinomial log-likelihood (logit) of the form
$$\gamma_c(g(x)) = \log \sum_{j=1}^m \exp(g_j(x)) - g_c(x) = \log\Big[1 + \sum_{j \neq c} \exp\big(g_j(x) - g_c(x)\big)\Big]. \quad (10)$$

In the 0–1 loss the misclassification costs are specified as 1. It is natural to set the misclassification costs as a positive constant $u > 0$; this setting will reveal an important connection between the hinge loss and the logit loss. The empirical error on the training data is then
$$\hat{\varepsilon} = \frac{u}{n} \sum_{i=1}^n I_{\{\phi(x_i) \neq c_i\}}.$$
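The logit loss in (10) is usually evaluated with the log-sum-exp trick so that large scores do not overflow. A minimal sketch (the function name and toy scores are our own, not from the paper):

```python
import math

def logit_loss(g, c):
    """Negative multinomial log-likelihood, Eq. (10):
    gamma_c(g) = log(sum_j exp(g_j)) - g_c.
    Subtracting max_j g_j before exponentiating keeps exp() in range."""
    m = max(g)
    return m + math.log(sum(math.exp(gj - m) for gj in g)) - g[c]

# Toy scores for m = 3 classes; class 0 has the largest score, so the
# loss for c = 0 is small (but, unlike the hinge loss, never exactly zero).
g = [2.0, 0.5, -1.0]
print(logit_loss(g, 0))  # small positive value
print(logit_loss(g, 2))  # larger: class 2 is badly scored
```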
In this setting, we can extend the multiclass hinge loss $\xi_c(g(x))$ as
$$H_u\big(g(x), c\big) = \max_j \big\{g_j(x) + u - u I_{\{j=c\}}\big\} - g_c(x). \quad (11)$$
It is clear that $H_u(g(x), c) \geq u I_{\{\phi(x) \neq c\}}$. To establish the connection among the multiclass C-loss, the multiclass hinge loss and the logit loss, we employ this setting to present the definition of the multiclass C-loss.

We now express $\max_j \{g_j(x) + u - u I_{\{j=c\}}\}$ as $\sum_{j=1}^m \omega_j^c(x) [g_j(x) + u - u I_{\{j=c\}}]$ where
$$\omega_j^c(x) = \begin{cases} 1 & j = \operatorname{argmax}_l \{g_l(x) + u - u I_{\{l=c\}}\}, \\ 0 & \text{otherwise}. \end{cases}$$

Motivated by the idea behind deterministic annealing [20], we relax this hard function $\omega_j^c(x)$, retaining only $\omega_j^c(x) \geq 0$ and $\sum_{j=1}^m \omega_j^c(x) = 1$. With such soft $\omega_j^c(x)$, we maximize $\sum_{j=1}^m \omega_j^c(x)[g_j(x) + u - u I_{\{j=c\}}] - g_c(x)$ under entropy penalization; namely,
$$\max_{\{\omega_j^c(x)\}} F \triangleq \sum_{j=1}^m \omega_j^c(x) \big[g_j(x) + u - u I_{\{j=c\}}\big] - g_c(x) - T \sum_{j=1}^m \omega_j^c(x) \log \omega_j^c(x), \quad (12)$$
where $T > 0$ is also referred to as the temperature. The maximization of $F$ w.r.t. $\omega_j^c(x)$ is straightforward, and it gives rise to the following distribution
$$\omega_j^c(x) = \frac{\exp\big[\big(g_j(x) + u - u I_{\{j=c\}}\big)/T\big]}{\sum_{l=1}^m \exp\big[\big(g_l(x) + u - u I_{\{l=c\}}\big)/T\big]} \quad (13)$$
based on the Karush–Kuhn–Tucker conditions. The corresponding maximum of $F$ is obtained by plugging (13) back into (12):
$$C_{T,u}\big(g(x), c\big) \triangleq T \log\Big[1 + \sum_{j \neq c} \exp\Big(\frac{u + g_j(x) - g_c(x)}{T}\Big)\Big], \quad T > 0,\ u > 0. \quad (14)$$
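The cost-scaled hinge loss (11) and the majorization $H_u(g(x), c) \geq u I_{\{\phi(x) \neq c\}}$ are easy to check numerically. A sketch with our own toy scores, taking $\phi(x) = \operatorname{argmax}_j g_j(x)$:

```python
def hinge_u(g, c, u):
    """Multiclass hinge loss with misclassification cost u, Eq. (11):
    H_u(g, c) = max_j {g_j + u - u*[j == c]} - g_c."""
    return max(gj + u - (u if j == c else 0) for j, gj in enumerate(g)) - g[c]

def zero_one_u(g, c, u):
    """Cost-scaled 0-1 loss u*I{phi(x) != c}, with phi(x) = argmax_j g_j."""
    phi = max(range(len(g)), key=lambda j: g[j])
    return u if phi != c else 0.0

u = 1.0
g = [2.0, 0.5, -1.0]
for c in range(3):
    # H_u majorizes the scaled 0-1 loss whatever the true label c is.
    assert hinge_u(g, c, u) >= zero_one_u(g, c, u)
```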
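The entropy-regularized weights (13) are a temperature-scaled softmax over the shifted scores $g_j(x) + u - u I_{\{j=c\}}$; as $T \to 0$ they collapse back to the hard argmax assignment, in line with the deterministic-annealing view. A minimal sketch (toy scores are ours):

```python
import math

def omega(g, c, T, u):
    """Soft assignment weights of Eq. (13): a softmax over the shifted
    scores a_j = g_j + u - u*[j == c], at temperature T."""
    a = [gj + u - (u if j == c else 0) for j, gj in enumerate(g)]
    m = max(a)  # subtract the max for numerical stability
    w = [math.exp((aj - m) / T) for aj in a]
    s = sum(w)
    return [wj / s for wj in w]

g, c, u = [2.0, 0.5, -1.0], 1, 1.0
for T in (1.0, 0.1):
    assert abs(sum(omega(g, c, T, u)) - 1.0) < 1e-12  # a valid distribution
# As T -> 0 the weights concentrate on argmax_j {g_j + u - u*[j == c]}.
print(omega(g, c, 0.05, u))  # nearly one-hot on class 0
```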
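The bridging behavior of the closed form (14) can also be verified numerically: $C_{T,u}$ upper-bounds the hinge loss $H_u$ of (11) and tightens as $T \to 0$, while at $T = 1$, $u = 0$ it coincides with the logit loss (10). A sketch under our own toy scores:

```python
import math

def c_loss(g, c, T, u):
    """Multiclass C-loss, Eq. (14):
    C_{T,u}(g, c) = T * log(1 + sum_{j != c} exp((u + g_j - g_c) / T))."""
    return T * math.log(1 + sum(math.exp((u + g[j] - g[c]) / T)
                                for j in range(len(g)) if j != c))

def hinge_u(g, c, u):
    """Multiclass hinge loss, Eq. (11)."""
    return max(gj + u - (u if j == c else 0) for j, gj in enumerate(g)) - g[c]

g, c, u = [2.0, 0.5, -1.0], 1, 1.0

# C_{T,u} upper-bounds H_u for every T > 0 and tightens as T decreases.
for T in (1.0, 0.1, 0.01):
    assert c_loss(g, c, T, u) >= hinge_u(g, c, u)
print(c_loss(g, c, 0.01, u) - hinge_u(g, c, u))  # close to 0

# At T = 1 and u = 0, Eq. (14) reduces to the logit loss of Eq. (10).
logit = math.log(1 + sum(math.exp(g[j] - g[c]) for j in range(3) if j != c))
assert abs(c_loss(g, c, 1.0, 0.0) - logit) < 1e-12
```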