Z. Zhang et al. / Artificial Intelligence 215 (2014) 55–78
for a fixed $T > 0$ and let $\hat{g}(x) = (\hat{g}_1(x), \ldots, \hat{g}_m(x))^T$ be its solution. Then $\hat{g}(x)$ is unique. Moreover, if $P_i(x) < P_j(x)$, we have $\hat{g}_i(x) < \hat{g}_j(x)$. Additionally, if $P_l(x) > 1/2$ or $P_k(x) < 1/m$, then
$$\lim_{T \to 0} \hat{g}_l(x) = 1 + \lim_{T \to 0} \hat{g}_k(x) \geq 1 + \lim_{T \to 0} \hat{g}_j(x) \quad \text{for } j \neq l, k,$$
whenever the limits exist.

The proof of Theorem 10 is given in Appendix B.4. We see that the limit of $\hat{g}_l(x)$ at $T = 0$ agrees with that shown in Theorem 7. Unfortunately, based on $G_T(g(x), c)$, it is hard to obtain an explicit expression of the class conditional probabilities $P_c(x)$ via the $\hat{g}_c(x)$.

4. Multiclass C-losses

In this section, we present a smooth and Fisher-consistent majorization of the multiclass hinge loss $\xi_c(g(x))$ in (4) using the idea behind the coherence function. We call this new majorization the multiclass C-loss. We will see that this multiclass C-loss bridges the multiclass hinge loss $\xi_c(g(x))$ and the negative multinomial log-likelihood (logit) of the form
$$\gamma_c(g(x)) = \log \sum_{j=1}^m \exp(g_j(x)) - g_c(x) = \log\Big[1 + \sum_{j \neq c} \exp\big(g_j(x) - g_c(x)\big)\Big]. \quad (10)$$

In the 0–1 loss the misclassification costs are specified as 1. It is natural to set the misclassification costs as a positive constant $u > 0$; this setting will reveal an important connection between the hinge loss and the logit loss. The empirical error on the training data is then
$$\hat{\varepsilon} = \frac{u}{n} \sum_{i=1}^n I_{\{\phi(x_i) \neq c_i\}}.$$
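The logit loss in (10) is usually evaluated with the log-sum-exp trick so that large scores do not overflow. A minimal sketch (the function name and toy scores are our own, not from the paper):

```python
import math

def logit_loss(g, c):
    """Negative multinomial log-likelihood, Eq. (10):
    gamma_c(g) = log(sum_j exp(g_j)) - g_c.
    Subtracting max_j g_j before exponentiating keeps exp() in range."""
    m = max(g)
    return m + math.log(sum(math.exp(gj - m) for gj in g)) - g[c]

# Toy scores for m = 3 classes; class 0 has the largest score, so the
# loss for c = 0 is small (but, unlike the hinge loss, never exactly zero).
g = [2.0, 0.5, -1.0]
print(logit_loss(g, 0))  # small positive value
print(logit_loss(g, 2))  # larger: class 2 is badly scored
```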
In this setting, we can extend the multiclass hinge loss $\xi_c(g(x))$ as
$$H_u\big(g(x), c\big) = \max_j \big\{g_j(x) + u - u I_{\{j=c\}}\big\} - g_c(x). \quad (11)$$
It is clear that $H_u(g(x), c) \geq u I_{\{\phi(x) \neq c\}}$. To establish the connection among the multiclass C-loss, the multiclass hinge loss and the logit loss, we employ this setting to present the definition of the multiclass C-loss.

We now express $\max_j \{g_j(x) + u - u I_{\{j=c\}}\}$ as $\sum_{j=1}^m \omega_j^c(x) [g_j(x) + u - u I_{\{j=c\}}]$ where
$$\omega_j^c(x) = \begin{cases} 1 & j = \operatorname{argmax}_l \{g_l(x) + u - u I_{\{l=c\}}\}, \\ 0 & \text{otherwise}. \end{cases}$$

Motivated by the idea behind deterministic annealing [20], we relax this hard function $\omega_j^c(x)$, retaining only $\omega_j^c(x) \geq 0$ and $\sum_{j=1}^m \omega_j^c(x) = 1$. With such soft $\omega_j^c(x)$, we maximize $\sum_{j=1}^m \omega_j^c(x)[g_j(x) + u - u I_{\{j=c\}}] - g_c(x)$ under entropy penalization; namely,
$$\max_{\{\omega_j^c(x)\}} F \triangleq \sum_{j=1}^m \omega_j^c(x) \big[g_j(x) + u - u I_{\{j=c\}}\big] - g_c(x) - T \sum_{j=1}^m \omega_j^c(x) \log \omega_j^c(x), \quad (12)$$
where $T > 0$ is also referred to as the temperature. The maximization of $F$ w.r.t. $\omega_j^c(x)$ is straightforward, and it gives rise to the following distribution
$$\omega_j^c(x) = \frac{\exp\big[\big(g_j(x) + u - u I_{\{j=c\}}\big)/T\big]}{\sum_{l=1}^m \exp\big[\big(g_l(x) + u - u I_{\{l=c\}}\big)/T\big]} \quad (13)$$
based on the Karush–Kuhn–Tucker conditions. The corresponding maximum of $F$ is obtained by plugging (13) back into (12):
$$C_{T,u}\big(g(x), c\big) \triangleq T \log\Big[1 + \sum_{j \neq c} \exp\Big(\frac{u + g_j(x) - g_c(x)}{T}\Big)\Big], \quad T > 0,\ u > 0. \quad (14)$$
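The cost-scaled hinge loss (11) and the majorization $H_u(g(x), c) \geq u I_{\{\phi(x) \neq c\}}$ are easy to check numerically. A sketch with our own toy scores, taking $\phi(x) = \operatorname{argmax}_j g_j(x)$:

```python
def hinge_u(g, c, u):
    """Multiclass hinge loss with misclassification cost u, Eq. (11):
    H_u(g, c) = max_j {g_j + u - u*[j == c]} - g_c."""
    return max(gj + u - (u if j == c else 0) for j, gj in enumerate(g)) - g[c]

def zero_one_u(g, c, u):
    """Cost-scaled 0-1 loss u*I{phi(x) != c}, with phi(x) = argmax_j g_j."""
    phi = max(range(len(g)), key=lambda j: g[j])
    return u if phi != c else 0.0

u = 1.0
g = [2.0, 0.5, -1.0]
for c in range(3):
    # H_u majorizes the scaled 0-1 loss whatever the true label c is.
    assert hinge_u(g, c, u) >= zero_one_u(g, c, u)
```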
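The entropy-regularized weights (13) are a temperature-scaled softmax over the shifted scores $g_j(x) + u - u I_{\{j=c\}}$; as $T \to 0$ they collapse back to the hard argmax assignment, in line with the deterministic-annealing view. A minimal sketch (toy scores are ours):

```python
import math

def omega(g, c, T, u):
    """Soft assignment weights of Eq. (13): a softmax over the shifted
    scores a_j = g_j + u - u*[j == c], at temperature T."""
    a = [gj + u - (u if j == c else 0) for j, gj in enumerate(g)]
    m = max(a)  # subtract the max for numerical stability
    w = [math.exp((aj - m) / T) for aj in a]
    s = sum(w)
    return [wj / s for wj in w]

g, c, u = [2.0, 0.5, -1.0], 1, 1.0
for T in (1.0, 0.1):
    assert abs(sum(omega(g, c, T, u)) - 1.0) < 1e-12  # a valid distribution
# As T -> 0 the weights concentrate on argmax_j {g_j + u - u*[j == c]}.
print(omega(g, c, 0.05, u))  # nearly one-hot on class 0
```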
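The bridging behavior of the closed form (14) can also be verified numerically: $C_{T,u}$ upper-bounds the hinge loss $H_u$ of (11) and tightens as $T \to 0$, while at $T = 1$, $u = 0$ it coincides with the logit loss (10). A sketch under our own toy scores:

```python
import math

def c_loss(g, c, T, u):
    """Multiclass C-loss, Eq. (14):
    C_{T,u}(g, c) = T * log(1 + sum_{j != c} exp((u + g_j - g_c) / T))."""
    return T * math.log(1 + sum(math.exp((u + g[j] - g[c]) / T)
                                for j in range(len(g)) if j != c))

def hinge_u(g, c, u):
    """Multiclass hinge loss, Eq. (11)."""
    return max(gj + u - (u if j == c else 0) for j, gj in enumerate(g)) - g[c]

g, c, u = [2.0, 0.5, -1.0], 1, 1.0

# C_{T,u} upper-bounds H_u for every T > 0 and tightens as T decreases.
for T in (1.0, 0.1, 0.01):
    assert c_loss(g, c, T, u) >= hinge_u(g, c, u)
print(c_loss(g, c, 0.01, u) - hinge_u(g, c, u))  # close to 0

# At T = 1 and u = 0, Eq. (14) reduces to the logit loss of Eq. (10).
logit = math.log(1 + sum(math.exp(g[j] - g[c]) for j in range(3) if j != c))
assert abs(c_loss(g, c, 1.0, 0.0) - logit) < 1e-12
```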