for a fixed $T > 0$ and let $\hat{g}(x) = (\hat{g}_1(x), \ldots, \hat{g}_m(x))^T$ be its solution. Then $\hat{g}(x)$ is unique. Moreover, if $P_i(x) < P_j(x)$, we have $\hat{g}_i(x) < \hat{g}_j(x)$. Additionally, if $P_l(x) > 1/2$ or $P_k(x) < 1/m$, then
$$\lim_{T \to 0} \hat{g}_l(x) = 1 + \lim_{T \to 0} \hat{g}_k(x) \;\ge\; 1 + \lim_{T \to 0} \hat{g}_j(x) \quad \text{for } j \neq l, k,$$
whenever the limits exist.

The proof of Theorem 10 is given in Appendix B.4. We see that the limit of $\hat{g}_l(x)$ at $T = 0$ agrees with that shown in Theorem 7. Unfortunately, based on $G_T(g(x), c)$, it is hard to obtain an explicit expression of the class conditional probabilities $P_c(x)$ via the $\hat{g}_c(x)$.

4. Multiclass C-losses

In this section, we present a smooth and Fisher-consistent majorization of the multiclass hinge loss $\xi_c(g(x))$ in (4) using the idea behind the coherence function. We call this new majorization the multiclass C-loss. We will see that this multiclass C-loss bridges the multiclass hinge loss $\xi_c(g(x))$ and the negative multinomial log-likelihood (logit) of the form
$$\gamma_c\big(g(x)\big) = \log \sum_{j=1}^{m} \exp\big(g_j(x)\big) - g_c(x) = \log\Big[1 + \sum_{j \neq c} \exp\big(g_j(x) - g_c(x)\big)\Big]. \tag{10}$$
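(The second expression in (10) follows from the first by a one-line algebraic step, shown here for completeness and not part of the original text: subtracting $g_c(x)$ divides the sum by $\exp(g_c(x))$, and pulling out the $j = c$ term gives
$$\log \sum_{j=1}^{m} \exp\big(g_j(x)\big) - g_c(x) = \log \frac{\sum_{j=1}^{m} \exp(g_j(x))}{\exp(g_c(x))} = \log\Big[1 + \sum_{j \neq c} \exp\big(g_j(x) - g_c(x)\big)\Big].)$$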
In the 0–1 loss the misclassification costs are specified as 1. It is natural to set the misclassification costs as a positive constant $u > 0$. This setting will reveal an important connection between the hinge loss and the logit loss. The empirical error on the training data is then
$$\varepsilon = \frac{u}{n} \sum_{i=1}^{n} I_{\{\phi(x_i) \neq c_i\}}.$$
In this setting, we can extend the multiclass hinge loss $\xi_c(g(x))$ as
$$H_u\big(g(x), c\big) = \max_j \big\{ g_j(x) + u - u I_{\{j=c\}} \big\} - g_c(x). \tag{11}$$
It is clear that $H_u(g(x), c) \ge u I_{\{\phi(x) \neq c\}}$. To establish the connection among the multiclass C-loss, the multiclass hinge loss and the logit loss, we employ this setting to present the definition of the multiclass C-loss.

We now express $\max_j \{ g_j(x) + u - u I_{\{j=c\}} \}$ as $\sum_{j=1}^{m} \omega_j^c(x)\,[ g_j(x) + u - u I_{\{j=c\}} ]$, where
$$\omega_j^c(x) = \begin{cases} 1 & j = \operatorname{argmax}_l \{ g_l(x) + u - u I_{\{l=c\}} \}, \\ 0 & \text{otherwise}. \end{cases}$$
Motivated by the idea behind deterministic annealing [20], we relax this hard function $\omega_j^c(x)$, retaining only $\omega_j^c(x) \ge 0$ and $\sum_{j=1}^{m} \omega_j^c(x) = 1$. With such soft $\omega_j^c(x)$, we maximize $\sum_{j=1}^{m} \omega_j^c(x)[g_j(x) + u - u I_{\{j=c\}}] - g_c(x)$ under entropy penalization; namely,
$$\max_{\{\omega_j^c(x)\}} \; F \triangleq \sum_{j=1}^{m} \omega_j^c(x)\big[ g_j(x) + u - u I_{\{j=c\}} \big] - g_c(x) - T \sum_{j=1}^{m} \omega_j^c(x) \log \omega_j^c(x), \tag{12}$$
where $T > 0$ is also referred to as the temperature. The maximization of $F$ w.r.t. the $\omega_j^c(x)$ is straightforward, and it gives rise to the following distribution,
$$\omega_j^c(x) = \frac{\exp\big[\frac{g_j(x) + u - u I_{\{j=c\}}}{T}\big]}{\sum_{l} \exp\big[\frac{g_l(x) + u - u I_{\{l=c\}}}{T}\big]}, \tag{13}$$
based on the Karush–Kuhn–Tucker condition. The corresponding maximum of $F$ is obtained by plugging (13) back into (12):
$$C_{T,u}\big(g(x), c\big) \triangleq T \log\Big[ 1 + \sum_{j \neq c} \exp\Big(\frac{u + g_j(x) - g_c(x)}{T}\Big) \Big], \quad T > 0,\; u > 0. \tag{14}$$
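The bridging claim can be made concrete directly from (14): setting $T = 1$ and $u = 0$ recovers the logit loss (10), letting $T \to 0$ recovers the extended hinge loss (11), and for any $T > 0$ the C-loss upper-bounds $H_u$. The following minimal Python sketch (not from the paper; the helper names and test values are ours) checks these three facts numerically:

    import numpy as np
    from scipy.special import logsumexp

    def c_loss(g, c, T, u):
        # Multiclass C-loss C_{T,u}(g(x), c) of Eq. (14); the j = c term contributes exp(0) = 1.
        margins = np.array([u * (j != c) + g[j] - g[c] for j in range(len(g))])
        return T * logsumexp(margins / T)

    def hinge_u(g, c, u):
        # Extended multiclass hinge loss H_u(g(x), c) of Eq. (11).
        return max(g[j] + u * (j != c) for j in range(len(g))) - g[c]

    def logit_loss(g, c):
        # Negative multinomial log-likelihood (logit) loss of Eq. (10).
        return logsumexp(g - g[c])

    g, c, u = np.array([0.3, -0.1, 0.8]), 0, 1.0
    print(np.isclose(c_loss(g, c, T=1.0, u=0.0), logit_loss(g, c)))  # True: C_{1,0} equals the logit loss
    print(np.isclose(c_loss(g, c, T=1e-3, u=u), hinge_u(g, c, u)))   # True: T -> 0 recovers H_u
    print(c_loss(g, c, T=0.5, u=u) >= hinge_u(g, c, u))              # True: C_{T,u} majorizes H_u

The sketch uses a log-sum-exp routine for numerical stability at small $T$; the values of $g$, $c$, and $u$ are arbitrary illustrative choices.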