discrete boosting algorithms, the margin vector $g(x)$ is modeled as an $m$-vector function with one and only one positive element. In this case, $I_{[g_c(x)\le 0]}$ is equal to $I_{[\phi(x)\neq c]}$. Consequently, $\exp(-g_c(x))$ is a majorization of $I_{[\phi(x)\neq c]}$. Therefore, the discrete AdaBoost algorithms of Zou et al. (2008) still approximate the original empirical 0-1 loss function. However, in the general case, $\exp(-g_c(x))$ is not a majorization of $I_{[\phi(x)\neq c]}$. Thus the multicategory GentleBoost algorithm of Zou et al. (2008) is not a margin-based method.

Friedman et al. (2000) proposed a multicategory LogitBoost algorithm by using the negative multinomial log-likelihood function, which is given by
$$
L(g(x), c) = \log \sum_{j=1}^{m} \exp\big(g_j(x) - g_c(x)\big) = \log\Big[1 + \sum_{j\neq c} \exp\big(g_j(x) - g_c(x)\big)\Big] \tag{2}
$$
at $(x, c)$. Although $\log[1 + \exp(-g_c(x))]$ is an upper bound of $\log(2)\, I_{[g_c(x)\le 0]}$, it is not a majorization of $I_{[\phi(x)\neq c]}$. However, $L(g(x), c)$ is a majorization of $I_{[\phi(x)\neq c]}$ because $L(g(x), c) \ge \log(2)\, I_{[\phi(x)\neq c]}$: whenever $\phi(x) \neq c$, some $j \neq c$ attains $g_j(x) \ge g_c(x)$, so the sum inside the logarithm in (2) is at least 2. Thus, the multicategory LogitBoost algorithm (Friedman et al., 2000) is also a margin-based method.

It is worth noting that $\log[1 + \exp(-g_c(x))]$ is a majorization of $I_{[\phi(x)\neq c]}$ if the margin vector $g(x)$ has only one positive element. Unfortunately, when this majorization or $L(g(x), c)$ is used to derive multicategory discrete boosting algorithms, a closed-form solution no longer exists.

In the case of the multicategory SVM algorithm, Crammer and Singer (2001) used the surrogate
$$
H(g(x), c) = \max_j \big\{ g_j(x) + 1 - I_{[j=c]} \big\} - g_c(x). \tag{3}
$$
It is easily seen that
$$
I_{[\phi(x)\neq c]} = I_{[\exists\, j\neq c,\ g_j(x) - g_c(x) > 0]} \le \max_j \big\{ g_j(x) + 1 - I_{[j=c]} \big\} - g_c(x).
$$
This shows that $H(g(x), c)$ is a majorization of $I_{[\phi(x)\neq c]}$, but it is Fisher consistent only when $\max_l P_l(x) > 1/2$ (Zhang, 2004).
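Both majorization claims are easy to sanity-check numerically. The sketch below is ours rather than part of the paper; it assumes NumPy, and the class count and the Gaussian sampling of margin vectors are arbitrary illustrative choices. It confirms on random draws that the LogitBoost loss (2) dominates $\log(2)\, I_{[\phi(x)\neq c]}$ and that the surrogate (3) dominates the 0-1 loss.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4  # illustrative number of classes

def zero_one(g, c):
    # I[phi(x) != c], where phi(x) = argmax_j g_j(x)
    return float(np.argmax(g) != c)

def logit_loss(g, c):
    # L(g(x), c) = log sum_j exp(g_j(x) - g_c(x)), Eq. (2)
    return np.log(np.exp(g - g[c]).sum())

def cs_hinge(g, c):
    # H(g(x), c) = max_j {g_j(x) + 1 - I[j=c]} - g_c(x), Eq. (3)
    margins = g + 1.0
    margins[c] -= 1.0  # remove the I[j=c] term for the true class
    return margins.max() - g[c]

for _ in range(10_000):
    g = rng.normal(size=m)   # a random margin vector g(x)
    c = int(rng.integers(m))
    assert logit_loss(g, c) >= np.log(2) * zero_one(g, c) - 1e-12
    assert cs_hinge(g, c) >= zero_one(g, c) - 1e-12
print("both surrogates majorize the 0-1 loss on all draws")
```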
3 Coherence Functions

Since hinge-type loss functions are not smooth, existing multicategory SVMs do not directly estimate the class probability $P_c(x)$. Moreover, it is rare to devise a boosting algorithm with them. However, logistic regression extends naturally from binary classification to multicategory classification by simply using the multinomial likelihood in place of the binomial likelihood. In this section, we present a smooth and Fisher-consistent majorization loss which bridges hinge-type losses and the negative multinomial log-likelihood. Thus, it is applicable to the construction of multicategory margin-based classifiers.

3.1 Definition

In order to obtain a majorization function of $I_{[\phi(x)\neq c]}$, we express $\max_j \{ g_j(x) + 1 - I_{[j=c]} \}$ as $\sum_{j=1}^{m} \beta_j^c(x) \big[ 1 + g_j(x) - I_{[j=c]} \big]$, where
$$
\beta_j^c(x) = \begin{cases} 1 & j = \operatorname{argmax}_l \{ g_l(x) + 1 - I_{[l=c]} \}, \\ 0 & \text{otherwise}. \end{cases}
$$
Motivated by the idea behind deterministic annealing (Rose et al., 1990), we relax this hard function $\beta_j^c(x)$, retaining only the constraints $\beta_j^c(x) > 0$ and $\sum_{j=1}^{m} \beta_j^c(x) = 1$. With respect to a soft $\beta_j^c(x)$ respecting these constraints, we maximize $\sum_{j=1}^{m} \beta_j^c(x) \big[ 1 + g_j(x) - I_{[j=c]} \big]$ under an entropy constraint, namely,
$$
\max_{\{\beta_j^c(x)\}} \Big\{ F = \sum_{j=1}^{m} \beta_j^c(x) \big[ 1 + g_j(x) - I_{[j=c]} \big] - T \sum_{j=1}^{m} \beta_j^c(x) \log \beta_j^c(x) \Big\}, \tag{4}
$$
where we refer to $T > 0$ as a temperature.

The maximization of $F$ w.r.t. $\beta_j^c(x)$ is straightforward, and it gives rise to the following distribution:
$$
\beta_j^c(x) = \frac{\exp\big[\frac{1 + g_j(x) - I_{[j=c]}}{T}\big]}{\sum_{l} \exp\big[\frac{1 + g_l(x) - I_{[l=c]}}{T}\big]}. \tag{5}
$$
The corresponding maximum of $F$ is obtained by plugging (5) back into (4):
$$
F^* = T \log \sum_{j} \exp\Big[\frac{1 + g_j(x) - I_{[j=c]}}{T}\Big].
$$
Note that for $T > 0$ we have
$$
T \log\Big[ 1 + \sum_{j\neq c} \exp\Big( \frac{1 + g_j(x) - g_c(x)}{T} \Big) \Big] = T \log \sum_{j} \exp\Big( \frac{1 + g_j(x) - I_{[j=c]}}{T} \Big) - g_c(x) \ge \max_j \big\{ g_j(x) + 1 - I_{[j=c]} \big\} - g_c(x) \ge I_{[\phi(x)\neq c]}.
$$
This thus leads us to the following family of majorization functions of $I_{[\phi(x)\neq c]}$:
$$
C(g(x), c) = T \log\Big[ 1 + \sum_{j\neq c} \exp\Big( \frac{1 + g_j(x) - g_c(x)}{T} \Big) \Big], \qquad T > 0. \tag{6}
$$
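The closed form (5) and the value $F^*$ can be verified numerically. Below is a small sketch of ours (not from the paper), assuming NumPy; the class count $m$, temperature $T$, margin vector, and label are arbitrary illustrative choices. It checks that the softmax distribution (5) attains $F^*$ and that random alternative distributions over the $m$ classes never exceed it.

```python
import numpy as np

rng = np.random.default_rng(0)
m, T = 4, 0.5            # illustrative class count and temperature
g = rng.normal(size=m)   # a random margin vector g(x)
c = 1                    # the observed class label

a = 1.0 + g - (np.arange(m) == c)  # a_j = 1 + g_j(x) - I[j=c]

def F(beta):
    # objective of Eq. (4): relaxed margin plus entropy at temperature T
    return beta @ a - T * np.sum(beta * np.log(beta))

# Eq. (5): the maximizer is a softmax of the a_j at temperature T
beta_star = np.exp(a / T) / np.exp(a / T).sum()

# its objective value matches F* = T log sum_j exp(a_j / T)
F_star = T * np.log(np.exp(a / T).sum())
assert np.isclose(F(beta_star), F_star)

# no other distribution over the m classes does better
for _ in range(1000):
    beta = rng.dirichlet(np.ones(m))
    assert F(beta) <= F_star + 1e-9
print("Eq. (5) attains the maximum F* of Eq. (4)")
```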
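As a final sketch (again ours, with the same illustrative assumptions as above), the chain of inequalities preceding (6) can be observed directly: for every $T > 0$, $C(g(x), c) \ge H(g(x), c) \ge I_{[\phi(x)\neq c]}$, and as $T \to 0^+$ the coherence loss tightens toward the hinge-type surrogate (3), illustrating the claimed bridge between hinge losses and the logistic form. The log-sum-exp is evaluated in a numerically stable way so that small temperatures remain well behaved.

```python
import numpy as np

def cs_hinge(g, c):
    # H(g(x), c) = max_j {g_j(x) + 1 - I[j=c]} - g_c(x), Eq. (3)
    margins = g + 1.0
    margins[c] -= 1.0
    return margins.max() - g[c]

def coherence_loss(g, c, T):
    # C(g(x), c) = T log[1 + sum_{j!=c} exp((1 + g_j(x) - g_c(x)) / T)], Eq. (6),
    # computed as a stable log-sum-exp over {0} and the shifted margins
    z = np.concatenate(([0.0], (1.0 + np.delete(g, c) - g[c]) / T))
    zmax = z.max()
    return T * (zmax + np.log(np.exp(z - zmax).sum()))

rng = np.random.default_rng(1)
g = rng.normal(size=5)  # a random margin vector g(x)
c = 2
h = cs_hinge(g, c)
zero_one = float(np.argmax(g) != c)
for T in (2.0, 1.0, 0.1, 0.001):
    cT = coherence_loss(g, c, T)
    assert cT >= h >= zero_one  # the majorization chain above
    print(f"T={T:6.3f}  C={cT:.6f}  H={h:.6f}  0-1={zero_one}")
# As T decreases, C(g(x), c) approaches H(g(x), c) from above.
```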