φ(x) = c. Thus, we always have I{g_c(x) ≤ 0} = I{φ(x) ≠ c}. Furthermore, let g_1(x) = −g_2(x) = (1/2) f(x) and encode y = 1 if c = 1 and y = −1 if c = 2.
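The binary identity above is easy to check numerically. Here is a minimal sketch (ours, not from the paper) with made-up scores f(x_i) and labels c_i, for which the three expressions for the empirical error below coincide:

```python
import numpy as np

# Hypothetical binary sample: scores f(x_i) and labels c_i in {1, 2}.
f = np.array([0.8, -1.2, 0.3, -0.4, 2.0])
c = np.array([1, 2, 2, 1, 1])

y = np.where(c == 1, 1, -1)          # encode y = 1 if c = 1, y = -1 if c = 2
g = np.stack([f / 2, -f / 2])        # margin vector: g_1 = f/2, g_2 = -f/2
phi = np.where(g[0] > g[1], 1, 2)    # induced classifier phi(x) = argmax_j g_j(x)
g_c = np.where(c == 1, g[0], g[1])   # g_{c_i}(x_i)

err_phi = np.mean(phi != c)          # (1/n) sum_i I{phi(x_i) != c_i}
err_gc  = np.mean(g_c <= 0)          # (1/n) sum_i I{g_{c_i}(x_i) <= 0}
err_yf  = np.mean(y * f <= 0)        # (1/n) sum_i I{y_i f(x_i) <= 0}

print(err_phi, err_gc, err_yf)       # 0.4 0.4 0.4 -- all three agree
```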
Then the empirical error is

(1/n) ∑_{i=1}^n I{φ(x_i) ≠ c_i} = (1/n) ∑_{i=1}^n I{g_{c_i}(x_i) ≤ 0} = (1/n) ∑_{i=1}^n I{y_i f(x_i) ≤ 0}.

In the multicategory case, φ(x) = c implies g_c(x) > 0, but φ(x) ≠ c does not imply g_c(x) ≤ 0. We shall see that g_c(x) ≤ 0 is a sufficient but not necessary condition for φ(x) ≠ c. In general, we only have I{g_c(x) ≤ 0} ≤ I{φ(x) ≠ c}. Although g_j(x) − g_c(x) > 0 for some j ≠ c is a necessary and sufficient condition for φ(x) ≠ c, in most cases it yields an optimization problem that is not easily solved. This is an important reason why it is not trivial to develop multicategory AdaBoost and SVMs on the same principle as binary AdaBoost and SVMs.

3.1. Methodology

Recall that ∑_{j=1}^m g_j(x) = 0 and that there is at least one j ∈ {1, …, m} such that g_j(x) ≠ 0. If g_c(x) ≤ 0, then there exists some l ∈ {1, …, m} such that l ≠ c and g_l(x) > 0. As a result, we have φ(x) ≠ c. Therefore, g_c(x) ≤ 0 implies φ(x) ≠ c. Unfortunately, if φ(x) ≠ c, then g_c(x) ≤ 0 does not necessarily hold. For example, consider the case that m = 3 and c = 1, and assume that g(x) = (2, 3, −5). Then we have φ(x) = 2 ≠ 1 and yet g_1(x) = 2 > 0. In addition, it is clear that φ(x) = c implies g_c(x) > 0. However, g_c(x) > 0 does not imply φ(x) = c.

On the other hand, it is obvious that φ(x) = c is equivalent to φ(x) ≠ j for all j ≠ c. In terms of the above discussion, a sufficient condition for φ(x) = c is that g_j(x) ≤ 0 for all j ≠ c. To summarize, we immediately have the following proposition.

Proposition 5. For (x, c), let g(x) be a margin vector at x and let the induced classifier be φ(x) = argmax_j g_j(x). Then

(a) I{g_c(x) ≤ 0} ≤ I{φ(x) ≠ c} = I{∪_{j≠c} [g_j(x) − g_c(x) > 0]} ≤ I{∪_{j≠c} [g_j(x) > 0]};
(b) I{∩_{j≠c} [g_j(x) ≤ 0]} ≤ I{∩_{j≠c} [g_j(x) − g_c(x) ≤ 0]} = I{φ(x) = c} ≤ I{g_c(x) > 0};
(c) I{∪_{j≠c} [g_j(x) > 0]} ≤ ∑_{j≠c} I{g_j(x) > 0};
(d) I{∪_{j≠c} [g_j(x) − g_c(x) > 0]} ≤ ∑_{j≠c} I{g_j(x) − g_c(x) > 0}.

Proposition 5 shows that g_c(x) ≤ 0 is a sufficient condition for φ(x) ≠ c, while g_j(x) > 0 for some j ≠ c is a necessary condition. The following proposition shows that these conditions become both sufficient and necessary when g has one and only one positive element.

Proposition 6. Under the conditions of Proposition 5, the relationship

I{g_c(x) ≤ 0} = I{φ(x) ≠ c} = I{∪_{j≠c} [g_j(x) − g_c(x) > 0]} = I{∪_{j≠c} [g_j(x) > 0]} = ∑_{j≠c} I{g_j(x) > 0}

holds if and only if the margin vector g(x) has only one positive element.

In the binary case, this relationship always holds because g_1(x) = −g_2(x). Recently, Zou et al. [33] derived multicategory boosting algorithms using exp(−g_c(x)). In their discrete boosting algorithm, the margin vector g(x) is modeled as an m-vector function with one and only one positive element. In this case, I{g_c(x) ≤ 0} is equal to I{φ(x) ≠ c}. Consequently, exp(−g_c(x)) is a majorization of I{φ(x) ≠ c}, because exp(−g_c(x)) is an upper bound of I{g_c(x) ≤ 0}. Therefore, this discrete AdaBoost algorithm still approximates the original empirical 0–1 loss function. In the general case, however, Proposition 6 implies that exp(−g_c(x)) is not a majorization of I{φ(x) ≠ c}.
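These relationships are easy to verify numerically. The following sketch (ours, not from the paper) checks the chain in Proposition 5(a) and (c) on the counterexample g(x) = (2, 3, −5) with c = 1, and shows that equality in Proposition 6 fails because g(x) has two positive elements, so that exp(−g_c(x)) does not upper-bound the 0–1 loss here:

```python
import numpy as np

# Margin vector from the text's counterexample: m = 3, c = 1 (0-indexed as c = 0).
g = np.array([2.0, 3.0, -5.0])
c = 0
phi = int(np.argmax(g))                         # induced classifier phi(x)
others = [j for j in range(len(g)) if j != c]   # indices j != c

mis        = int(phi != c)                               # I{phi(x) != c} = 1
lower      = int(g[c] <= 0)                              # I{g_c(x) <= 0} = 0 (strict gap in (a))
union_diff = int(any(g[j] - g[c] > 0 for j in others))   # I{U_j [g_j - g_c > 0]} = 1
union_pos  = int(any(g[j] > 0 for j in others))          # I{U_j [g_j > 0]} = 1
sum_pos    = sum(int(g[j] > 0) for j in others)          # sum_j I{g_j > 0} = 1

# Proposition 5(a) and (c): lower <= mis = union_diff <= union_pos <= sum_pos.
assert lower <= mis == union_diff <= union_pos <= sum_pos

# Proposition 6 fails: g has two positive elements, so I{g_c <= 0} != I{phi != c} ...
print(lower, mis)                     # 0 1
# ... and exp(-g_c(x)) falls below the 0-1 loss, i.e. it is not a majorization:
print(np.exp(-g[c]) >= mis)           # False (exp(-2) ~ 0.135 < 1)
```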
3.2. Approaches

Proposition 5 provides us with approaches for constructing majorization functions of the 0–1 loss function I{φ(x) ≠ c}. Clearly, ∑_{j≠c} I{g_j(x) > 0} and ∑_{j≠c} I{g_j(x) − g_c(x) > 0} are separable, so they are respectively more tractable than I{∪_{j≠c} [g_j(x) > 0]} and I{∪_{j≠c} [g_j(x) − g_c(x) > 0]}. Thus, ∑_{j≠c} I{g_j(x) > 0} and ∑_{j≠c} I{g_j(x) − g_c(x) > 0} are popularly employed in practical applications.

In particular, suppose η(g_j(x)) upper bounds I{g_j(x) ≤ 0}; that is, η(g_j(x)) ≥ I{g_j(x) ≤ 0}. Note that η(g_j(x)) ≥ I{g_j(x) ≤ 0} if and only if η(−g_j(x)) ≥ I{g_j(x) ≥ 0}. Thus η(−g_j(x)) upper bounds I{g_j(x) ≥ 0}, and hence η(g_j(x) − g_l(x)) upper bounds I{g_l(x) − g_j(x) > 0}. It then follows from Proposition 5 that ∑_{j≠c} η(−g_j(x)) and ∑_{j≠c} η(g_c(x) − g_j(x)) are majorizations of I{φ(x) ≠ c}; a numerical check of these two bounds is sketched below. Consequently, we can define two classes of majorizations for I{φ(x) ≠ c}. The first one is
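As a concrete illustration, here is a brief sketch (ours, not from the paper) that numerically checks both separable surrogate bounds on random margin vectors, taking η(u) = exp(−u) as an assumed example of a function satisfying η(u) ≥ I{u ≤ 0}:

```python
import numpy as np

rng = np.random.default_rng(0)

def zero_one(g, c):
    return float(np.argmax(g) != c)            # I{phi(x) != c}

def surrogate_neg(g, c):
    # sum_{j != c} eta(-g_j(x)) with eta(u) = exp(-u), i.e. sum of exp(g_j)
    return sum(np.exp(g[j]) for j in range(len(g)) if j != c)

def surrogate_diff(g, c):
    # sum_{j != c} eta(g_c(x) - g_j(x)), i.e. sum of exp(g_j - g_c)
    return sum(np.exp(g[j] - g[c]) for j in range(len(g)) if j != c)

for _ in range(1000):
    m = int(rng.integers(3, 6))                # number of classes
    g = rng.normal(size=m)
    g -= g.mean()                              # enforce sum_j g_j(x) = 0
    c = int(rng.integers(m))                   # true label
    # Both surrogates majorize the 0-1 loss, as Proposition 5 guarantees.
    assert surrogate_neg(g, c) >= zero_one(g, c)
    assert surrogate_diff(g, c) >= zero_one(g, c)
print("both bounds held on 1000 random draws")
```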