based on the coherence function. Section 4 develops another multicategory coherence loss that we call the multiclass C-loss. Based on the multiclass C-loss, a multiclass C learning algorithm and a multiclass GentleBoost algorithm are given in Section 5. We conduct empirical analysis of the multicategory large margin algorithms in Section 6, and conclude our work in Section 7. All proofs are deferred to the appendix.

2. A general result on Fisher consistency

Using the notion and notation given in Section 1.1, we now consider a more general setting than the pairwise comparison. Let $g^c(x) = (g_1(x)-g_c(x), \ldots, g_{c-1}(x)-g_c(x), g_{c+1}(x)-g_c(x), \ldots, g_m(x)-g_c(x))^T$. We define $\psi_c(g(x))$ as a function of $g^c(x)$ (hereafter denoted $f(g^c(x))$). It is clear that the pairwise comparison $\psi_c(g(x)) = \sum_{j \neq c} \eta(g_c(x)-g_j(x))$, the multiclass hinge loss of Crammer and Singer [5], the multicategory $\psi$-loss of Liu and Shen [15], and the truncated hinge loss of Wu and Liu [27] follow this generalized definition. Moreover, for these cases, we note that $f(g^c)$ is symmetric.¹

¹ A symmetric function of $p$ variables is one whose value at any $p$-tuple of arguments is the same as its value at any permutation of that $p$-tuple.

Furthermore, we present a unifying definition of $\psi(g) = (\psi_1(g), \ldots, \psi_m(g))^T \colon \mathbb{R}^m \to \mathbb{R}^m$, where we ignore the dependency of $g$ on $x$. Let $\Psi$ be the set of mappings $\psi(g)$ satisfying the following conditions: (i) for fixed $g_c$, $\psi_c(g)$ is symmetric w.r.t. the remaining arguments; and (ii) $\psi_c(g) = \psi_j(g^{jc})$, where $g^{jc}$ is obtained from $g$ by exchanging only $g_c$ and $g_j$. Obviously, the mapping $\psi(g)$ defined via the independent and identical setting, the constrained comparison, or the generalized pairwise comparison with symmetric $f$ belongs to $\Psi$.
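As a quick sanity check of these two conditions (our own verification, not part of the paper), consider the pairwise comparison $\psi_c(g) = \sum_{j \neq c} \eta(g_c - g_j)$. Condition (i) holds because, with $g_c$ fixed, $\psi_c$ is a sum of the same function applied to each remaining argument, and hence is invariant under any permutation of those arguments. For condition (ii), since $g^{jc}_j = g_c$, $g^{jc}_c = g_j$, and $g^{jc}_k = g_k$ for $k \neq c, j$, we have
$$\psi_j(g^{jc}) = \sum_{k \neq j} \eta\big(g^{jc}_j - g^{jc}_k\big) = \eta(g_c - g_j) + \sum_{k \neq c,\, j} \eta(g_c - g_k) = \sum_{k \neq c} \eta(g_c - g_k) = \psi_c(g).$$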
With this notion, we give an important theorem of this paper as follows.

Theorem 3. Let $\psi(g) \in \Psi$ be a twice differentiable function from $\mathbb{R}^m$ to $\mathbb{R}^m$. Assume that the Hessian matrix of $\psi_c(g)$ w.r.t. $g$ is conditionally positive definite for $c = 1, \ldots, m$. Then the minimizer $\hat{g} = (\hat{g}_1, \ldots, \hat{g}_m)^T$ of $\sum_c \psi_c(g(x)) P_c(x)$ in $\mathcal{G}$ exists and is unique. Furthermore, if $\frac{\partial \psi_c(g)}{\partial g_c} - \frac{\partial \psi_c(g)}{\partial g_j} < 0$ for any $g \in \mathcal{G}$ and any $j \neq c$, then $P_l > P_k$ implies $\hat{g}_l > \hat{g}_k$.

The proof of the theorem is given in Appendix A.1.
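To illustrate how the hypotheses of Theorem 3 can be verified for a concrete loss (again, our own illustration rather than the paper's), take the pairwise comparison with $\eta(z) = e^{-z}$, so that $\psi_c(g) = \sum_{j \neq c} e^{g_j - g_c}$. A direct computation of the Hessian's quadratic form gives, for any $y \in \mathbb{R}^m$,
$$y^T \nabla^2 \psi_c(g)\, y = \sum_{j \neq c} e^{g_j - g_c} (y_j - y_c)^2,$$
which vanishes only when $y_j = y_c$ for all $j \neq c$; combined with the constraint $\sum_{j=1}^m y_j = 0$, this forces $y = 0$, so the Hessian is conditionally positive definite (in the sense defined below). Moreover,
$$\frac{\partial \psi_c(g)}{\partial g_c} - \frac{\partial \psi_c(g)}{\partial g_j} = -\sum_{k \neq c} e^{g_k - g_c} - e^{g_j - g_c} < 0$$
for every $g$, so both conditions of the theorem hold for this loss.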
Note that an $m \times m$ real matrix $A$ is said to be conditionally positive definite if $y^T A y > 0$ for any nonzero real vector $y = (y_1, \ldots, y_m)^T$ with $\sum_{j=1}^m y_j = 0$. The condition that $\frac{\partial \psi_c(g)}{\partial g_c} - \frac{\partial \psi_c(g)}{\partial g_j} < 0$ on $\mathcal{G}$ for $j \neq c$ is not necessary for Fisher consistency. For example, in the setting $\psi_c(g(x)) = \eta(g_c(x))$, Zou et al. [33] proved that if $\eta(z)$ is a twice differentiable function with $\eta'(0) < 0$ and $\eta''(z) > 0$ for all $z$, then $\psi_c(g(x))$ is Fisher-consistent. In the setting $\psi_c(g(x)) = \sum_{j \neq c} \eta(-g_j)$, we note that $\sum_c \sum_{j \neq c} \eta(-g_j) P_c = \sum_{c=1}^m \eta(-g_c)(1 - P_c)$. Based on the proof of Zou et al. [33], we have that if $\eta(z)$ is a twice differentiable function with $\eta'(0) < 0$ and $\eta''(z) > 0$ for all $z$, then $\psi_c(g(x)) = \sum_{j \neq c} \eta(-g_j(x))$ is Fisher-consistent. That is, in these two cases, we can relax the condition that $\frac{\partial \psi_c(g)}{\partial g_c} - \frac{\partial \psi_c(g)}{\partial g_j} < 0$ for any $g \in \mathcal{G}$ to the condition $\frac{\partial \psi_c(0)}{\partial g_c} - \frac{\partial \psi_c(0)}{\partial g_j} < 0$ for $j \neq c$.

We have the following corollary, whose proof is given in Appendix A.2. We will see two concrete cases of this corollary (that is, Theorems 10 and 13).

Corollary 4. Assume $\psi_c(g) = f(g^c)$, where $f(z)$ is a symmetric and twice differentiable function from $\mathbb{R}^{m-1}$ to $\mathbb{R}$. If the Hessian matrix of $f(z)$ w.r.t. $z$ is positive definite, then the minimizer $\hat{g} = (\hat{g}_1, \ldots, \hat{g}_m)^T$ of $\sum_c \psi_c(g(x)) P_c(x)$ in $\mathcal{G}$ exists and is unique. Furthermore, if $\frac{\partial \psi_c(g)}{\partial g_c} - \frac{\partial \psi_c(g)}{\partial g_j} < 0$ for any $g$ and any $j \neq c$, then $P_l > P_k$ implies $\hat{g}_l > \hat{g}_k$.
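Conditional positive definiteness is also straightforward to test numerically for a candidate loss. The sketch below is our own illustration (assuming NumPy; the helper names are hypothetical, not from the paper): it builds the Hessian of the exponential pairwise loss from the example above and checks positivity of the quadratic form on the zero-sum subspace $\{y : \sum_j y_j = 0\}$ by restricting the matrix to an orthonormal basis of that subspace.

```python
import numpy as np

def pairwise_exp_hessian(g, c):
    """Hessian of psi_c(g) = sum_{j != c} exp(g_j - g_c) with respect to g."""
    m = len(g)
    h = np.exp(g - g[c])            # h[j] = exp(g_j - g_c)
    H = np.zeros((m, m))
    for j in range(m):
        if j == c:
            continue
        H[j, j] = h[j]              # d^2 psi_c / dg_j^2
        H[j, c] = H[c, j] = -h[j]   # d^2 psi_c / (dg_j dg_c)
        H[c, c] += h[j]             # d^2 psi_c / dg_c^2
    return H

def is_conditionally_positive_definite(A, tol=1e-10):
    """Check that y^T A y > 0 for every nonzero y with sum(y) = 0."""
    m = A.shape[0]
    # The columns e_i - e_m (i = 1, ..., m-1) span the zero-sum subspace;
    # orthonormalize them and restrict A to that subspace.
    B = np.vstack([np.eye(m - 1), -np.ones((1, m - 1))])
    Q, _ = np.linalg.qr(B)
    return bool(np.all(np.linalg.eigvalsh(Q.T @ A @ Q) > tol))

rng = np.random.default_rng(0)
g = rng.normal(size=5)              # a random margin vector with m = 5
print(all(is_conditionally_positive_definite(pairwise_exp_hessian(g, c))
          for c in range(5)))       # expected output: True
```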
Theorem 3 or Corollary 4 shows that $\psi_c(g)$ admits the ISC of Zhang [28]. Thus, under the conditions in Theorem 3 or Corollary 4, we also have a relationship between the approximate minimization of the risk based on $\psi_c$ and the approximate minimization of the classification error. In particular, if
$$\mathbb{E}_X \sum_{c=1}^m \psi_c\big(\hat{g}(X)\big) P_c(X) \le \inf_{g \in \mathcal{G}} \mathbb{E}_X \sum_{c=1}^m \psi_c\big(g(X)\big) P_c(X) + \epsilon_1$$
for some $\epsilon_1 > 0$, then there exists an $\epsilon_2 > 0$ such that
$$\mathbb{E}_X \sum_{c=1,\, c \neq \hat{\phi}(X)}^m P_c(X) \le \mathbb{E}_X \sum_{c=1,\, c \neq \phi^*(X)}^m P_c(X) + \epsilon_2,$$
where $\hat{\phi}(X) = \operatorname{argmax}_j \{\hat{g}_j(X)\}$, $\phi^*(X) = \operatorname{argmax}_j \{P_j(X)\}$, and $\mathbb{E}_X[\sum_{c=1,\, c \neq \phi^*(X)}^m P_c(X)]$ is the optimal error. This result directly follows from Theorem 3 in Zhang [28].

3. Multicategory majorization losses

Given $x$ and its label $c$, we let $g(x)$ be a margin vector at $x$ and the induced classifier be $\phi(x) = \operatorname{argmax}_j g_j(x)$. In the binary case, it is clear that $\phi(x) = c$ if and only if $g_c(x) > 0$, and that $g_c(x) \le 0$ is a necessary and sufficient condition of