problems to multiclass classification problems in the same margin principle [25,3,26,12,5,13]. Thus, one seemingly natural approach to constructing a classifier for the binary and multiclass problems is to consider a smooth loss function. For example, regularized logistic regression models based on the negative multinomial log-likelihood function (also called the logit loss) [31,10] are competitive with SVMs. Moreover, it is natural to exploit the logit loss in the development of a multicategory boosting algorithm [9]. Recently, Zhang et al. [30] proposed a smooth loss function, called the coherence function, for developing binary large margin classification methods. The coherence function establishes a bridge between the hinge loss and the logit loss. In this paper, we study the application of the coherence function in the multiclass classification problem.

1.1. Multicategory margin classification methods

We are concerned with an m-class (m > 2) classification problem with a set of training data points {(x_i, c_i)}_{i=1}^{n}, where x_i ∈ X ⊂ R^p is an input vector and c_i ∈ {1, 2, ..., m} is its corresponding class label. We assume that each x belongs to one and only one class. Our goal is to find a classifier φ(x): X → c ∈ {1, ..., m}.

Let P_c(x) = Pr(C = c | X = x), c = 1, ..., m, be the class conditional probabilities given x. The expected error at x is then defined by

\[
\sum_{c=1}^{m} I_{\{\phi(x) \neq c\}} P_c(x),
\]

where I_{\{#\}} is 1 if # is true and 0 otherwise. The empirical error on the training data is thus given by

\[
\epsilon = \frac{1}{n} \sum_{i=1}^{n} I_{\{\phi(x_i) \neq c_i\}}.
\]

Given that ε is equal to its minimum value zero when all training data points are correctly classified, we wish to use ε as a basis for devising classification methods.

Suppose the classifier is modeled using an m-vector g(x) = (g_1(x), ..., g_m(x))^T, where the induced classifier is obtained via maximization in a manner akin to discriminant analysis: φ(x) = argmax_j {g_j(x)}. For simplicity of our analysis, we assume that for a fixed x, each g_j itself lies in a compact set. We also assume that the maximizing argument of max_j g_j(x) is unique. Of course this excludes the trivial case that g_j = 0 for all j ∈ {1, ..., m}. However, this assumption does not imply that g itself is unique; indeed, adding a constant to each component g_j(x) does not change the maximizing argument. To remove this redundancy, it is convenient to impose a sum-to-zero constraint. Thus we define

\[
\mathcal{G} = \Big\{ \big(g_1(x), \ldots, g_m(x)\big)^{T} \;\Big|\; \sum_{j=1}^{m} g_j(x) = 0 \Big\}
\]

and assume g ∈ G in this paper unless otherwise specified. Zou et al. [33] referred to such a g as the margin vector. Liu and Shen [15] referred to max_j (g_j(x) − g_c(x)) as the generalized margin of (x, c) with respect to (w.r.t.) g.
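As a concrete numerical illustration of the setup above, the following minimal NumPy sketch (the function names and the row-per-example array layout are chosen here for illustration and are not from the paper) forms the induced classifier φ(x) = argmax_j g_j(x), maps raw scores into the sum-to-zero set G by subtracting the per-example mean, and evaluates the empirical error ε on a toy sample.

```python
import numpy as np

def project_to_G(scores):
    """Map raw scores g(x) onto the sum-to-zero set G by removing the
    per-example mean; this leaves argmax_j g_j(x) unchanged."""
    return scores - scores.mean(axis=1, keepdims=True)

def induced_classifier(scores):
    """phi(x) = argmax_j g_j(x); classes are labeled 1, ..., m."""
    return np.argmax(scores, axis=1) + 1

def empirical_error(scores, labels):
    """epsilon = (1/n) * sum_i I{phi(x_i) != c_i}."""
    return np.mean(induced_classifier(scores) != labels)

# Toy example with n = 4 points and m = 3 classes.
g = project_to_G(np.array([[ 2.0,  0.5, -1.0],
                           [ 0.1,  1.2,  0.3],
                           [-0.4, -0.2,  0.9],
                           [ 1.5,  1.0,  0.2]]))
c = np.array([1, 2, 3, 2])
print(induced_classifier(g))    # [1 2 3 1]
print(empirical_error(g, c))    # 0.25 -- only the last point is misclassified
```

Subtracting the per-example mean is the orthogonal projection onto the hyperplane Σ_j g_j(x) = 0, which is why it removes the additive redundancy without changing the induced classifier.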
Since a margin vector g induces a classifier, we explore the minimization of ε w.r.t. g. However, this minimization problem is intractable because I_{φ(x)≠c} is the 0–1 function. A wide variety of margin-based classifiers can instead be understood as minimizers of a surrogate loss function ψ_c(g(x)) which upper bounds the 0–1 loss I_{φ(x)≠c}; that is, various tractable surrogate loss functions ψ_c(g(x)) are used to approximate I_{φ(x)≠c} from above. The corresponding empirical risk function is given by

\[
\hat{R}(g) = \frac{1}{n} \sum_{i=1}^{n} \psi_{c_i}\big(g(x_i)\big).
\]

If α is a positive constant that does not depend on (x, c), then argmin_{g ∈ G} (1/α) R̂(g) is equivalent to argmin_{g ∈ G} R̂(g). We thus present the following definition.

Definition 1. A surrogate loss ψ_c(g(x)) is said to be a majorization of I_{φ(x)≠c} w.r.t. (x, c) if ψ_c(g(x)) ≥ α I_{φ(x)≠c}, where α is a positive constant that does not depend on (x, c).

In practice, convex majorization functions play an important role in the development of classification algorithms. On the one hand, convexity makes the resulting optimization problems computationally tractable. On the other hand, the resulting classification methods usually have better statistical properties.
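To make Definition 1 concrete with a standard example (this is not the coherence function studied in the paper, only a familiar majorization), consider the multiclass logit loss ψ_c(g(x)) = −log( exp(g_c(x)) / Σ_j exp(g_j(x)) ). Whenever φ(x) ≠ c there is some j ≠ c with g_j(x) ≥ g_c(x), so the class-c softmax probability is at most 1/2 and hence ψ_c(g(x)) ≥ (log 2) I_{φ(x)≠c}, i.e. the logit loss is a majorization with α = log 2. The sketch below (NumPy; the helper names are mine) evaluates this surrogate, computes the empirical risk R̂(g), and checks the majorization inequality numerically.

```python
import numpy as np

def logit_loss(scores, labels):
    """psi_c(g(x)) = -log( exp(g_c(x)) / sum_j exp(g_j(x)) ), computed stably."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels - 1]   # labels in {1, ..., m}

def empirical_risk(scores, labels):
    """R_hat(g) = (1/n) * sum_i psi_{c_i}(g(x_i))."""
    return logit_loss(scores, labels).mean()

rng = np.random.default_rng(0)
g = rng.normal(size=(50, 3))
g -= g.mean(axis=1, keepdims=True)            # keep g in the sum-to-zero set G
c = rng.integers(1, 4, size=50)               # random labels in {1, 2, 3}

zero_one = (np.argmax(g, axis=1) + 1 != c).astype(float)
assert np.all(logit_loss(g, c) >= np.log(2.0) * zero_one)   # majorization with alpha = log 2
print(empirical_risk(g, c))
```

The same template applies to any other majorization function: only logit_loss would change, while the empirical risk and the check against the 0–1 loss stay the same.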
Given a majorization function ψ_c(g(x)), the classifier resulting from the minimization of R̂(g) w.r.t. the margin vector g is called a large margin classifier or a margin-based classification method. In the binary classification setting, a wide variety of classifiers can be understood as minimizers of a majorization loss function of the 0–1 loss. If such functions satisfy