正在加载图片...
Coherence Functions for Multicategory Margin-based Classification Methods We refer to the functions as coherence functions due to Pi(x)<Pi(x),we have gi(x)<gi(x).Furthermore, their statistical mechanical properties similar to those after having obtained g(x),Pe(x)is given by of deterministic annealing (Rose et al.,1990).Note that the coherence function is also a majorization of the multicategory hinge loss H(g(x),c)in (3). P(x)= 21exp1+x+.6- ∑”1∑1ep++gx-- (8) When T=1,we have C(g(x),c)=log 1+>exp(1+gj(x)-9c(x), Moreover,we have the following properties. j≠c Theorem 2 Let H(g(x),c),B(x)and C(g(x),c)be which is just an upper bound of the negative multino- defined by (3),(5)and(6),respectively.Then, mial log-likelihood function C(g(x),c)in (2) S(g(x),c)≤C(g(x),c)-Tlog m≤H(g(x),c): In the binary case,i.e.m =2,we let gi(x)=-g2(x)= f(x)and encode y 1 if c 1 and y=-1 if c=2. where We can thus express the coherence function as cofe》=Tsl+eml- sgx.9=∑1+9x)-9e(x). ,T>0. m (7) j≠c Figure 1 depicts the coherence function(T=1)and Importantly,when treating g(x)fixed and considering other common loss functions for m=2. (x)and C(g(x),c)as functions of T,we have Theorem 3 Under the conditions in Theorem 2,for a fired g(x)we have 0-109s ---Coherence -i-Logt (i)limr_oc(g(x),c)-Tlogm=S(g(x),c)and ......Exponential -Hinge 黑刻=片m=1m (ii)limr_oC(g(x),c)=H(g(x),c)and x)={0 j=argmaxi{gi(x)+1-I=c} otherwise. -1.5 -0.5 0 0.5 1 1.5 y f(x) It is worth noting that Theorem 3-(ii)shows that at T=0,C(g(x),c)reduces to the multicategory hinge Figure 1:A variety of loss functions,which are re- loss H(g(x),c),which is used by Crammer and Singer garded as a function of yf(x).Here T=1 in the co- (2001). herence loss.Logit loss:logexp(yf())];E- As an immediate corollary of Theorems 2 and 3 in the ponential loss:exp(-yf(x));Hinge loss:[1-yf(x)]+ binary case (m =2),we have where[u+=uifu≥0and[叫+=0 otherwise. Corollary 1 Let C(yf(x))be defined by (7).Then 3.2 Properties ()(1-yf(x)+≤C(yf(x)≤Tlog2+[1-yf(x1+: The following theorem shows that the coherence func- (ii)limr-oC(yf(x))=[1-yf(x)]+; tion is Fisher consistent. (iii)(1-yf(x))<C(yf(x))-Tlog2; Theorem 1 Assume Pe(x)>0 for c =1,...,m. (iv)limr_ooC(yf(x))-Tlog2=(1-yf(x)) Consider the optimization problem ∑T1og1+∑ep+9x-g.1P. Graphs of C(yf(x))with different values of T are argmax shown in Figure 2.We can see that C(yf(x))with T= g(x)Eg ≠c 0.01 is almost the same as the hinge loss [1-yf(x)]+. for a fired T>0 and let g(x)=((x),....m(x))T Wang et al.(2005)derived an annealed discriminant be its solution.Then g(x)is unique.Moreover,if analysis algorithm in which the loss function isCoherence Functions for Multicategory Margin-based Classification Methods We refer to the functions as coherence functions due to their statistical mechanical properties similar to those of deterministic annealing (Rose et al., 1990). Note that the coherence function is also a majorization of the multicategory hinge loss H(g(x), c) in (3). When T = 1, we have C(g(x), c) = log h 1+X j6=c exp ￾ 1+gj (x)−gc(x)  i , which is just an upper bound of the negative multino￾mial log-likelihood function L(g(x), c) in (2). In the binary case, i.e. m = 2, we let g1(x) = −g2(x) = 1 2 f(x) and encode y = 1 if c = 1 and y = −1 if c = 2. We can thus express the coherence function as C(yf(x)) = T log h 1+ exp 1−yf(x) T i , T > 0. (7) Figure 1 depicts the coherence function (T = 1) and other common loss functions for m = 2. −2 −1.5 −1 −0.5 0 0.5 1 1.5 2 −1 0 1 2 3 4 5 6 7 8 y f(x) Loss 0−1 loss Coherence Logit Exponential Hinge Figure 1: A variety of loss functions, which are re￾garded as a function of yf(x). Here T = 1 in the co￾herence loss. Logit loss: 1 log 2 log[1+ exp(−yf(x))]; Ex￾ponential loss: exp(−yf(x)); Hinge loss: [1 − yf(x)]+ where [u]+ = u if u ≥ 0 and [u]+ = 0 otherwise. 3.2 Properties The following theorem shows that the coherence func￾tion is Fisher consistent. Theorem 1 Assume Pc(x) > 0 for c = 1, . . . , m. Consider the optimization problem argmax g(x)∈G Xm c=1 T log h 1+X j6=c exp 1+gj (x)−gc(x) T i Pc(x) for a fixed T > 0 and let gˆ(x) = (ˆg1(x), . . . , gˆm(x))T be its solution. Then gˆ(x) is unique. Moreover, if Pi(x) < Pj (x), we have gˆi(x) < gˆj (x). Furthermore, after having obtained gˆ(x), Pc(x) is given by Pc(x) = Pm l=1 exp 1+ˆgl(x)+ˆgc(x)−I[l=c] T Pm j=1 Pm l=1 exp 1+ˆgl(x)+ˆgj (x)−I[l=j] T . (8) Moreover, we have the following properties. Theorem 2 Let H(g(x), c), β c j (x) and C(g(x), c) be defined by (3), (5) and (6), respectively. Then, S(g(x), c) ≤ C(g(x), c) − T log m ≤ H(g(x), c), where S(g(x), c) = 1 m X j6=c ￾ 1 + gj (x) − gc(x)  . Importantly, when treating g(x) fixed and considering β c j (x) and C(g(x), c) as functions of T, we have Theorem 3 Under the conditions in Theorem 2, for a fixed g(x) we have (i) limT→∞ C(g(x), c) − T log m = S(g(x), c) and lim T→∞ β c j (x) = 1 m for j = 1, . . . , m (ii) limT→0 C(g(x), c) = H(g(x), c) and lim T→0 β c j (x) =  1 j = argmaxl{gl(x)+1−I[l=c]} 0 otherwise. It is worth noting that Theorem 3-(ii) shows that at T = 0, C(g(x), c) reduces to the multicategory hinge loss H(g(x), c), which is used by Crammer and Singer (2001). As an immediate corollary of Theorems 2 and 3 in the binary case (m = 2), we have Corollary 1 Let C(yf(x)) be defined by (7). Then (i) (1−yf(x))+ ≤ C(yf(x)) ≤ T log 2+[1−yf(x)]+; (ii) limT→0 C(yf(x)) = [1 − yf(x)]+; (iii) 1 2 (1−yf(x)) ≤ C(yf(x)) − T log 2; (iv) limT→∞ C(yf(x)) − T log 2 = 1 2 (1−yf(x)). Graphs of C(yf(x)) with different values of T are shown in Figure 2. We can see that C(yf(x)) with T = 0.01 is almost the same as the hinge loss [1 − yf(x)]+. Wang et al. (2005) derived an annealed discriminant analysis algorithm in which the loss function is
<<向上翻页向下翻页>>
©2008-现在 cucdc.com 高等教育资讯网 版权所有