where $0 \le \tilde{\beta}_i \le C$. Denote $\Omega = \sum_{i=1}^{n} \tilde{\beta}_i y_i X_i$. We see that $\Omega$ is a combination of those $X_i$ associated with nonzero $\tilde{\beta}_i$, while $\tilde{W}$ is the SVT of $\Omega$. In fact, we will see from Eqn. (12) in Theorem 5 that $\tilde{W}$ is indeed a linear combination of a set of support matrices $\{X_i\}$.

Lemma 1. The difference between $[\Omega]_{k_1 l_1}$ and $[\Omega]_{k_2 l_2}$ satisfies the following inequality:
$$\big([\Omega]_{k_1 l_1} - [\Omega]_{k_2 l_2}\big)^2 \le 2nC^2 \big(1 - \mathbf{f}_{k_1 l_1}^T \mathbf{f}_{k_2 l_2}\big).$$

Recall that $\|\mathbf{f}_{k_1 l_1}\| = 1$ and $\|\mathbf{f}_{k_2 l_2}\| = 1$, so $\mathbf{f}_{k_1 l_1}^T \mathbf{f}_{k_2 l_2} \in [-1, 1]$ is the correlation coefficient of the input features at positions $(k_1, l_1)$ and $(k_2, l_2)$ over the $n$ training samples. Lemma 1 says that $\Omega$ has an element-wise grouping effect: the higher (lower) the correlation between two elements of $\Omega$, the smaller (larger) their difference. Based on this lemma, we also have the following theorem.

Theorem 2. Let $[\tilde{W}]_{:,l}$ be the $l$th column of $\tilde{W}$. Then
$$\big\| [\tilde{W}]_{:,l_1} - [\tilde{W}]_{:,l_2} \big\|^2 \le 2nC^2 \Big( p - \sum_{k=1}^{p} \mathbf{f}_{k l_1}^T \mathbf{f}_{k l_2} \Big).$$
In particular, if $\mathbf{f}_{k l_1} = \mathbf{f}_{k l_2}$ for every $k = 1, \ldots, p$, then $[\tilde{W}]_{:,l_1} = [\tilde{W}]_{:,l_2}$.

Theorem 2 reflects the relationship between $\tilde{W}$ and the training input matrices $X_i$. Interestingly, the columns of the regression matrix $\tilde{W}$ exhibit a grouping effect in our model when the corresponding features are strongly correlated. A similar conclusion applies to the rows of $\tilde{W}$. Note that Theorem 2 cannot be extended to the element-wise case, because even if $\mathbf{f}_{k_1 l_1} = \mathbf{f}_{k_2 l_2}$, we cannot conclude that $[\tilde{W}]_{k_1 l_1} = [\tilde{W}]_{k_2 l_2}$. We present an empirical illustration of the grouping effect in Section 5.1.

4.2. Learning Algorithm

The Nesterov method for R-GLM (Zhou & Li, 2014) requires the derivative of the loss function in question to be Lipschitz continuous. However, neither the hinge loss nor the nuclear norm is smooth, so it is hard to develop a Nesterov method for finding the SMM solution. Since the objective function of SMM is convex in both $W$ and $b$, we instead derive a learning algorithm based on ADMM with the restart rule (Goldstein et al., 2012). The problem in (6) can be equivalently written as follows:
$$\operatorname*{argmin}_{W, b, S} \; H(W, b) + G(S), \qquad (7)$$
$$\text{s.t.} \quad S - W = 0,$$
where
$$H(W, b) = \frac{1}{2} \operatorname{tr}(W^T W) + C \sum_{i=1}^{n} \big\{ 1 - y_i [\operatorname{tr}(W^T X_i) + b] \big\}_+$$
and $G(S) = \tau \|S\|_*$. ADMM solves (7) by using the augmented Lagrangian function
$$L_1(W, b, S, \Lambda) = H(W, b) + G(S) + \operatorname{tr}[\Lambda^T (S - W)] + \frac{\rho}{2} \|S - W\|_F^2,$$
where $\rho > 0$ is a hyperparameter.

The ADMM learning procedure for our SMM is summarized in Algorithm 1. The key steps of Algorithm 1 are the computations of $S^{(k)}$ and $(W^{(k)}, b^{(k)})$, whose derivations are based on Theorems 3 and 4 below.

Theorem 3. For positive numbers $\tau$ and $\rho$, let the matrix $\rho W - \Lambda$ have an SVD of the form
$$\rho W - \Lambda = U_0 \Sigma_0 V_0^T + U_1 \Sigma_1 V_1^T, \qquad (8)$$
where $\Sigma_0$ is the diagonal matrix whose diagonal entries are greater than $\tau$, $U_0$ and $V_0$ are the matrices of the corresponding left and right singular vectors, and $\Sigma_1$, $U_1$ and $V_1$ correspond to the remaining part of the SVD, whose singular values are less than or equal to $\tau$. Define
$$S^* \triangleq \frac{1}{\rho} \mathcal{D}_\tau(\rho W - \Lambda) = \frac{1}{\rho} U_0 (\Sigma_0 - \tau I) V_0^T. \qquad (9)$$
Then we have $0 \in \partial\, G_1(S^*)$, where $G_1(S) = G(S) + \operatorname{tr}(\Lambda^T S) + \frac{\rho}{2} \|W - S\|_F^2$.

Since $G_1$ is convex with respect to $S$, we obtain the update equation of $S$ from Theorem 3. That is,
$$S^{(k+1)} = \operatorname*{argmin}_{S} \; G_1\big(W^{(k)}, S, \hat{\Lambda}^{(k)}\big) = \frac{1}{\rho} \mathcal{D}_\tau\big(\rho W^{(k)} - \hat{\Lambda}^{(k)}\big).$$
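To make the $S$-update above concrete, the following is a minimal NumPy sketch (not taken from the paper; the function names `svt` and `update_S` are ours) of the singular value thresholding operator $\mathcal{D}_\tau$ from Eq. (9) and the resulting update $S^{(k+1)} = \frac{1}{\rho}\mathcal{D}_\tau(\rho W^{(k)} - \hat{\Lambda}^{(k)})$.

```python
import numpy as np

def svt(A, tau):
    """Singular value thresholding D_tau(A) = U0 (Sigma0 - tau I) V0^T.

    Equivalently: shrink every singular value by tau and drop those that
    fall to zero, matching the split into (U0, Sigma0, V0) and
    (U1, Sigma1, V1) in Theorem 3 / Eq. (8).
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)   # keep sigma_j - tau when sigma_j > tau, else 0
    return (U * s_shrunk) @ Vt

def update_S(W, Lam, rho, tau):
    """S-update of Eq. (9): S^{(k+1)} = (1/rho) * D_tau(rho * W - Lam)."""
    return svt(rho * W - Lam, tau) / rho
```

Since every singular value at most $\tau$ is zeroed out, the update keeps $S^{(k+1)}$ low rank, which is precisely the effect the nuclear-norm term $G(S) = \tau\|S\|_*$ is designed to induce.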
Theorem 4. One solution of the problem
$$\operatorname*{argmin}_{(W, b)} \; H(W, b) - \operatorname{tr}(\Lambda^T W) + \frac{\rho}{2} \|W - S\|_F^2$$
is given by
$$W^* = \frac{1}{\rho + 1} \Big( \Lambda + \rho S + \sum_{i=1}^{n} \alpha_i^* y_i X_i \Big), \qquad (10)$$
$$b^* = \frac{1}{|S^*|} \sum_{i \in S^*} \big\{ y_i - \operatorname{tr}[(W^*)^T X_i] \big\},$$
where $S^* = \{ i : 0 < \alpha_i^* < C \}$, and $\alpha^* \in \mathbb{R}^n$ is the solution of the following box-constrained quadratic programming problem:
$$\operatorname*{argmax}_{\alpha} \; -\frac{1}{2} \alpha^T K \alpha + q^T \alpha, \qquad (11)$$
$$\text{s.t.} \quad 0 \le \alpha \le C \mathbf{1}_n, \quad \sum_{i=1}^{n} \alpha_i y_i = 0.$$
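The $(W, b)$-update of Theorem 4 can be sketched in the same spirit. The snippet below is an illustration under assumptions rather than the paper's Algorithm 1: it solves the box-constrained QP (11) with a generic SciPy solver (a dedicated QP solver would normally be used), taking $K$ and $q$ as given inputs since their definitions are not reproduced on this page, and then assembles $W^*$ and $b^*$ according to Eq. (10). The helper names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def solve_box_qp(K, q, y, C):
    """Solve (11): max_a  -0.5 a^T K a + q^T a
       s.t.  0 <= a_i <= C  and  sum_i a_i y_i = 0.

    Posed as minimization of the negated objective; K is assumed symmetric.
    SLSQP is used purely for illustration.
    """
    n = len(q)
    obj = lambda a: 0.5 * a @ K @ a - q @ a
    grad = lambda a: K @ a - q
    cons = [{"type": "eq", "fun": lambda a: a @ y, "jac": lambda a: y}]
    bounds = [(0.0, C)] * n
    res = minimize(obj, np.zeros(n), jac=grad, bounds=bounds,
                   constraints=cons, method="SLSQP")
    return res.x

def update_W_b(K, q, X, y, Lam, S, rho, C, tol=1e-8):
    """(W, b)-update of Eq. (10), given the current S and Lambda.
    X has shape (n, p, q_cols); y has shape (n,)."""
    alpha = solve_box_qp(K, q, y, C)
    # W* = (Lambda + rho * S + sum_i alpha_i y_i X_i) / (rho + 1)
    W = (Lam + rho * S + np.einsum("i,ijk->jk", alpha * y, X)) / (rho + 1.0)
    support = np.where((alpha > tol) & (alpha < C - tol))[0]   # S* = {i : 0 < alpha_i < C}
    b = float(np.mean([y[i] - np.trace(W.T @ X[i]) for i in support]))
    return W, b
```

In the full Algorithm 1, this step alternates with the $S$-update above and, as in standard ADMM, an update of the multiplier $\Lambda$, together with the restart rule of Goldstein et al.; those details are not reproduced on this page.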