where $[1-u]_+$ is called the hinge loss function, $\mathbf{w} \in \mathbb{R}^{pq}$ is a vector of regression coefficients, $b \in \mathbb{R}$ is an offset term, and $C$ is a regularization parameter.

When $X_i$ is reshaped into the vector $\mathrm{vec}(X_i^T)$, the correlation among the columns or rows of the matrix is ignored. However, it would be more reasonable to exploit this correlation information in developing a classifier, because the correlation is helpful in improving the classification performance.

Intuitively, we consider the following formulation:
\[
\min_{W,b} \; \frac{1}{2}\mathrm{tr}(W^T W) + C \sum_{i=1}^{n} \big\{ 1 - y_i [\mathrm{tr}(W^T X_i) + b] \big\}_+ , \tag{3}
\]
where $W \in \mathbb{R}^{p \times q}$ is the matrix of regression coefficients. However, this formulation is essentially equivalent to Problem (2) when $\mathbf{w} = \mathrm{vec}(W^T)$, because $\mathrm{tr}(W^T X_i) = \mathrm{vec}(W^T)^T \mathrm{vec}(X_i^T) = \mathbf{w}^T \mathbf{x}_i$ and $\mathrm{tr}(W^T W) = \mathrm{vec}(W^T)^T \mathrm{vec}(W^T) = \mathbf{w}^T \mathbf{w}$. This implies that the formulation in (3) cannot directly address our concern.
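The trace/vec identities used in this argument (and again below when we note that SMM degenerates to the linear SVM at $\tau = 0$) are easy to check numerically. The following NumPy snippet is a minimal sketch; the dimensions and random data are arbitrary and only serve as an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 4, 6
W = rng.standard_normal((p, q))   # matrix of regression coefficients
X = rng.standard_normal((p, q))   # one input matrix X_i

# vec(.) stacks the columns of its argument
w = W.T.ravel(order="F")          # vec(W^T)
x = X.T.ravel(order="F")          # vec(X^T)

# tr(W^T X) = vec(W^T)^T vec(X^T) = w^T x
assert np.isclose(np.trace(W.T @ X), w @ x)
# tr(W^T W) = vec(W^T)^T vec(W^T) = w^T w
assert np.isclose(np.trace(W.T @ W), w @ w)
```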
To capture the correlation, a natural approach is to consider the dependency of the regression matrix $W$. In particular, one can impose a low-rank constraint on $W$ to leverage the structure information within $X_i$. For example, in the bilinear SVM (Pirsiavash et al., 2009), the authors assumed that $W = W_y W_x^T$, where $W_x \in \mathbb{R}^{q \times d}$, $W_y \in \mathbb{R}^{p \times d}$, and $d < \min(p, q)$. Accordingly, they defined the following problem without the offset term:
\[
\operatorname*{argmin}_{W_x, W_y} \; \frac{1}{2}\mathrm{tr}(W_x W_y^T W_y W_x^T) + C \sum_{i=1}^{n} \big[ 1 - y_i \mathrm{tr}(W_y^T X_i W_x) \big]_+ . \tag{4}
\]
However, the resulting problem is nonconvex in both $W_x$ and $W_y$. Thus, the authors resorted to an alternating iterative update scheme for $W_x$ and $W_y$; that is, they updated either of $W_x$ and $W_y$ while keeping the other fixed.

Since the dependency of $W$ can be revealed by its rank $\mathrm{rank}(W)$, it is also natural to impose a rank constraint on $W$ directly. However, the resulting matrix rank minimization is usually NP-hard (Vandenberghe & Boyd, 1996). Zhou & Li (2014) suggested a function of the singular values of $W$ as an alternative penalization technique. Based on this idea, they proposed a regularized GLM (R-GLM):
\[
\operatorname*{argmin}_{W} \; J(W) + P(W), \tag{5}
\]
where $J(W)$ is a loss function obtained from the negative log-likelihood and $P(W)$ is a penalty function defined on the singular values of $W$. Typically, $P(W) = \lambda \|W\|_*$ for $\lambda > 0$, because the nuclear norm $\|W\|_*$ is the best convex approximation of $\mathrm{rank}(W)$ over the unit ball of matrices. Assuming that the loss function $J$ is smooth and its derivative is Lipschitz continuous, Zhou & Li (2014) devised the Nesterov method (Nesterov, 1983) to solve (5).

4. The Support Matrix Machine

It is well known that the hinge loss enjoys the large margin principle. Moreover, it embodies sparseness and robustness, which are two desirable properties for a good classifier. This motivates us to employ the hinge loss function in (5) instead. In particular, we present the following formulation:
\[
\operatorname*{argmin}_{W, b} \; \frac{1}{2}\mathrm{tr}(W^T W) + \tau \|W\|_* + C \sum_{i=1}^{n} \big\{ 1 - y_i [\mathrm{tr}(W^T X_i) + b] \big\}_+ , \tag{6}
\]
which defines a matrix classification model that we call the support matrix machine (SMM). Recall that the hinge loss is not smooth, so our model is not a trivial variant of the regularized GLM. On the other hand, SMM is based on a penalty function that combines the squared Frobenius norm $\|W\|_F^2$ and the nuclear norm $\|W\|_*$. We call this penalty the spectral elastic net, because $\mathrm{tr}(W^T W) = \|W\|_F^2 = \sum_{i=1}^{\min(p,q)} \sigma_i^2(W)$ and $\|W\|_* = \sum_{i=1}^{\min(p,q)} \sigma_i(W)$. As we see, the spectral elastic net is parallel to the elastic net of Zou & Hastie (2005). Again recall that $\mathrm{tr}(W^T W) = \mathrm{vec}(W^T)^T \mathrm{vec}(W^T)$ and $\mathrm{tr}(W^T X_i) = \mathrm{vec}(W^T)^T \mathrm{vec}(X_i^T)$, so SMM degenerates to the conventional linear SVM when $\tau = 0$. However, the nuclear norm cannot be equivalently defined as a vector norm. This implies that we cannot formulate Problem (6) in an equivalent vector form. Thus, our SMM is able to capture the correlation within the input data matrix.
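To make the pieces of (6) concrete, the following sketch evaluates the SMM objective at a candidate pair $(W, b)$, computing the spectral elastic net penalty from the singular values of $W$ and the hinge loss directly on the matrix-valued inputs. The function name and the $(n, p, q)$ array layout are illustrative choices, not part of the paper.

```python
import numpy as np

def smm_objective(W, b, Xs, ys, C=1.0, tau=1.0):
    """Value of the SMM objective (6) at a candidate (W, b).

    W  : (p, q) regression matrix
    Xs : (n, p, q) stack of input matrices X_i
    ys : (n,) labels in {-1, +1}
    """
    sigma = np.linalg.svd(W, compute_uv=False)
    # spectral elastic net: 1/2 ||W||_F^2 + tau ||W||_*,
    # i.e. 1/2 sum_i sigma_i^2 + tau sum_i sigma_i
    penalty = 0.5 * np.sum(sigma**2) + tau * np.sum(sigma)
    # hinge loss with matrix inputs: {1 - y_i [tr(W^T X_i) + b]}_+
    margins = ys * (np.einsum("pq,npq->n", W, Xs) + b)
    hinge = np.maximum(0.0, 1.0 - margins).sum()
    return penalty + C * hinge
```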
4.1. Theoretical Justification

We now show that SMM enjoys some elegant properties of the conventional SVM (Cortes & Vapnik, 1995) as well as of the conventional elastic net (Zou & Hastie, 2005). Without loss of generality, we suppose that each feature of the training data is normalized to have unit length; that is, $\|\mathbf{f}_{kl}\| = 1$, where $\mathbf{f}_{kl} \triangleq ([X_1]_{kl}, \ldots, [X_n]_{kl})^T$ for $k = 1, \ldots, p$ and $l = 1, \ldots, q$.

Theorem 1. Suppose the minimizer of Problem (6) is $(\tilde{W}, \tilde{b})$. Then
\[
\tilde{W} = D_\tau \Big( \sum_{i=1}^{n} \tilde{\beta}_i y_i X_i \Big)
\]
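The operator $D_\tau$ is not defined in this excerpt; we read it as the standard singular value thresholding (SVT) operator, $D_\tau(A) = U\,\mathrm{diag}\big((\sigma_i - \tau)_+\big)V^T$ for the SVD $A = U\,\mathrm{diag}(\sigma_1, \ldots)\,V^T$, which is the usual meaning in nuclear-norm regularized problems. Under that assumption, a minimal sketch is:

```python
import numpy as np

def svt(A, tau):
    """Singular value thresholding: shrink each singular value by tau, clipping at zero."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # scale the columns of U by the thresholded singular values, then recompose
    return (U * np.maximum(s - tau, 0.0)) @ Vt

# Hypothetical use mirroring Theorem 1: with dual-style coefficients beta_i
# (not derived in this excerpt), the minimizer would be recovered as
#   W_tilde = svt(sum(beta[i] * y[i] * X[i] for i in range(n)), tau)
```

Because small singular values are driven exactly to zero by this shrinkage, the resulting $\tilde{W}$ tends to be low rank, which is the structural effect the nuclear-norm term in (6) is meant to induce.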