Support Matrix Machines LuoLuo RICKY @SJTU.EDU.CN Yubo Xie YUBOTSE@SJTU.EDU.CN Department of Computer Science and Engineering.Shanghai Jiao Tong University,Shanghai.China Zhihua Zhang ZHIHUA @SJTU.EDU.CN Institute of Data Science,Department of Computer Science and Engineering,Shanghai Jiao Tong University,China Wu-Jun Li LIWUJUN@NJU.EDU.CN National Key Laboratory for Novel Software Technology,Collaborative Innovation Center of Novel Software Technology and Industrialization,Department of Computer Science and Technology,Nanjing University,China Abstract the case that input samples are represented as vectors or s- In many classification problems such as elec- calars.However,it is also often met that input samples are troencephalogram(EEG)classification and im- naturally represented as two-dimensional matrices or ten- age classification,the input features are naturally sors.When using classical classification methods to data of represented as matrices rather than vectors or s- matrix representation,we have to reshape the input matri- calars.In general,the structure information of ces into vectors.However,this would destroy the structure the original feature matrix is useful and informa- information of the data matrix,e.g.,the correlation of dif- tive for data analysis tasks such as classification. ferent channels of electroencephalogram(EEG)data(Zhou One typical structure information is the correla- Li,2014)and the spatial relationship of the nearby pixels tion between columns or rows in the feature ma- of image data (Wolf et al.,2007).Moreover,if the data ma- trix.To leverage this kind of structure informa- trix is stacked (reshaped)into a vector,the dimensionality tion,we propose a new classification method that of the resulting vector typically becomes very high,which we call support matrix machine(SMM).Specif- in turn leads to the curse of dimensionality. ically,SMM is defined as a hinge loss plus a There has been some work on classification methods which so-called spectral elastic net penalty which is a attempt to exploit the correlation between the columns or spectral extension of the conventional elastic net rows of the data matrix.Usually,such a classification over a matrix.The spectral elastic net enjoys a method introduces a matrix of regression coefficients to property of grouping effect,i.e.,strongly corre- leverage the correlation within the data matrix.For ex- lated columns or rows tend to be selected alto- ample,Wolf et al.(2007)proposed a rank-k SVM,which gether or not.Since the optimization problem models the regression matrix as the sum of k rank-one or- for SMM is convex,this encourages us to de- thogonal matrices.Pirsiavash et al.(2009)devised a bilin- vise an alternating direction method of multipli- ear SVM by factorizing the regression matrix into two low- ers(ADMM)algorithm for solving the problem. rank matrices.Cai et al.(2006)proposed a similar bilinear Experimental results on EEG and image classifi- framework called support tensor machines for text catego- cation data show that our model is more robust rization.These methods essentially take advantage of the and efficient than the state-of-the-art methods. low-rank assumption,which can be used for describing the correlation within a matrix.However,their treatments re- sult in non-convex optimization problems. 1.Introduction In this paper we are also concerned with the classification Classical classification methods such as support vector ma- problems on a set of data matrices.Our work is motivat- chines (SVMs)(Cortes Vapnik.1995)and logistic re- ed by the use of the nuclear norm (a.k.a.,the trace nor- gression (Hastie et al.,2001)have been originally built on m)in low-rank matrix approximation (Srebro Shraib- Proceedings of the 32n4 International Conference on Machine man,2005),matrix completion(Candes Recht,2009:Li- Learning,Lille,France,2015.JMLR:W&CP volume 37.Copy- u et al.,2013;Salakhutdinov Srebro,2010;Huang et al., right 2015 by the author(s). 2013),and multi-task learning problems(Pong et al.,2010;
Support Matrix Machines Luo Luo RICKY@SJTU.EDU.CN Yubo Xie YUBOTSE@SJTU.EDU.CN Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China Zhihua Zhang ZHIHUA@SJTU.EDU.CN Institute of Data Science, Department of Computer Science and Engineering, Shanghai Jiao Tong University, China Wu-Jun Li LIWUJUN@NJU.EDU.CN National Key Laboratory for Novel Software Technology, Collaborative Innovation Center of Novel Software Technology and Industrialization, Department of Computer Science and Technology, Nanjing University, China Abstract In many classification problems such as electroencephalogram (EEG) classification and image classification, the input features are naturally represented as matrices rather than vectors or scalars. In general, the structure information of the original feature matrix is useful and informative for data analysis tasks such as classification. One typical structure information is the correlation between columns or rows in the feature matrix. To leverage this kind of structure information, we propose a new classification method that we call support matrix machine (SMM). Specifically, SMM is defined as a hinge loss plus a so-called spectral elastic net penalty which is a spectral extension of the conventional elastic net over a matrix. The spectral elastic net enjoys a property of grouping effect, i.e., strongly correlated columns or rows tend to be selected altogether or not. Since the optimization problem for SMM is convex, this encourages us to devise an alternating direction method of multipliers (ADMM) algorithm for solving the problem. Experimental results on EEG and image classifi- cation data show that our model is more robust and efficient than the state-of-the-art methods. 1. Introduction Classical classification methods such as support vector machines (SVMs) (Cortes & Vapnik, 1995) and logistic regression (Hastie et al., 2001) have been originally built on Proceedings of the 32 nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s). the case that input samples are represented as vectors or scalars. However, it is also often met that input samples are naturally represented as two-dimensional matrices or tensors. When using classical classification methods to data of matrix representation, we have to reshape the input matrices into vectors. However, this would destroy the structure information of the data matrix, e.g., the correlation of different channels of electroencephalogram (EEG) data (Zhou & Li, 2014) and the spatial relationship of the nearby pixels of image data (Wolf et al., 2007). Moreover, if the data matrix is stacked (reshaped) into a vector, the dimensionality of the resulting vector typically becomes very high, which in turn leads to the curse of dimensionality. There has been some work on classification methods which attempt to exploit the correlation between the columns or rows of the data matrix. Usually, such a classification method introduces a matrix of regression coefficients to leverage the correlation within the data matrix. For example, Wolf et al. (2007) proposed a rank-k SVM, which models the regression matrix as the sum of k rank-one orthogonal matrices. Pirsiavash et al. (2009) devised a bilinear SVM by factorizing the regression matrix into two lowrank matrices. Cai et al. (2006) proposed a similar bilinear framework called support tensor machines for text categorization. These methods essentially take advantage of the low-rank assumption, which can be used for describing the correlation within a matrix. However, their treatments result in non-convex optimization problems. In this paper we are also concerned with the classification problems on a set of data matrices. Our work is motivated by the use of the nuclear norm (a.k.a., the trace norm) in low-rank matrix approximation (Srebro & Shraibman, 2005), matrix completion (Candes & Recht ` , 2009; Liu et al., 2013; Salakhutdinov & Srebro, 2010; Huang et al., 2013), and multi-task learning problems (Pong et al., 2010;
Support Matrix Machines Harchaoui et al..2012:Kang et al..2011).The cornerstone show that our proposed training algorithm for SMM is effi- of these methods is to use the nuclear norm of a matrix as cient. a convex alternative of the matrix rank.Since the nuclear The remainder of the paper is organized as follows.In Sec- norm is the best convex approximation of the matrix rank tion 2,we give the notation and preliminaries.In Section 3, over the unit ball of matrices.this makes it more tractable to solve the resulting optimization problem.Moreover,some we review our concerned problem.In Section 4 we present nice properties such as the consistency results of the nucle- our model and the learning algorithm.In Section 5,we con- duct experimental analysis to justify our methods.Finally. ar norm minimization have been studied by Bach(2008) we conclude our work in Section 6. There has been some work which applies the nuclear norm penalization with least square loss function to matrix re- gression problems(Hunyadi et al.,2012;Signoretto et al., 2.Notation and Preliminaries 2014).Recently,Zhou Li(2014)applied the nuclear nor- In this section we give the notation and preliminaries which m penalization to matrix regression problems based on gen- will be used in this paper.We let Ip denote the pxp identity eralized linear models (GLMs). matrix.For a vector a E RP,the Euclidean norm is denot- In this paper we propose a new model to address the matrix ed asa=V∑-1a.For a matrix A∈Rpxa of rank classification problem.Our model includes two principal r where r min(p,q),we let the condensed singular val- ingredients.First,we consider the hinge loss due to its ue decomposition (SVD)of A be A =UADAV where widely deployed ability in sparseness and robustness mod- UA∈Rpxr and VA∈Raxr satisfy UTUA=I,and eling.Second,we employ a spectral elastic net penalty for VVA=Ir,and∑A=diag(o1(A),…,o,(A)with the regression matrix.The spectral elastic net is a spec- o1(A)≥·≥o(A)>0.Obviously,the rank of A is tral extension of the conventional elastic net over a matrix. equal to the number of nonzero singular values of A.Ad- In parallel to the conventional elastic net (Zou Hastie, ditionally,.we let AlF=√∑jA号=V∑=1oa(AP 2005)which is the combination of the ridge penalty and lasso penalty,our spectral elastic net is the combination of be the Frobenius norm,IAll.==(A)be the nu- the squared Frobenius matrix norm and nuclear norm.We clear norm,and All2 =1(A)be the spectral norm. prove that the spectral elastic net enjoys the property of For any T 0,we let D[A]=UAS,[EA]V where grouping effect which is similar to the conventional elastic S[>]=diag([1(A)]+;..,[o(A)]+)and net,while keeping a low-rank representation.We show that [+max(2,0).In the literature (Candes Recht,2009; the regression matrix in our model is indeed a combination Cai et al.,2010),D [A]is called the singular value thresh- of a set of support matrices.We thus refer to our model as olding (SVT)operator. a support matrix machine (SMM). It is well known that the nuclear norm A,as a function The optimization problem for SMM is convex but the hinge from RPx4 to R,is not differentiable.Alternatively,one loss function is not smooth.Fortunately,we can resort to considers the subdifferential of All,which is the set of an alternating direction method of multipliers (ADMM)s- subgradients and denoted by lAl.It follows from the tudied in (Goldstein et al.,2012).Specifically,we develop literature (Candes Recht,2009;Lewis,2003;Watson, an iteration algorithm,which is mainly built on ADMM 1992)that for a p x g real matrix A of rank r, and a singular value thresholding(SVT)operator(Candes Recht.2009:Cai et al.,2010).The algorithm converges OAll.=UAVA+Z:ZE RPX9,UTZ =0. quickly and promises to get the global optimal solution.It is worth pointing out that the algorithm requires repeated- ZVA=0,IZl2≤1}() ly computing singular value decomposition(SVD)of ma- trices with the same size as the matrix of input features 3.Problem Formulation and Related Work However,when represented as a matrix,the size of an in- put matrix is usually not too large. In this paper we study a regularized matrix classifier.We are given a set of training samplesT=[i where Finally,we apply our SMM to EEG and image classifica- Xi E RPX9 is the ith input sample and yi E{-1,1}is its tion problems.We see that classification methods direct- corresponding class label.As we have seen,Xi is repre- ly working on data matrices outperform those on vectors sented in matrix form.To fit a classifier,a commonly used such as the conventional SVM.When data are contaminat- approach is to stack Xi into a vector.Let xi vec(X)= ed with non-Gaussian noise or outliers,our SMM has sig- ((XXhg[X21,[Xlg)T∈R四.The soft nificant improvements over baselines.This implies that S- margin SVM is defined as MM is robust and has potential applications in matrix clas- sification problems with noises.Moreover,the experiments 2w2w+C∑1-h(w7x+b+, (2) i=1
Support Matrix Machines Harchaoui et al., 2012; Kang et al., 2011). The cornerstone of these methods is to use the nuclear norm of a matrix as a convex alternative of the matrix rank. Since the nuclear norm is the best convex approximation of the matrix rank over the unit ball of matrices, this makes it more tractable to solve the resulting optimization problem. Moreover, some nice properties such as the consistency results of the nuclear norm minimization have been studied by Bach (2008). There has been some work which applies the nuclear norm penalization with least square loss function to matrix regression problems (Hunyadi et al., 2012; Signoretto et al., 2014). Recently, Zhou & Li (2014) applied the nuclear norm penalization to matrix regression problems based on generalized linear models (GLMs). In this paper we propose a new model to address the matrix classification problem. Our model includes two principal ingredients. First, we consider the hinge loss due to its widely deployed ability in sparseness and robustness modeling. Second, we employ a spectral elastic net penalty for the regression matrix. The spectral elastic net is a spectral extension of the conventional elastic net over a matrix. In parallel to the conventional elastic net (Zou & Hastie, 2005) which is the combination of the ridge penalty and lasso penalty, our spectral elastic net is the combination of the squared Frobenius matrix norm and nuclear norm. We prove that the spectral elastic net enjoys the property of grouping effect which is similar to the conventional elastic net, while keeping a low-rank representation. We show that the regression matrix in our model is indeed a combination of a set of support matrices. We thus refer to our model as a support matrix machine (SMM). The optimization problem for SMM is convex but the hinge loss function is not smooth. Fortunately, we can resort to an alternating direction method of multipliers (ADMM) studied in (Goldstein et al., 2012). Specifically, we develop an iteration algorithm, which is mainly built on ADMM and a singular value thresholding (SVT) operator (Candes` & Recht, 2009; Cai et al., 2010). The algorithm converges quickly and promises to get the global optimal solution. It is worth pointing out that the algorithm requires repeatedly computing singular value decomposition (SVD) of matrices with the same size as the matrix of input features. However, when represented as a matrix, the size of an input matrix is usually not too large. Finally, we apply our SMM to EEG and image classification problems. We see that classification methods directly working on data matrices outperform those on vectors such as the conventional SVM. When data are contaminated with non-Gaussian noise or outliers, our SMM has significant improvements over baselines. This implies that SMM is robust and has potential applications in matrix classification problems with noises. Moreover, the experiments show that our proposed training algorithm for SMM is effi- cient. The remainder of the paper is organized as follows. In Section 2, we give the notation and preliminaries. In Section 3, we review our concerned problem. In Section 4 we present our model and the learning algorithm. In Section 5, we conduct experimental analysis to justify our methods. Finally, we conclude our work in Section 6. 2. Notation and Preliminaries In this section we give the notation and preliminaries which will be used in this paper. We let Ip denote the p×p identity matrix. For a vector a ∈ R p , the Euclidean norm is denoted as ||a|| = pPp i=1 a 2 i . For a matrix A ∈ R p×q of rank r where r ≤ min(p, q), we let the condensed singular value decomposition (SVD) of A be A = UAΣAVT A where UA ∈ R p×r and VA ∈ R q×r satisfy UT AUA = Ir and VT AVA = Ir, and ΣA = diag(σ1(A), · · · , σr(A)) with σ1(A) ≥ · · · ≥ σr(A) > 0. Obviously, the rank of A is equal to the number of nonzero singular values of A. Additionally, we let kAkF = qP i,j A2 ij = pPr i=1 σi(A) 2 be the Frobenius norm, kAk∗ = Pr i=1 σi(A) be the nuclear norm, and kAk2 = σ1(A) be the spectral norm. For any τ > 0, we let Dτ [A] = UASτ [ΣA]VT A where Sτ [Σ] = diag([σ1(A) − τ ]+, · · · , [σr(A) − τ ]+) and [z]+ = max(z, 0). In the literature (Candes & Recht ` , 2009; Cai et al., 2010), Dτ [A] is called the singular value thresholding (SVT) operator. It is well known that the nuclear norm kAk∗, as a function from R p×q to R, is not differentiable. Alternatively, one considers the subdifferential of kAk∗, which is the set of subgradients and denoted by ∂kAk∗. It follows from the literature (Candes & Recht ` , 2009; Lewis, 2003; Watson, 1992) that for a p × q real matrix A of rank r, ∂kAk∗ = n UAVT A + Z : Z ∈ R p×q , UT AZ = 0, ZVA = 0, kZk2 ≤ 1 o . (1) 3. Problem Formulation and Related Work In this paper we study a regularized matrix classifier. We are given a set of training samples T = Xi , yi} n i=1, where Xi ∈ R p×q is the ith input sample and yi ∈ {−1, 1} is its corresponding class label. As we have seen, Xi is represented in matrix form. To fit a classifier, a commonly used approach is to stack Xi into a vector. Let xi , vec(XT i ) = ([Xi ]11, . . . , [Xi ] 1q , [Xi ] 21, . . . , [Xi ]pq) T ∈ R pq. The soft margin SVM is defined as min w,b 1 2 wT w + C Xn i=1 [1 − yi(wT xi + b)]+, (2)
Support Matrix Machines where [1-ul+is called the hinge loss function,w E RPa is the singular values of W.Typically,P(W)=Wl.for a vector of regression coefficients,b ER is an offset term, A>0 because the nuclear norm Wl,is the best convex and C is a regularization parameter. approximation of rank(W)over the unit ball of matrices. When reshaped into vector vec(XT),the correlation a- Assuming that the loss function.is smooth and its deriva- mong columns or rows in the matrix is ignored.Howev- tive has the Lipschitz continuity,Zhou Li (2014)devised er,it would be more reasonable to exploit the correlation the Nesterov method (Nesterov,1983)to solve (5). information in developing a classifier,because the corre- lation is helpful and useful in improving the classification 4.The Support Matrix Machine performance It is well known that the hinge loss enjoys the large margin Intuitively,we consider the following formulation: principle.Moreover,it embodies sparseness and robust- ness,which are two desirable properties for a good classifi- tr(wW)+C∑1-hr(wx,)+}+,(3) er.This thus motivates us to employ the hinge loss function in(5)instead.In particular,we present the following for- i✉1 mulation: where W Rpxa is the matrix of regression co- efficients.However.this formulation is essentially e- argmin w.b 2tr(w"w)+rllwll. quivalent to Problem (2)when w vec(WT),be- cause tr(WTXi)=vec(WT)Tvec(XT)=wTxi and tr(WTw)=vec(WT)Tvec(WT)=wTw.This im- +C {1-[tr(WrX)+}+,(6 plies that the formulation in(3)cannot directly address our concern. which defines a matrix classification model that we cal- To capture the correlation,a natural approach is to consider I the support matrix machine (SMM).Recall that the hinge the dependency of the regression matrix W.In particular, loss is not smooth.so our model is not a trivial variant of one can impose a low-rank constraint on W to leverage the regularized GLM.On the other hand,SMM is in fact the structure information within Xi.For example,in the based on a penalty function,which is the combination of bilinear SVM(Pirsiavash et al.,2009),the authors assumed the squared Frobenius norm W and the nuclear nor- that W=WWT where W∈Rqxa,Wy∈Rpxd and m Wll..We call this penalty the spectral elastic net d<min(p,q).Accordingly,they defined the following because tr(WTW)=W((W)and problem without the offset term Wll.-(W).As we see.the spectral elas- tic net is parallel to the elastic net of Zou Hastie(2005). argmin Again recall that tr(WTW)=vec(WT)Tvec(WT)and Wz.Wv 2tr(W:Wgw,w图) tr(WTXi)=vec(WT)Tvec(XT).so SMM degenerates CI1-tr(wTXW.】+· (4) to the conventional linear SVM when T=0.However,the i1 nuclear norm can not be equivalently defined as a vector norm.This implies that we cannot formulate Problem(6) However,the resulting problem is nonconvex in both W in an equivalent of the vector form.Thus,our SMM is able and W.Thus,the authors resorted to an alternately itera- to capture the correlation within the input data matrix. tive update scheme for W and Wy;that is,they updated either of Wr and Wy while keeping the other fixed. 4.1.Theoretical Justification Since the dependency of W can be revealed by its rank We now show that SMM possesses some elegant benefits rank(W),it is also natural to directly impose the rank con- from the conventional SVM(Cortes Vapnik,1995)as straint to W.However,the resulting matrix rank minimiza- well as the conventional elastic net (Zou Hastie,2005). tion is usually NP-hard (Vandenberghe Boyd,1996). Without loss of generality,we suppose that each feature of Zhou Li (2014)suggested a function of the singular val- the training data is normalized to have unit length.That is ues of W as an alternative penalization technique.Based it holds that lfl 1 where f ([Xilkt,...,[Xnlk)T on this idea,they proposed a regularized GLM(R-GLM): for k 1,...,p and l=1,...,q. Theorem 1.Suppose the minimizer of Problem (6)is argmin J(W)+P(W), (5) (W,b).Then W where /(W)is a loss function obtained from the negative log-likelihood and P(W)is a penalty function defined on
Support Matrix Machines where [1−u]+ is called the hinge loss function, w ∈ R pq is a vector of regression coefficients, b ∈ R is an offset term, and C is a regularization parameter. When reshaped into vector vec(XT i ), the correlation among columns or rows in the matrix is ignored. However, it would be more reasonable to exploit the correlation information in developing a classifier, because the correlation is helpful and useful in improving the classification performance. Intuitively, we consider the following formulation: min W,b 1 2 tr(WTW)+C Xn i=1 {1−yi [tr(WT Xi)+b]}+, (3) where W ∈ R p×q is the matrix of regression coefficients. However, this formulation is essentially equivalent to Problem (2) when w = vec(WT ), because tr(WT Xi) = vec(WT ) T vec(XT i ) = wT xi and tr(WTW) = vec(WT ) T vec(WT ) = wT w. This implies that the formulation in (3) cannot directly address our concern. To capture the correlation, a natural approach is to consider the dependency of the regression matrix W. In particular, one can impose a low-rank constraint on W to leverage the structure information within Xi . For example, in the bilinear SVM (Pirsiavash et al., 2009), the authors assumed that W = WyWT x where Wx ∈ R q×d , Wy ∈ R p×d and d 0 because the nuclear norm kWk∗ is the best convex approximation of rank(W) over the unit ball of matrices. Assuming that the loss function J is smooth and its derivative has the Lipschitz continuity, Zhou & Li (2014) devised the Nesterov method (Nesterov, 1983) to solve (5). 4. The Support Matrix Machine It is well known that the hinge loss enjoys the large margin principle. Moreover, it embodies sparseness and robustness, which are two desirable properties for a good classifi- er. This thus motivates us to employ the hinge loss function in (5) instead. In particular, we present the following formulation: argmin W,b 1 2 tr(WTW) + τ ||W||∗ +C Xn i=1 {1−yi [tr(WT Xi)+b]}+, (6) which defines a matrix classification model that we call the support matrix machine (SMM). Recall that the hinge loss is not smooth, so our model is not a trivial variant of the regularized GLM. On the other hand, SMM is in fact based on a penalty function, which is the combination of the squared Frobenius norm kWk 2 F and the nuclear norm kWk∗. We call this penalty the spectral elastic net because tr(WTW) = kWk 2 F = Pmin(p,q) i=1 σ 2 i (W) and kWk∗ = Pmin(q,p) i=1 σi(W). As we see, the spectral elastic net is parallel to the elastic net of Zou & Hastie (2005). Again recall that tr(WTW) = vec(WT ) T vec(WT ) and tr(WT Xi) = vec(WT ) T vec(XT i ), so SMM degenerates to the conventional linear SVM when τ = 0. However, the nuclear norm can not be equivalently defined as a vector norm. This implies that we cannot formulate Problem (6) in an equivalent of the vector form. Thus, our SMM is able to capture the correlation within the input data matrix. 4.1. Theoretical Justification We now show that SMM possesses some elegant benefits from the conventional SVM (Cortes & Vapnik, 1995) as well as the conventional elastic net (Zou & Hastie, 2005). Without loss of generality, we suppose that each feature of the training data is normalized to have unit length. That is, it holds that kfklk = 1 where fkl , ([X1]kl, . . . , [Xn]kl) T for k = 1, . . . , p and l = 1, . . . , q. Theorem 1. Suppose the minimizer of Problem (6) is (W˜ , ˜b). Then W˜ = Dτ Xn i=1 β˜ iyiXi
Support Matrix Machines whee0≤B:≤C. and G(S)=TlISIl.. Denote=∑是1aaXi.We see that is the combi-- ADMM solves (7)by using the augmented Lagrangian nation of those Xi associated with nonzero B,while W is function: the SVT of 1.In fact,we will see from Eqn.(12)in The- L1(W,b,S,A)=H(W,b)+G(S)+tr[AT(S-W)] orem 5 that W is indeed the linear combination of a set of support matrices {Xi}. +1s-wIe。 Lemma 1.The difference between [k and [Slkalz meets the following inequality where p>0 is a hyperparameter. (]kl4-k22)2≤2nC2(1-ff,f2l2): The ADMM learning procedure for our SMM is summa- rized in Algorithm 1.The key steps of Algorithm 1 are the Recall thatf=1 and fa=1,so f fkal computations of S(k)and (W(k),()).the derivation of [-1,1]is the correlation coefficient of input features at po- which is based on Theorems 3 and 4 below. sitions (1,1)and (k2,2)over the n training samples. Theorem 3.For positive numbers T and p,let the matrix Lemma 1 says that n has the element-wise grouping ef- pW-A have SVD of the form: fect.Specifically,higher(lower)correlation between two pW-A=Uo∑oV6+U11Vf, (8) elements of leads to smaller (larger)difference.Based on this lemma,we also have the following theorem. where So is the diagonal matrix whose diagonal entries Theorem 2.Let [W]:be the lth column of W.Then are greater than T,Uo and Vo are matrices of the corre- sponding left and right singular vectors:U and V lw:h-w:2≤2nc2(p-∑项,f) correspond the rest parts of the SVD whose singular values are less than or equal to T.Define Especially.if fk1=fkl2 for any k=l,,卫,then S*D-(pW-A)=Uo(Zo-rI)V6. W].=[W].2 0 Theorem 2 reflects the relationship of W with the train- Then we have 0 E 8 G1(S*).where G1(S)=G(S)+ ing input matrices Xi.Interestingly,the columns of the t(As)+Iw-s服. regression matrix W have grouping effect in our model if the corresponding features have strong correlation.The Since Gi is convex with respect to S,we obtain the update similar conclusion also applies to the rows of W.Note equation of S from Theorem 3.That is, that Theorem 2 can not be extended to the element-wise case,because even if ft=f,we cannot obtain that S(k+1)=argminG(W(),S,()=-D-(pW()-A()). [W=[W We will present an empirical illustra- tion for the grouping effect problem in Section 5.1. Theorem 4.One of the solution of the following problem 4.2.Learning Algorithm argmin H(W,b)-tr(ATW)+W-sl (W,b) The Nesterov method for R-GLM(Zhou Li,2014)re- is quires the derivative of the loss function in question to be Lipschitz-continuous.However,both the hinge loss and the w-(+inx) (10 nuclear norm are not smooth.Thus,it is hard to develop the =1 Nesterov method for finding the SMM solution.Since the objective function of SMM is convex in both W and b,we 网{-(wPx} here derive a learning algorithm based on ADMM with the restart rule(Goldstein et al.,2012).The problem in (6)can where S*=i:0(1-(WTX)+) ∑ah=0 =
Support Matrix Machines where 0 ≤ β˜ i ≤ C. Denote Ω = Pn i=1 β˜ iyiXi . We see that Ω is the combination of those Xi associated with nonzero β˜ i , while W˜ is the SVT of Ω. In fact, we will see from Eqn. (12) in Theorem 5 that W˜ is indeed the linear combination of a set of support matrices {Xi}. Lemma 1. The difference between [Ω]k1l1 and [Ω]k2l2 meets the following inequality ([Ω]k1l1 − [Ω]k2l2 ) 2 ≤ 2nC2 (1 − f T k1l1 fk2l2 ). Recall that kfk1l1 k = 1 and kfk2l2 k = 1, so f T k1l1 fk2l2 ∈ [−1, 1] is the correlation coefficient of input features at positions (k1, l1) and (k2, l2) over the n training samples. Lemma 1 says that Ω has the element-wise grouping effect. Specifically, higher (lower) correlation between two elements of Ω leads to smaller (larger) difference. Based on this lemma, we also have the following theorem. Theorem 2. Let [W˜ ]:,l be the lth column of W˜ . Then [W˜ ]:,l1 − [W˜ ]:,l2 2 ≤ 2nC2 p − Xp k=1 f T kl1 fkl2 . Especially, if fkl1 = fkl2 for any k = 1, . . . , p, then [W˜ ]:,l1 = [W˜ ]:,l2 . Theorem 2 reflects the relationship of W˜ with the training input matrices Xi . Interestingly, the columns of the regression matrix W˜ have grouping effect in our model if the corresponding features have strong correlation. The similar conclusion also applies to the rows of W˜ . Note that Theorem 2 can not be extended to the element-wise case, because even if fk1l1 = fk2l2 , we cannot obtain that [W˜ ]k1l1 = [W˜ ]k2l2 . We will present an empirical illustration for the grouping effect problem in Section 5.1. 4.2. Learning Algorithm The Nesterov method for R-GLM (Zhou & Li, 2014) requires the derivative of the loss function in question to be Lipschitz-continuous. However, both the hinge loss and the nuclear norm are not smooth. Thus, it is hard to develop the Nesterov method for finding the SMM solution. Since the objective function of SMM is convex in both W and b, we here derive a learning algorithm based on ADMM with the restart rule (Goldstein et al., 2012). The problem in (6) can be equivalently written as follows: argmin W,b,S H(W, b) + G(S), (7) s.t. S − W = 0, where H(W, b) = 1 2 tr(WTW) + C Xn i=1 {1−yi [tr(WT Xi)+b]}+ and G(S) = τ ||S||∗. ADMM solves (7) by using the augmented Lagrangian function: L1(W, b, S, Λ) =H(W, b) + G(S) + tr[Λ T (S − W)] + ρ 2 ||S − W||2 F , where ρ > 0 is a hyperparameter. The ADMM learning procedure for our SMM is summarized in Algorithm 1. The key steps of Algorithm 1 are the computations of S (k) and (Wc(k) , b(k) ), the derivation of which is based on Theorems 3 and 4 below. Theorem 3. For positive numbers τ and ρ, let the matrix ρW − Λ have SVD of the form: ρW − Λ = U0Σ0VT 0 + U1Σ1VT 1 , (8) where Σ0 is the diagonal matrix whose diagonal entries are greater than τ , U0 and V0 are matrices of the corresponding left and right singular vectors; Σ1, U1 and V1 correspond the rest parts of the SVD whose singular values are less than or equal to τ . Define S ∗ , 1 ρ Dτ (ρW − Λ) = 1 ρ U0(Σ0 − τ I)VT 0 . (9) Then we have 0 ∈ ∂ G1(S ∗ ), where G1(S) = G(S) + tr(ΛT S) + ρ 2 ||W − S||2 F . Since G1 is convex with respect to S, we obtain the update equation of S from Theorem 3. That is, S (k+1) = argmin S G(W(k) , S, Λb(k) ) = 1 ρ Dτ (ρW(k)−Λb(k) ). Theorem 4. One of the solution of the following problem argmin (W,b) H(W, b) − tr(Λ TW) + ρ 2 ||W−S||2 F is W∗ = 1 ρ + 1 Λ + ρS + Xn i=1 α ∗ i yiXi , (10) b ∗ = 1 |S∗| X i∈S∗ n yi − tr[(W∗ ) T Xi ] o , where S ∗ = {i : 0 < α∗ i < C}, and α∗ ∈ R n is the solution of the following box constraint quadratic programming problem: argmax α − 1 2 α T Kα + q T α, (11) s.t. 0 ≤ α ≤ C1n, Xn i=1 αiyi = 0
Support Matrix Machines Algorithm 1 ADMM for SMM 5.Experiments Initialize S(-1)=S(0)E RPX4,A(-1)=(0)E RPX9,p> 0,t)=1,n∈(0,1) In this section we conduct the experimental analysis of our for =1,2,3...do proposed SMM.We first analyze the group effect proper- (W(k),b())=argmin H(W,6)-tr(A(k)TW)+ ty of SMM.Then we study the classification performance (W,b) on synthetic and real-world data sets.All experiments are w-sI恨 implemented in Matlab R2011b on a workstation with Intel s)=argmin G(S)+r(Ars)+号Iw因-s到房 Xeon CPU X5675 3.06GHz(2 x 12 cores).24GB RAM. and 64bit Windows Server 2008 system. A(=)-pW)-S) c(k)=p-lA(k)(+pllS(e)-(k) 5.1.Group Effect Property on Synthetic Data if c(k)0 GLM fails to obtain the groups structure totally.The sim- ulation results show that SMM has a better grouping effect Theorem 5 can be obtained directly by Algorithm 1 through property than other baselines,which also implies that SM- Eqn.(10).Ifi>0,we call the corresponding Xi a sup- M is able to capture the structure information in the feature port matrix.Theorem 5 shows that the solution W of our matrices. SMM can be written as the linear combination of support matrices plus an offset.This is the reason that we call our 5.2.Classification Accuracy on Synthetic Data model the support matrix machine. We now conduct the performance of SMM on synthetic da- Since the hinge loss and nuclear norm are weakly convex, ta sets.We use the same data generation process as in the the convergence property of Algorithm 1 can be proved im- previous subsection,but we set V 10 and 6=10-2 to mediately based on the result in(Goldstein et al.,2012:He obtain more complicated data.We use 1000 samples for Yuan,2012).That is,we have training,and other 500 samples for testing.All the hyper- Theorem 6.For any p >0 and n E (0,1),the iteration parameters involved are selected via cross validation. sequence given by Algorithm I converges to the optimal IThe code is available in http://bcmi.s jtu.edu.cn/ solution of Problem(7). ~luoluo/code/smm.zip
Support Matrix Machines Algorithm 1 ADMM for SMM Initialize S (−1) = Sb(0) ∈ R p×q , Λ (−1) = Λb(0) ∈ R p×q , ρ > 0, t(1) = 1, η ∈ (0, 1). for k = 1, 2, 3 . . . do (W(k) , b(k) ) = argmin (W,b) H(W, b) − tr(Λb(k)TW) + ρ 2 ||W−Sb(k) ||2 F S (k) = argmin S G(S) + tr(Λb(k)T S) + ρ 2 ||W(k) − S||2 F Λ (k) = Λb(k) − ρ(W(k) − S (k) ) c (k) = ρ −1 ||Λ (k) − Λb(k) ||2 F + ρ||S (k) − Sb(k) ||2 F if c (k) 0 α˜iyiXi . (12) Theorem 5 can be obtained directly by Algorithm 1 through Eqn. (10). If α˜i > 0, we call the corresponding Xi a support matrix. Theorem 5 shows that the solution W˜ of our SMM can be written as the linear combination of support matrices plus an offset. This is the reason that we call our model the support matrix machine. Since the hinge loss and nuclear norm are weakly convex, the convergence property of Algorithm 1 can be proved immediately based on the result in (Goldstein et al., 2012; He & Yuan, 2012). That is, we have Theorem 6. For any ρ > 0 and η ∈ (0, 1), the iteration sequence given by Algorithm 1 converges to the optimal solution of Problem (7). 5. Experiments In this section we conduct the experimental analysis of our proposed SMM 1 . We first analyze the group effect property of SMM. Then we study the classification performance on synthetic and real-world data sets. All experiments are implemented in Matlab R2011b on a workstation with Intel Xeon CPU X5675 3.06GHz (2 × 12 cores), 24GB RAM, and 64bit Windows Server 2008 system. 5.1. Group Effect Property on Synthetic Data To intuitively demonstrate the grouping effect property described in Theorem 2, we design a simulated experiment to visualize it. We generate a synthetic data set of n samples as follows. First, we generate V orthogonal ndimensional basis vectors ν1, ν2, · · · , νV with the unit Euclidean length, respectively. Second, we construct pq feature vectors of length n by the following process: ˜fkl = νdl/(0.2q)e + kl, kl i.i.d ∼ N (0, δ2 In), for k = 1, . . . , p, and l = 1, . . . , q. Here dl/(0.2q)e denotes the smallest integer no smaller than l/(0.2q). In other words, the elements in each sample matrix Xi (i = 1, . . . , n) can be roughly partitioned into four groups (vertical blocks). The features within the same group have high correlation, while the features between different groups have low correlation. Then we generate a p × q matrix W of rank 0.2q, and the label for each sample is generated by yi = sign[tr(WT Xi)]. We set n = 1000, V = 4, p = 80, q = 100 and δ = 10−3 in this simulation. The values of the regression matrix obtained respectively from the bilinear SVM (B-SVM) (Pirsiavash et al., 2009), regularized GLM (R-GLM) (Zhou & Li, 2014) and SMM are shown in Figure 1. It is easy to see that the regression matrix of SMM is clearly partitioned into four pure color blocks, while the blocks of B-SVM have higher noise. RGLM fails to obtain the groups structure totally. The simulation results show that SMM has a better grouping effect property than other baselines, which also implies that SMM is able to capture the structure information in the feature matrices. 5.2. Classification Accuracy on Synthetic Data We now conduct the performance of SMM on synthetic data sets. We use the same data generation process as in the previous subsection, but we set V = 10 and δ = 10−2 to obtain more complicated data. We use 1000 samples for training, and other 500 samples for testing. All the hyperparameters involved are selected via cross validation. 1The code is available in http://bcmi.sjtu.edu.cn/ ˜luoluo/code/smm.zip
Support Matrix Machines Table 1.Summary of four data sets Data sets #positive #negative dimension EEG alcoholism 77 45 256×64 EEG emotion 1286 1334 31×10 (a)B-SVM (b)R-GLM (c)SMM students face 200 200 200×200 INRIA person 607 1214 160×96 Figure 1.(a).(b)and (c)display the values of normalized regres- sion matrix of B-SVM.R-GLM and SMM respectively. The EEG alcoholism data set arises to examine EEG cor- relates of genetic predisposition to alcoholism.It contains two groups of subjects:alcoholic and control.For each subject,64 channels of electrodes are placed and the volt- age values are recorded at 256 time points. The EEG emotion data set (Zhu et al.,2014;Zheng et al., 2014)focuses on EEG emotion analysis,which is obtained by showing some positive and negative emotional movie 01 clips to persons and then recording the EEG signal via ESI (a)Synthetic data with Gaussian noise NeuroScan System from 31 pairs.Each pair contain 10 data points(two channels for one pair,and each channel contains five frequency bands).There are 2620 movie clips chosen to evoke the target emotion,such as Titanic,Kung Fu Panda and so on. The student face data set contains 400 photos of Stanford University medical students (Nazir et al..2010).which con- sists of 200 males and 200 females.Each sample is a 200 x 200 gray level image. 05 000601Q015002002☒0.0 The INRIA person data set was collected to detect whether (b)Synthetic data with salt and pepper noise there exist people in the image.We normalize the samples into 160x 96 gray images and remove the same person with Figure 2.Classification accuracy on synthetic data with different different aspects.Combining with the negative samples,we levels of noises.We use Gaussian noise with 0 mean and standard obtain 1821 samples in total. derivation from 0.01 to 1 in(a),and salt and pepper noise with density from 0.001 to 0.035 in (b). We summarize the main information of these data sets in Table 1.For the student face and INRIA person data set- s,we directly use the pixels as input features without any We add different levels of Gaussian noise and salt and pep- advanced visual features. per noise on the test data,and repeat this procedure ten For each of the compared methods,we randomly sam- times to compute the mean and standard deviation of clas- ple 70%of the data set for training and the rest for test- sification accuracy.The results are shown in Figure 2.It ing.All the hyperparameters involved are selected vi- is clear that all methods achieve comparable performance a cross validation.More specifically,we select C from on clean data,but SMM is more robust with respect to high {1×10-3,2×10-3,5×10-3,1×10-2,2×10-2,5× level of noises. 10-2..,1×103,2×10}.For each C,we tune r man- ually to make the rank of classifier matrix varied from I to 5.3.Classification Accuracy on Real-World Data the size of the matrix.We repeat this procedure ten times to We apply SMM to EEG and image classification problem- compute the mean and standard deviation of the classifica- s,and compare its performance with B-SVM(Pirsiavash tion accuracy.Table 2 shows the classification accuracy of et al.,2009).R-GLM (Zhou Li,2014).and the standard the four methods.We can see that SMM achieves the best linear SVM (L-SVM)(Cortes Vapnik,1995).We use performance on all the four data sets. four real-world matrix classification data sets:the EEG al- http://kdd.ics.uci.edu/databases/eeg/ coholism,the EEG emotion,the students face and INRIA eeg.html person. http://pascal.inrialpes.fr/data/human/
Support Matrix Machines 20 40 60 80 100 10 20 30 40 50 60 70 80 −0.03 −0.025 −0.02 −0.015 −0.01 −0.005 0 0.005 0.01 20 40 60 80 100 10 20 30 40 50 60 70 80 −6 −4 −2 0 2 4 6 8 20 40 60 80 100 10 20 30 40 50 60 70 80 −0.03 −0.025 −0.02 −0.015 −0.01 −0.005 0 0.005 (a) B-SVM (b) R-GLM (c) SMM Figure 1. (a), (b) and (c) display the values of normalized regression matrix of B-SVM, R-GLM and SMM respectively. 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0.45 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 standard deviation of Gaussian noise accuracy B−SVM R−GLM SMM (a) Synthetic data with Gaussian noise 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 density of salt and pepper noise accuracy B−SVM R−GLM SMM (b) Synthetic data with salt and pepper noise Figure 2. Classification accuracy on synthetic data with different levels of noises. We use Gaussian noise with 0 mean and standard derivation from 0.01 to 1 in (a), and salt and pepper noise with density from 0.001 to 0.035 in (b). We add different levels of Gaussian noise and salt and pepper noise on the test data, and repeat this procedure ten times to compute the mean and standard deviation of classification accuracy. The results are shown in Figure 2. It is clear that all methods achieve comparable performance on clean data, but SMM is more robust with respect to high level of noises. 5.3. Classification Accuracy on Real-World Data We apply SMM to EEG and image classification problems, and compare its performance with B-SVM (Pirsiavash et al., 2009), R-GLM (Zhou & Li, 2014), and the standard linear SVM (L-SVM) (Cortes & Vapnik, 1995). We use four real-world matrix classification data sets: the EEG alcoholism, the EEG emotion, the students face and INRIA person. Table 1. Summary of four data sets Data sets #positive #negative dimension EEG alcoholism 77 45 256×64 EEG emotion 1286 1334 31×10 students face 200 200 200×200 INRIA person 607 1214 160×96 The EEG alcoholism data set2 arises to examine EEG correlates of genetic predisposition to alcoholism. It contains two groups of subjects: alcoholic and control. For each subject, 64 channels of electrodes are placed and the voltage values are recorded at 256 time points. The EEG emotion data set (Zhu et al., 2014; Zheng et al., 2014) focuses on EEG emotion analysis, which is obtained by showing some positive and negative emotional movie clips to persons and then recording the EEG signal via ESI NeuroScan System from 31 pairs. Each pair contain 10 data points (two channels for one pair, and each channel contains five frequency bands). There are 2620 movie clips chosen to evoke the target emotion, such as Titanic, Kung Fu Panda and so on. The student face data set contains 400 photos of Stanford University medical students (Nazir et al., 2010), which consists of 200 males and 200 females. Each sample is a 200 × 200 gray level image. The INRIA person data set3 was collected to detect whether there exist people in the image. We normalize the samples into 160×96 gray images and remove the same person with different aspects. Combining with the negative samples, we obtain 1821 samples in total. We summarize the main information of these data sets in Table 1. For the student face and INRIA person data sets, we directly use the pixels as input features without any advanced visual features. For each of the compared methods, we randomly sample 70% of the data set for training and the rest for testing. All the hyperparameters involved are selected via cross validation. More specifically, we select C from {1 × 10−3 , 2 × 10−3 , 5 × 10−3 , 1 × 10−2 , 2 × 10−2 , 5 × 10−2 . . . , 1 × 103 , 2 × 103}. For each C, we tune τ manually to make the rank of classifier matrix varied from 1 to the size of the matrix. We repeat this procedure ten times to compute the mean and standard deviation of the classification accuracy. Table 2 shows the classification accuracy of the four methods. We can see that SMM achieves the best performance on all the four data sets. 2http://kdd.ics.uci.edu/databases/eeg/ eeg.html 3http://pascal.inrialpes.fr/data/human/
Support Matrix Machines Table 2.The classification accuracy on four data sets (in % Data sets L-SVM B-SVM R-GLM SMM EEG alcoholism 71.11(±8.30) 71.67(±7.83) 71.39(±6.55 73.33(±5.89) EEG emotion 88.76(±1.16 87.73(±1.18) 82.26(±1.65) 90.01(±0.98) students face 91.67(±1.57) 95.42(±1.72) 94.25(±2.76) 96.83(±1.66 INRIA person 84.88(±1.98) 85.09(±1.46) 84.65(±1.38) 85.95(±0.77) Table 3.The training time on the four data sets (in second) Data sets B-SVM R-GLM SMM EEG alcoholism 86.30(±163.73) 407.59(±100.93) 1.36(±0.09) EEG emotion 292.89(±248.47) 33.32(±3.38) 6.57(±6.73) students face 23.88(±10.53 121.14(±87.40) 7.20(±0.22) INRIA person 19.36(±9.23) 580.06(±229.14) 6.61(±2.44) We are also interested in the computational efficiency of 61472182)and the Fundamental Research Funds for the the three matrix classification models:B-SVM,R-GLM Central Universities (No.20620140510). and SMM.We report the training time on the four data sets in Table 3.Recall that R-GLM is solved by the Nes- Appendix A:The Proof of Lemma 1 terov method (Zhou Li,2014).We can find that R-GLM is the slowest method on EEG alcoholism.students face Proof.Let B=[By,...Bnyn],then we have and INRIA person.This is because the main step of the Nesterov method is affected by the dimension of the in- (kl-k22)2 put sample (Zhou Li,2014).However,the main step of B-SVM and SMM is a quadratic programming problem whose time complexity is mainly affected by the number of training samples.So B-SVM and SMM are more efficient than R-GLM on the data sets with high-dimension samples [aT(Eh-f22)]2 Furthermore,we find that the running time of B-SVM are unstable on different data sets,usually with higher vari- ≤ l3121lE4,-f222 ance than that of SMM.The reason might be that B-SVM is a non-convex problem,the training procedure of which ≤ 2nC2(1-fa). relies heavily on the initial value of the parameter. ▣ Appendix B:The Proof of Theorem 2 6.Conclusion Proof.Suppose has condensed SVD UXVT, In this paper we have proposed a novel matrix classifica- where U∈Rpxr,∑=diag(o1,.,og)andV∈Rgxr tion method called support matrix machine (SMM).SMM satisfy UUT Ip and vVT =Ig.Denote U can leverage the structure of the data matrices and has the [u1,...,ur],where for k=1,...,r,uk is the kth column grouping effect property.We have derived an iteration al- of U.Since the columns of U are orthogonal,we have gorithm based on ADMM for learning,and applied our method to EEG and image classification with better per- lw]k4-w],2 formance than the baselines such as B-SVM and R-GLM. lh-【, Specifically,our method is more robust than B-SVM and R-GLM to model noisy data.Furthermore,our method is ‖∑=1(ak-T)+(V]k-[V]2k)u2 more efficient than B-SVM and R-GLM,and more numer- ically stable than B-SVM. ‖∑g=1ok(V],k-[V2k)u2 =1[(ok-T)+]2(V]k-[V]2k)2llu2 7.Acknowledgement Xk=102([V]k-[V]tak)2llukl2 Luo Luo and Zhihua Zhang are supported by the Natu- ral Science Foundation of Shanghai City of China(No. ∑1los-r+IPVk-IVYs'≤1. 15ZR1424200).Wu-Jun Li is supported by the NSFC (No. ∑%=1o2(V]k-[V]2k)尸
Support Matrix Machines Table 2. The classification accuracy on four data sets (in %) Data sets L-SVM B-SVM R-GLM SMM EEG alcoholism 71.11 (± 8.30) 71.67 (± 7.83) 71.39 (± 6.55) 73.33 (± 5.89) EEG emotion 88.76 (± 1.16) 87.73 (± 1.18) 82.26 (± 1.65) 90.01 (± 0.98) students face 91.67 (± 1.57) 95.42 (± 1.72) 94.25 (± 2.76) 96.83 (± 1.66) INRIA person 84.88 (± 1.98) 85.09 (± 1.46) 84.65 (± 1.38) 85.95 (± 0.77) Table 3. The training time on the four data sets (in second) Data sets B-SVM R-GLM SMM EEG alcoholism 86.30 (± 163.73) 407.59 (± 100.93) 1.36 (± 0.09) EEG emotion 292.89 (± 248.47) 33.32 (± 3.38) 6.57 (± 6.73) students face 23.88 (± 10.53) 121.14 (± 87.40) 7.20 (± 0.22) INRIA person 19.36 (± 9.23) 580.06 (± 229.14) 6.61 (± 2.44) We are also interested in the computational efficiency of the three matrix classification models: B-SVM, R-GLM and SMM. We report the training time on the four data sets in Table 3. Recall that R-GLM is solved by the Nesterov method (Zhou & Li, 2014). We can find that R-GLM is the slowest method on EEG alcoholism, students face and INRIA person. This is because the main step of the Nesterov method is affected by the dimension of the input sample (Zhou & Li, 2014). However, the main step of B-SVM and SMM is a quadratic programming problem whose time complexity is mainly affected by the number of training samples. So B-SVM and SMM are more efficient than R-GLM on the data sets with high-dimension samples. Furthermore, we find that the running time of B-SVM are unstable on different data sets, usually with higher variance than that of SMM. The reason might be that B-SVM is a non-convex problem, the training procedure of which relies heavily on the initial value of the parameter. 6. Conclusion In this paper we have proposed a novel matrix classification method called support matrix machine (SMM). SMM can leverage the structure of the data matrices and has the grouping effect property. We have derived an iteration algorithm based on ADMM for learning, and applied our method to EEG and image classification with better performance than the baselines such as B-SVM and R-GLM. Specifically, our method is more robust than B-SVM and R-GLM to model noisy data. Furthermore, our method is more efficient than B-SVM and R-GLM, and more numerically stable than B-SVM. 7. Acknowledgement Luo Luo and Zhihua Zhang are supported by the Natural Science Foundation of Shanghai City of China (No. 15ZR1424200). Wu-Jun Li is supported by the NSFC (No. 61472182) and the Fundamental Research Funds for the Central Universities (No. 20620140510). Appendix A: The Proof of Lemma 1 Proof. Let β˜0 = [β˜ 1y1, . . . , β˜ nyn] T , then we have ([Ω]k1l1 − [Ω]k2l2 ) 2 = Xn i=1 β˜ iyi [Xi ]k1l1 − Xn i=1 β˜ iyi [Xi ]k2l2 2 = [β˜0T (fk1l1 − fk2l2 )]2 ≤ ||β˜0 ||2 ||fk1l1 − fk2l2 ||2 ≤ 2nC2 (1 − f T k1l1 fk2l2 ). Appendix B: The Proof of Theorem 2 Proof. Suppose Ω has condensed SVD Ω = UΣVT , where U ∈ R p×r , Σ = diag(σ1, . . . , σq) and V ∈ R q×r satisfy UUT = Ip and VVT = Iq. Denote U = [u1, . . . , ur], where for k = 1, . . . , r, uk is the kth column of U. Since the columns of U are orthogonal, we have [W˜ ]:,l1 − [W˜ ]:,l1 2 [Ω]:,l1 − [Ω]:,l1 2 = Pq k=1(σk − τ )+([V]l1k − [V]l2k)uk 2 Pq k=1 σk([V]l1k − [V]l2k)uk 2 = Pq k=1[(σk − τ )+] 2 ([V]l1k − [V]l2k) 2 ||uk||2 Pq k=1 σ 2 k ([V]l1k − [V]l2k) 2||uk||2 = Pq k=1[(σk − τ )+] 2 ([V]l1k − [V]l2k) 2 Pq k=1 σ 2 k ([V]l1k − [V]l2k) 2 ≤ 1
Support Matrix Machines Then we can obtain the following bound based on Lemma Substituting (15)and(16)into (14)to eliminate Yi and 1 we obtain I[W -[w2 L(W,b,,a,Y) ≤lh-2I2 -jt(wTW)-t(ATW)+2lW-sl 1 三Iou.-a -∑a{tr(wrx)+-1. (17) 2=1 ≤2nc2(p-∑) Setting the derivative of L with respect to W to be 0,we k=1 ▣ have the optimal value (18) Appendix C:The Proof of Theorem 3 w=a+s+wX =1 Proof.Let Zo be U11V1.Recall that Uo.U1,Vo and Substituting (18)into(17),we obtain Vi are column orthogonal.So we have UZo =0 and L(W,b,ξ,a,Y) Zo Vo =0.By the SVD form of S,formulation(9)and using Eqn.(1)we have: -∑1- A+pSPX)a i=l p+1 aG1(S)ls=s-=A-pW +D(pW -A)+T llsll-Is=s- 1 Thus,we have 0 E 0G1(S*). ▣ 2(p+1) ∑a:ah5tr(XX)+D, i.j=1 Appendix D:The Proof of Theorem 4 where D= ()Thus., 1 2 Proof.We denote H(W,b)=H(W,b)-tr(ATW)+ finding the minimizer of H(W,b)is equivalent to solving W-Finding the minimizer of H(W,)is equiv- Problem(11)by KKT conditions.Let the optimal solution of (11)be a*,we can obtain (10)from(18)directly.The alent to solving the following problem: KKT conditions also provide r(wrw)+C∑s (13) a{{t[(W*)TX+b-1+}=0 i=1 Y2=0, -t(AV+1w-s服 which means for any 0 st.tr(WrX)+≥1- 0,=0 and yi{tr[(W*)TX]+*}-1 =0.Then the optimal b*can be calculated by Ei≥0. To solve problem (13).we construct the following La- b=贴-tr(W*)TXJ. grangian function In practice,we choose the optimal b*by averaging these solutions L(W,b,E,a,Y) =2rww)+c∑i-traW+w-sl呢 b 网-wyrx iES+ =1 0 -∑a,u(wrX)+-1+s-∑&.(14 -1 References Setting the derivative of L with respect to to be 0,we Bach,Francis R.Consistency of trace norm minimiza- have tion.The Journal of Machine Learning Research,9: %=C-ai20i=1,.,n. (15) 1019-1048.2008. Setting the derivative of L with respect to bbe 0,we have Cai,Deng,He,Xiaofei,Wen,Ji-Rong,Han,Jiawei,and Ma,Wei-Ying.Support tensor machines for text cat- ∑a=0, (16) egorization.Technical report,University of Illinois at = Urbana-Champaign,2006
Support Matrix Machines Then we can obtain the following bound based on Lemma 1 [W˜ ]:,l1 − [W˜ ]:,l2 2 ≤ [Ω]:,l1 − [Ω]:,l2 2 = Xp k=1 [Ω]kl1 − [Ω]kl2 2 ≤ 2nC2 p − Xp k=1 f T kl1 fkl2 . Appendix C: The Proof of Theorem 3 Proof. Let Z0 be 1 τ U1Σ1V1. Recall that U0, U1, V0 and V1 are column orthogonal. So we have UT 0 Z0 = 0 and Z0V0 = 0. By the SVD form of Sb, formulation (9) and using Eqn. (1) we have: ∂G1(S)|S=S∗ = Λ − ρW + Dτ (ρW − Λ) + τ ∂||S||∗|S=S∗ . Thus, we have 0 ∈ ∂G1(S ∗ ). Appendix D: The Proof of Theorem 4 Proof. We denote H1(W, b) = H(W, b) − tr(ΛTW) + ρ 2 ||W−S||2 F . Finding the minimizer of H1(W, b) is equivalent to solving the following problem: min W,b,ξ 1 2 tr(WTW) + C Xn i=1 ξi (13) −tr(Λ TW) + ρ 2 ||W − S||2 F s.t. yi [tr(WT Xi) + b] ≥ 1 − ξi ξi ≥ 0. To solve problem (13), we construct the following Lagrangian function L(W, b, ξ, α, γ) = 1 2 tr(WTW) + C Xn i=1 ξi − tr(Λ TW) + ρ 2 ||W − S||2 F − Xn i=1 αi{yi [tr(WT Xi) + b] − 1 + ξi} −Xn i=1 γiξi . (14) Setting the derivative of L with respect to ξ to be 0, we have γi = C − αi ≥ 0, i = 1, . . . , n. (15) Setting the derivative of L with respect to b be 0, we have Xn i=1 αiyi = 0. (16) Substituting (15) and (16) into (14) to eliminate γi and ξi , we obtain L(W, b, ξ, α, γ) = 1 2 tr(WTW) − tr(Λ TW) + ρ 2 ||W − S||2 F − Xn i=1 αi{yi [tr(WT Xi) + b] − 1}. (17) Setting the derivative of L with respect to W to be 0, we have the optimal value W∗ = 1 ρ + 1 Λ + ρS + Xn i=1 αiyiXi . (18) Substituting (18) into (17), we obtain L(W, b, ξ, α, γ) = Xn i=1 1 − yitr[(Λ + ρS) T Xi ] ρ + 1 αi − 1 2(ρ + 1) Xn i,j=1 αiαiyiyj tr(XT i Xj ) + D, where D = ρ 2 tr(S T S) − 1 2(ρ + 1)||Λ + ρS||2 F . Thus, finding the minimizer of H(W, b) is equivalent to solving Problem (11) by KKT conditions. Let the optimal solution of (11) be α∗ , we can obtain (10) from (18) directly. The KKT conditions also provide α ∗ i yi{tr[(W∗ ) T Xi ] + b ∗ } − 1 + ξ ∗ i = 0 γ ∗ i ξ ∗ i = 0, which means for any 0 0, ξ∗ i = 0 and yi{tr[(W∗ ) T Xi ] + b ∗} − 1 = 0. Then the optimal b ∗ can be calculated by b ∗ = yi − tr[(W∗ ) T Xi ]. In practice, we choose the optimal b ∗ by averaging these solutions b ∗ = 1 |S∗| X i∈S∗ {yi − tr[(W∗ ) T Xi ]}. References Bach, Francis R. Consistency of trace norm minimization. The Journal of Machine Learning Research, 9: 1019–1048, 2008. Cai, Deng, He, Xiaofei, Wen, Ji-Rong, Han, Jiawei, and Ma, Wei-Ying. Support tensor machines for text categorization. Technical report, University of Illinois at Urbana-Champaign, 2006
Support Matrix Machines Cai.Jian-Feng,Candes,Emmanuel J,and Shen,Zuowei.A Liu,Ji,Musialski,Przemyslaw,Wonka,Peter,and Ye. singular value thresholding algorithm for matrix comple- Jieping.Tensor completion for estimating missing val- tion.SIAM Journal on Optimization,20(4):1956-1982, ues in visual data.In IEEE Tansactions on Pattern Anal- 2010. ysis and Machine Intelligence,volume 35,pp.208-220, 2013. Candes,Emmanuel J and Recht,Benjamin.Exact ma- trix completion via convex optimization.Foundations Nazir,M,Ishtiag,Muhammad,Batool,Anab,Jaffar,M Ar- of Computational mathematics,9(6):717-772,2009. fan,and Mirza.Anwar M.Feature selection for efficient gender classification.In Proceedings of the WSEAS in- Cortes,Corinna and Vapnik,Vladimir.Support-vector net- ternational conference,Wisconsin,pp.70-75,2010. works.Machine learning,20(3):273-297,1995. Nesterov,Yurii.A method of solving a convex program- Goldstein,Tom,ODonoghue,Brendan,and Setzer,Simon. ming problem with convergence rate o(1/k2).Soviet Mathematics Doklady,27(2):372-376.1983 Fast alternating direction optimization methods.CAM eport,Pp.12-35,2012. Pirsiavash,Hamed,Ramanan,Deva,and Fowlkes,Char- less C.Bilinear classifiers for visual recognition.In Harchaoui,Zaid,Douze,Matthijs,Paulin,Mattis,Dudik, Proceedings of the Advances in Neural Information Pro- Miroslav,and Malick,Jerome.Large-scale image clas- cessing Systems,pp.1482-1490,2009 sification with trace-norm regularization.In Proceedings of the IEEE Conference on Computer Vision and Pattern Platt,John et al.Sequential minimal optimization:A fast Recognition,pp.3386-3393,2012. algorithm for training support vector machines.Techni- cal report msr-tr-98-14,Microsoft Research,1998. Hastie.Trevor.Robert.Tibshirani.and Jerome.Friedman. Pong,Ting Kei,Tseng,Paul,Ji,Shuiwang,and Ye,Jieping. The Elements of Statistical Learning:Data Mining,In- Trace norm regularization:reformulations,algorithms, ference,and Prediction.Springer-Verlag,2001. and multi-task learning.SIAM Journal on Optimization, 20(6):3465-3489.2010. He,Bingsheng and Yuan,Xiaoming.On non-ergodic con- vergence rate of douglas-rachford alternating direction Salakhutdinov.Ruslan and Srebro,Nathan.Collabora- method of multipliers.Numerische Mathematik,pp.1- tive filtering in a non-uniform world:Learning with the 11,2012. weighted trace norm.arXiv preprint arXiv:1002.2780, 2010. Huang,Jin,Nie,Feiping,and Huang,Heng.Robust discrete matrix completion.In Proceedings of the Signoretto,Marco,Dinh,Quoc Tran,De Lathauwer, AAAI Conference on Artificial Intelligence,pp.424-430, Lieven,and Suykens,Johan AK.Learning with tensors: 2013. a framework based on convex optimization and spectral regularization.Machine Learning,94(3):303-351,2014. Hunyadi,Borbala,Signoretto,Marco,Van Paesschen, Srebro,Nathan and Shraibman,Adi.Rank,trace-norm and Wim,Suykens,Johan AK,Van Huffel,Sabine,and max-norm.In Proceedings of the Conference on Learn- De Vos,Maarten.Incorporating structural informa- ing Theory,pp.545-560.2005. tion from the multichannel eeg improves patient-specific seizure detection.Clinical Neurophysiology,123(12): Vandenberghe,Lieven and Boyd,Stephen.Semidefinite 2352-2361.2012. programming.SIAM review,38(1):49-95,1996. Kang,Zhuoliang,Grauman,Kristen,and Sha,Fei.Learn- Watson,G Alistair.Characterization of the subdifferential ing with whom to share in multi-task feature learning.In of some matrix norms.Linear Algebra and its Applica- Proceedings of the International Conference on Machine tions,170:33-45,1992. Learning,Pp.521-528,2011. Wolf,Lior,Jhuang.Hueihan,and Hazan,Tamir.Model- ing appearances with low-rank SVM.In Proceedings of Keerthi,S.Sathiya and Gilbert,Elmer G.Convergence of the IEEE Conference on Computer Vision and Pattern a generalized smo algorithm for svm classifier design. Recognition,pp.1-6,2007. Machine Learning,46(1-3):351-360,2002. Zheng,Wei-Long,Zhu,Jia-Yi,Peng,Yong,and Lu,Bao- Lewis,Adrian S.The mathematics of eigenvalue opti- Liang.Eeg-based emotion classification using deep be- mization.Mathematical Programming,97(1-2):155- lief networks.In Proceedings of the IEEE International 176.2003 Conference on Multimedia and Expo,pp.1-6,2014
Support Matrix Machines Cai, Jian-Feng, Candes, Emmanuel J, and Shen, Zuowei. A ` singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010. Candes, Emmanuel J and Recht, Benjamin. Exact ma- ` trix completion via convex optimization. Foundations of Computational mathematics, 9(6):717–772, 2009. Cortes, Corinna and Vapnik, Vladimir. Support-vector networks. Machine learning, 20(3):273–297, 1995. Goldstein, Tom, ODonoghue, Brendan, and Setzer, Simon. Fast alternating direction optimization methods. CAM report, pp. 12–35, 2012. Harchaoui, Zaid, Douze, Matthijs, Paulin, Mattis, Dudik, Miroslav, and Malick, Jer´ ome. Large-scale image clas- ˆ sification with trace-norm regularization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3386–3393, 2012. Hastie, Trevor, Robert, Tibshirani, and Jerome, Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001. He, Bingsheng and Yuan, Xiaoming. On non-ergodic convergence rate of douglas–rachford alternating direction method of multipliers. Numerische Mathematik, pp. 1– 11, 2012. Huang, Jin, Nie, Feiping, and Huang, Heng. Robust discrete matrix completion. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 424–430, 2013. Hunyadi, Borbala, Signoretto, Marco, Van Paesschen, ´ Wim, Suykens, Johan AK, Van Huffel, Sabine, and De Vos, Maarten. Incorporating structural information from the multichannel eeg improves patient-specific seizure detection. Clinical Neurophysiology, 123(12): 2352–2361, 2012. Kang, Zhuoliang, Grauman, Kristen, and Sha, Fei. Learning with whom to share in multi-task feature learning. In Proceedings of the International Conference on Machine Learning, pp. 521–528, 2011. Keerthi, S. Sathiya and Gilbert, Elmer G. Convergence of a generalized smo algorithm for svm classifier design. Machine Learning, 46(1-3):351–360, 2002. Lewis, Adrian S. The mathematics of eigenvalue optimization. Mathematical Programming, 97(1-2):155– 176, 2003. Liu, Ji, Musialski, Przemyslaw, Wonka, Peter, and Ye, Jieping. Tensor completion for estimating missing values in visual data. In IEEE Tansactions on Pattern Analysis and Machine Intelligence, volume 35, pp. 208–220, 2013. Nazir, M, Ishtiaq, Muhammad, Batool, Anab, Jaffar, M Arfan, and Mirza, Anwar M. Feature selection for efficient gender classification. In Proceedings of the WSEAS international conference, Wisconsin, pp. 70–75, 2010. Nesterov, Yurii. A method of solving a convex programming problem with convergence rate o(1/k2). Soviet Mathematics Doklady, 27(2):372–376, 1983. Pirsiavash, Hamed, Ramanan, Deva, and Fowlkes, Charless C. Bilinear classifiers for visual recognition. In Proceedings of the Advances in Neural Information Processing Systems, pp. 1482–1490, 2009. Platt, John et al. Sequential minimal optimization: A fast algorithm for training support vector machines. Technical report msr-tr-98-14, Microsoft Research, 1998. Pong, Ting Kei, Tseng, Paul, Ji, Shuiwang, and Ye, Jieping. Trace norm regularization: reformulations, algorithms, and multi-task learning. SIAM Journal on Optimization, 20(6):3465–3489, 2010. Salakhutdinov, Ruslan and Srebro, Nathan. Collaborative filtering in a non-uniform world: Learning with the weighted trace norm. arXiv preprint arXiv:1002.2780, 2010. Signoretto, Marco, Dinh, Quoc Tran, De Lathauwer, Lieven, and Suykens, Johan AK. Learning with tensors: a framework based on convex optimization and spectral regularization. Machine Learning, 94(3):303–351, 2014. Srebro, Nathan and Shraibman, Adi. Rank, trace-norm and max-norm. In Proceedings of the Conference on Learning Theory, pp. 545–560. 2005. Vandenberghe, Lieven and Boyd, Stephen. Semidefinite programming. SIAM review, 38(1):49–95, 1996. Watson, G Alistair. Characterization of the subdifferential of some matrix norms. Linear Algebra and its Applications, 170:33–45, 1992. Wolf, Lior, Jhuang, Hueihan, and Hazan, Tamir. Modeling appearances with low-rank SVM. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–6, 2007. Zheng, Wei-Long, Zhu, Jia-Yi, Peng, Yong, and Lu, BaoLiang. Eeg-based emotion classification using deep belief networks. In Proceedings of the IEEE International Conference on Multimedia and Expo, pp. 1–6, 2014
Support Matrix Machines Zhou,Hua and Li,Lexin.Regularized matrix regression Journal of the Royal Statistical Society:Series B(Statis- tical Methodology以,76(2):463-483,2014. Zhu,Jia-Yi,Zheng,Wei-Long,Peng,Yong,Duan,Ruo- Nan,and Lu,Bao-Liang.Eeg-based emotion recognition using discriminative graph regularized extreme learning machine.In Proceedings of the International Joint Con- ference on Neural Networks,pp.525-532,2014. Zou,Hui and Hastie,Trevor.Regularization and variable selection via the elastic net.Journal of the Royal Statis- tical Society:Series B(Statistical Methodology),67(2): 301-320.2005
Support Matrix Machines Zhou, Hua and Li, Lexin. Regularized matrix regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(2):463–483, 2014. Zhu, Jia-Yi, Zheng, Wei-Long, Peng, Yong, Duan, RuoNan, and Lu, Bao-Liang. Eeg-based emotion recognition using discriminative graph regularized extreme learning machine. In Proceedings of the International Joint Conference on Neural Networks, pp. 525–532, 2014. Zou, Hui and Hastie, Trevor. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2): 301–320, 2005