Support Matrix Machines

Luo Luo RICKY@SJTU.EDU.CN
Yubo Xie YUBOTSE@SJTU.EDU.CN
Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China

Zhihua Zhang ZHIHUA@SJTU.EDU.CN
Institute of Data Science, Department of Computer Science and Engineering, Shanghai Jiao Tong University, China

Wu-Jun Li LIWUJUN@NJU.EDU.CN
National Key Laboratory for Novel Software Technology, Collaborative Innovation Center of Novel Software Technology and Industrialization, Department of Computer Science and Technology, Nanjing University, China

Abstract

In many classification problems, such as electroencephalogram (EEG) classification and image classification, the input features are naturally represented as matrices rather than vectors or scalars. In general, the structure information of the original feature matrix is useful and informative for data analysis tasks such as classification. One typical kind of structure information is the correlation between columns or rows of the feature matrix. To leverage this kind of structure information, we propose a new classification method that we call the support matrix machine (SMM).
Specifically, SMM is defined as a hinge loss plus a so-called spectral elastic net penalty, which is a spectral extension of the conventional elastic net over a matrix. The spectral elastic net enjoys a grouping-effect property; i.e., strongly correlated columns or rows tend to be selected or dropped together. Since the optimization problem for SMM is convex, this encourages us to devise an alternating direction method of multipliers (ADMM) algorithm for solving the problem. Experimental results on EEG and image classification data show that our model is more robust and efficient than the state-of-the-art methods.

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

1. Introduction

Classical classification methods such as support vector machines (SVMs) (Cortes & Vapnik, 1995) and logistic regression (Hastie et al., 2001) were originally built for the case in which input samples are represented as vectors or scalars. However, input samples are also often naturally represented as two-dimensional matrices or tensors. When applying classical classification methods to data with a matrix representation, we have to reshape the input matrices into vectors. However, this destroys the structure information of the data matrix, e.g., the correlation among different channels of electroencephalogram (EEG) data (Zhou & Li, 2014) and the spatial relationship of nearby pixels in image data (Wolf et al., 2007). Moreover, if the data matrix is stacked (reshaped) into a vector, the dimensionality of the resulting vector typically becomes very high, which in turn leads to the curse of dimensionality.

There has been some work on classification methods that attempt to exploit the correlation between the columns or rows of the data matrix. Usually, such a classification method introduces a matrix of regression coefficients to leverage the correlation within the data matrix.
For example, Wolf et al. (2007) proposed a rank-k SVM, which models the regression matrix as the sum of k rank-one orthogonal matrices. Pirsiavash et al. (2009) devised a bilinear SVM by factorizing the regression matrix into two low-rank matrices. Cai et al. (2006) proposed a similar bilinear framework called support tensor machines for text categorization. These methods essentially rely on a low-rank assumption, which can be used to describe the correlation within a matrix. However, their treatments result in non-convex optimization problems.

In this paper we are also concerned with classification problems on a set of data matrices. Our work is motivated by the use of the nuclear norm (a.k.a. the trace norm) in low-rank matrix approximation (Srebro & Shraibman, 2005), matrix completion (Candès & Recht, 2009; Liu et al., 2013; Salakhutdinov & Srebro, 2010; Huang et al., 2013), and multi-task learning problems (Pong et al., 2010;
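For reference, the nuclear norm invoked above is the sum of a matrix's singular values, a convex surrogate for matrix rank. Combining it with a hinge loss and a Frobenius-norm term, as the abstract's description of SMM (hinge loss plus a spectral elastic net) suggests, gives an objective of roughly the following form; the trade-off parameters $\tau$ and $C$ are illustrative notation here, not necessarily the symbols used later in the paper:

```latex
% Nuclear (trace) norm: sum of singular values of W, a convex surrogate for rank
\|W\|_* = \sum_{i=1}^{\min(p,q)} \sigma_i(W), \qquad W \in \mathbb{R}^{p \times q}.

% Sketch of an SMM-style objective: spectral elastic net
% (Frobenius term + nuclear-norm term) plus a hinge loss over n samples
\min_{W,\,b} \; \frac{1}{2}\,\mathrm{tr}\bigl(W^{\top} W\bigr)
  + \tau \,\|W\|_*
  + C \sum_{i=1}^{n} \max\Bigl\{0,\; 1 - y_i\bigl[\mathrm{tr}\bigl(W^{\top} X_i\bigr) + b\bigr]\Bigr\}.
```

Because every term above is convex in $(W, b)$, an objective of this shape admits globally optimal solutions, in contrast to the factorization-based methods just surveyed.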