Neurocomputing 72 (2009) 3182–3190. Contents lists available at ScienceDirect. Journal homepage: www.elsevier.com/locate/neucom

Kernel nonnegative matrix factorization for spectral EEG feature extraction

Hyekyoung Lee (a), Andrzej Cichocki (b), Seungjin Choi (a,*)

(a) Department of Computer Science, Pohang University of Science and Technology, San 31 Hyoja-dong, Nam-gu, Pohang 790-784, Republic of Korea
(b) Laboratory for Advanced Brain Signal Processing, Brain Science Institute, RIKEN, 2-1 Hirosawa, Wako-shi, Saitama 351-0198, Japan

* Corresponding author. Tel.: +82 54 279 2259; fax: +82 54 279 2299. E-mail address: seungjin@postech.ac.kr (S. Choi). URL: http://www.postech.ac.kr/~seungjin (S. Choi).

Article history: Received 25 June 2008; received in revised form 18 February 2009; accepted 2 March 2009; available online 1 April 2009. Communicated by T. Heskes.

Keywords: EEG classification; Feature extraction; Kernel methods; Multiplicative updates; Nonnegative matrix factorization

doi:10.1016/j.neucom.2009.03.005. © 2009 Elsevier B.V. All rights reserved.

Abstract

Nonnegative matrix factorization (NMF) seeks a decomposition of a nonnegative matrix $X \ge 0$ into a product of two nonnegative factor matrices $U \ge 0$ and $V \ge 0$, such that a discrepancy between $X$ and $UV^\top$ is minimized. Assuming $U = XW$ in the decomposition (for $W \ge 0$), kernel NMF (KNMF) is easily derived in the framework of least squares optimization. In this paper we make use of KNMF to extract discriminative spectral features from the time–frequency representation of electroencephalogram (EEG) data, which is an important task in EEG classification. Especially when KNMF with a linear kernel is used, spectral features are computed by a single matrix multiplication, whereas in standard NMF a multiplicative update must be performed repeatedly with the other factor matrix fixed, or the pseudo-inverse of a matrix is required. Moreover, in KNMF with a linear kernel one can easily perform feature selection or data selection because of its natural sparsity. Experiments on two EEG datasets from brain–computer interface (BCI) competitions demonstrate the useful behavior of the proposed methods.

1. Introduction

Nonnegative matrix factorization (NMF) is a multivariate analysis method which has proven useful for learning a faithful representation of nonnegative data such as images, spectrograms, and documents [16,9]. NMF seeks a decomposition of a nonnegative data matrix $X \ge 0$ into a product of two nonnegative factor matrices such that $X \approx UV^\top$ for $U \ge 0$ and $V \ge 0$. NMF allows only nonsubtractive combinations of nonnegative basis vectors to approximate the original nonnegative data, possibly providing a parts-based representation [9].

Electroencephalogram (EEG) is the most popular sensory signal used for brain–computer interfaces (BCI). NMF was shown to be useful for determining discriminative basis vectors which reflect meaningful spectral characteristics, without cross-validation, in a motor imagery task [11]. Exemplary spectral characteristics of EEG involving motor activity are the $\mu$ rhythm (8–12 Hz) and the $\beta$ rhythm (18–25 Hz), which decrease during movement or in preparation for movement (event-related desynchronization, ERD) and increase after movement and in relaxation (event-related synchronization, ERS) [17]. ERD and ERS can serve as relevant features for the task of motor imagery EEG classification. However, these phenomena may occur in different frequency bands depending on the subject, for instance in 16–20 Hz rather than 8–12 Hz [8]. Thus, it is desirable to develop a data-driven feature extraction method which automatically determines discriminative spectral bands. To this end, NMF was proposed as a promising method and was proven to be a good solution [11,3].

Several interesting variations of NMF were recently proposed in [15], including convex-NMF, semi-NMF, and kernel NMF (KNMF).
These extensions were mainly discussed in the context of clustering, emphasizing their applicability to data matrices without sign restriction. Here, by contrast, we direct our attention to KNMF in the case of a nonnegative data matrix, which is easily developed by assuming $U = XW$ for $W \ge 0$ in the decomposition $X = UV^\top$. We derive a multiplicative updating algorithm for KNMF from the KKT conditions, as in our recent work [3], where the $\alpha$-divergence was used as an error measure for NMF.

In this paper we make use of KNMF to extract discriminative spectral features from the time–frequency representation of EEG data, which is an important task in EEG classification. The main contribution of this paper is to show how KNMF is applied to extract spectral EEG features, emphasizing its useful aspects, which are summarized below:

- When KNMF with a linear kernel is used, spectral features are computed by a single matrix multiplication, whereas in standard NMF a multiplicative update must be performed repeatedly with the other factor matrix fixed, or the pseudo-inverse of a matrix is required.
- KNMF with a linear kernel is a special case of convex-NMF [4], provided that each column of $W$ satisfies the sum-to-one constraint. We show that KNMF yields a sparser
representation compared to standard NMF in the case of spectral EEG data, confirming the result first reported in [4]. In addition, we show that feature selection or data selection can be easily performed with linear-kernel KNMF, because of its natural sparsity.
- All these useful behaviors of KNMF are verified through experiments on two EEG datasets from BCI competitions.

The rest of this paper is organized as follows. The next section provides a brief overview of NMF, illustrating how spectral features are extracted from EEG data using NMF. Section 3 explains KNMF and its application to spectral EEG feature extraction. Experiments on two EEG datasets from BCI competitions are presented in Section 4. Finally, conclusions are drawn in Section 5.

2. NMF for spectral EEG feature extraction

We begin by illustrating how to construct a data matrix from EEG data.

2.1. Data matrix construction

We construct the data matrix $X \in \mathbb{R}^{n \times m}$ from the time-domain EEG signal such that each column vector of $X$ is associated with a frequency profile over trials.

We denote by $\psi_t^{(k)}$ the time-domain EEG signal measured at the $k$th channel ($k = 1, \ldots, K$). We also denote by $P_{t,f}^{(k)}$ (time and frequency indices run over $t = 1, \ldots, n$ and $f = 1, \ldots, F$, respectively) the corresponding time–frequency representation of the EEG signal $\psi_t^{(k)}$, which is typically computed by the short-time Fourier transform (STFT) or a wavelet transform. The spectral matrix is then given by $P^{(k)} = [P_{t,f}^{(k)}] \in \mathbb{R}^{n \times F}$. In the case of EEG data with $N$ trials, the time index is given by $n = TN$, where $T$ is the number of samples in each trial.

We provide a brief illustration of how to compute $P_{t,f}^{(k)}$ using the complex Morlet wavelet transform, which is very popular for handling EEG data. The time–frequency representation of EEG data is computed by filtering it with complex Morlet wavelets, where the mother wavelet is given by

$$\Psi_0(\eta) = \pi^{-1/4} e^{i w_0 \eta} e^{-\eta^2/2},$$   (1)

where $w_0$ is the characteristic eigenfrequency (generally taken to be 6). Scaling and temporal shifting of the mother wavelet leads to $\Psi_{\tau, d(f)}$, controlled by the factor $\eta = (t - \tau)/d(f)$, where

$$d(f) = \frac{w_0 + \sqrt{2 + w_0^2}}{4 \pi f},$$   (2)

and $f$ is the main receptive frequency. The wavelet transform of $\psi_t^{(k)}$ at time $t$ and frequency $f$ is its convolution with the scaled and shifted wavelets. The amplitude of the wavelet transform, $P_{t,f}^{(k)}$, is given by

$$P_{t,f}^{(k)} = \left\| \psi_t^{(k)} * \Psi_{\tau, d(f)}(t) \right\|,$$   (3)

for $k = 1, \ldots, K$.

We then construct the data matrix $X$ by collecting the $n \times F$ spectral matrices computed at the $K$ different channels,

$$X = [P^{(1)}, P^{(2)}, \ldots, P^{(K)}] \in \mathbb{R}^{n \times m},$$   (4)

where $m = KF$. Fig. 3 shows time-domain EEG signals in the upper panel, measured at eight different channels (C3, Cz, C4, CP1, CP2, P3, Pz, P4), and their corresponding time–frequency representations in the lower panel, where the horizontal axis represents frequency and the vertical axis is associated with time.
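For concreteness, the construction of Eqs. (1)–(4) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the sampling rate and frequency grid are placeholders, and the per-scale $1/\sqrt{d}$ normalization of the wavelet is a common convention that the paper does not specify.

```python
import numpy as np

def morlet_power(signal, freqs, fs, w0=6.0):
    """Amplitude of the complex Morlet wavelet transform, Eqs. (1)-(3).

    signal : (n,) time-domain EEG of one channel
    freqs  : (F,) receptive frequencies f in Hz
    returns: (n, F) nonnegative spectral matrix P^(k)
    """
    n = len(signal)
    t = (np.arange(n) - n // 2) / fs                  # centered time axis [s]
    P = np.empty((n, len(freqs)))
    for j, f in enumerate(freqs):
        d = (w0 + np.sqrt(2.0 + w0**2)) / (4.0 * np.pi * f)        # Eq. (2)
        eta = t / d                                                # scaled, shifted time
        psi = np.pi**-0.25 * np.exp(1j * w0 * eta - eta**2 / 2.0)  # Eq. (1)
        psi /= np.sqrt(d)        # assumed per-scale normalization, not fixed by the paper
        P[:, j] = np.abs(np.convolve(signal, psi, mode="same"))    # Eq. (3)
    return P

def build_data_matrix(channels, freqs, fs):
    """Eq. (4): stack the K per-channel (n, F) matrices into X of shape (n, KF)."""
    return np.hstack([morlet_power(c, freqs, fs) for c in channels])
```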
Fig. 1. The classification accuracy averaged over 140 trials is plotted for the three methods (NMF, KNMF, and FS). All three methods work well, while KNMF is slightly better than the other two.

2.2. NMF and feature extraction

NMF seeks a decomposition of $X \in \mathbb{R}^{n \times m}$ of the form

$$X \approx U V^\top,$$   (5)

where $U \in \mathbb{R}^{n \times r}$ and $V \in \mathbb{R}^{m \times r}$ are restricted to be nonnegative matrices as well. Matrices $U$ and $V$, in our problem setting, are interpreted as follows (see also Fig. 4).

- Column vectors of $V$ are basis vectors which reflect $r$ representative spectral characteristics learned from an ensemble of EEG data samples.
- Row vectors of $U$ are features, reflecting how the learned spectral patterns are encoded in the spectral data vectors (the row vectors of $X$).

We provide a simple alternative derivation of the multiplicative updates for least squares (LS) NMF, which were originally developed by Lee and Seung [10]. We consider the LS objective function

$$J = \tfrac{1}{2} \| X - U V^\top \|^2,$$   (6)

where $\|X\|$ denotes the Frobenius norm of a matrix $X$. Taking the nonnegativity constraints $U \ge 0$ and $V \ge 0$ into account, the Lagrangian is given by

$$L = \tfrac{1}{2} \| X - U V^\top \|^2 - \mathrm{tr}\{\Lambda U^\top\} - \mathrm{tr}\{\Omega V^\top\},$$   (7)

where $\Lambda \ge 0$ and $\Omega \ge 0$ are Lagrange multipliers.

The KKT optimality conditions require

$$U V^\top V - X V = \Lambda,$$   (8)

$$V U^\top U - X^\top U = \Omega,$$   (9)

which result from $\partial L / \partial U = 0$ and $\partial L / \partial V = 0$, respectively. The KKT complementary slackness conditions require

$$[U V^\top V - X V] \odot U = 0,$$   (10)

$$[V U^\top U - X^\top U] \odot V = 0,$$   (11)

where $\odot$ denotes the Hadamard product (element-wise multiplication). These suggest the following multiplicative updates:

$$U \leftarrow U \odot \frac{X V}{U V^\top V},$$   (12)

$$V \leftarrow V \odot \frac{X^\top U}{V U^\top U},$$   (13)

where the division is performed element-wise. These rules are exactly the same as those proposed by Lee and Seung [10].

When a test data matrix $X_{\mathrm{test}}$ is given, its associated feature matrix $U_{\mathrm{test}}$ can be computed in two different ways:

- The feature matrix $U_{\mathrm{test}}$ is determined by LS projection,

$$U_{\mathrm{test}} = X_{\mathrm{test}} [V^\top]^\dagger,$$   (14)

where $\dagger$ represents the pseudo-inverse. In this case, $U_{\mathrm{test}}$ might have negative elements but still works well [11].
- We iterate the update rule (12) until convergence, with $V$ (learned in the training phase) fixed.

Sparseness or orthogonality constraints have been suggested for a more fruitful representation in the framework of NMF [14,6,2].
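The updates (12)–(13) and the projection (14) above translate directly into code. Below is a minimal NumPy sketch under assumptions the text leaves open: random initialization, a fixed number of iterations instead of a convergence test, and a small eps guarding against division by zero.

```python
import numpy as np

def nmf(X, r, n_iter=500, eps=1e-9, seed=0):
    """Least squares NMF via the multiplicative updates (12)-(13)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = rng.random((n, r))
    V = rng.random((m, r))
    for _ in range(n_iter):
        U *= (X @ V) / (U @ V.T @ V + eps)      # Eq. (12)
        V *= (X.T @ U) / (V @ U.T @ U + eps)    # Eq. (13)
    return U, V

def nmf_test_features(X_test, V):
    """LS projection of Eq. (14); the result may contain negative entries."""
    return X_test @ np.linalg.pinv(V.T)
```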
Fig. 2. Basis vectors computed by NMF and KNMF, as well as representative patterns selected by FS. In each plot, the vertical axis represents frequency components between 4 and 30 Hz; the upper half is for C3 and the lower half for C4. The number of factors was chosen as $r = 6$ for both NMF and KNMF. FS chose 10 features, where blue is associated with "not selected". The selected pattern is closely related to the NMF and KNMF factors. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

3. Kernel NMF

3.1. Algorithm

A simple trick for developing kernel NMF is to assume that the feature profiles (column vectors of $U$) are convex combinations of the frequency profiles (column vectors of $X$) of the spectral data in the NMF decomposition, i.e.,

$$U = X W,$$   (15)

where each column of $W$ satisfies the sum-to-one constraint. Incorporating the assumption (15) into the LS objective function (6) yields

$$J_K = \tfrac{1}{2} \| X - X W V^\top \|^2.$$   (16)

In order to handle the inequality constraints $W \ge 0$ and $V \ge 0$, we consider the Lagrangian, as in (7),

$$L_K = \tfrac{1}{2} \| X - X W V^\top \|^2 - \mathrm{tr}\{\Lambda W^\top\} - \mathrm{tr}\{\Omega V^\top\},$$   (17)

where $\Lambda$ and $\Omega$ are Lagrange multipliers.

As in (8) and (9), the KKT optimality conditions require

$$X^\top X W V^\top V - X^\top X V = \Lambda,$$   (18)

$$V W^\top X^\top X W - X^\top X W = \Omega,$$   (19)

which are the consequences of $\partial L_K / \partial W = 0$ and $\partial L_K / \partial V = 0$, respectively.

Define a kernel matrix (Gram matrix) as $K = X^\top X$, whose $(i,j)$-entry represents the similarity between two frequency profiles $i$ and $j$. Then, as in (10) and (11), the KKT complementary slackness conditions yield

$$[K W V^\top V - K V] \odot W = 0,$$   (20)

$$[V W^\top K W - K W] \odot V = 0.$$   (21)

These relations lead to the following multiplicative updates:

$$V \leftarrow V \odot \frac{K W}{V W^\top K W},$$   (22)

$$W \leftarrow W \odot \frac{K V}{K W V^\top V}.$$   (23)

The first idea of KNMF was proposed in [4]. We elaborate on it further, developing the multiplicative updates in a different way as well as applying them to a task of EEG classification. As in [4], one can easily extend KNMF to the case where the data matrix is free from sign restriction. Throughout this paper we use only the linear kernel $K = X^\top X$. However, the method holds for $K = \Phi^\top \Phi$, where $\Phi = [\phi(X_{ij})]$ is a transformed matrix in a feature space.

We also provide an alternative derivation of the rules (22) and (23). Suppose that the gradient of the objective function (16) has a decomposition of the form

$$\nabla J_K = [\nabla J_K]^+ - [\nabla J_K]^-,$$   (24)

where $[\nabla J_K]^+ > 0$ and $[\nabla J_K]^- > 0$. In such a case, the multiplicative update for the parameters $\Theta$ has the form

$$\Theta \leftarrow \Theta \odot \left( \frac{[\nabla J_K]^-}{[\nabla J_K]^+} \right)^{.\eta},$$   (25)

where $(\cdot)^{.\eta}$ denotes the element-wise power and $\eta$ is a learning rate ($0 < \eta \le 1$). It can easily be seen that the multiplicative update (25) preserves the nonnegativity of the parameter $\Theta$, while $\nabla J_K = 0$ when convergence is achieved [2]. Note that the derivatives with respect to $W$ and $V$ are computed as

$$\frac{\partial J_K}{\partial W} = \left[\frac{\partial J_K}{\partial W}\right]^+ - \left[\frac{\partial J_K}{\partial W}\right]^- = K W V^\top V - K V,$$

$$\frac{\partial J_K}{\partial V} = \left[\frac{\partial J_K}{\partial V}\right]^+ - \left[\frac{\partial J_K}{\partial V}\right]^- = V W^\top K W - K W.$$

Thus, invoking the relation (25) with $\eta = 1$, one easily derives the multiplicative updates (22) and (23).
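A sketch of the updates (22)–(23) with the linear kernel $K = X^\top X$, under the same assumptions as the NMF sketch above; normalizing the initial columns of $W$ to sum to one is our simple way of starting from a convex combination, since the updates themselves do not renormalize.

```python
import numpy as np

def knmf(X, r, n_iter=500, eps=1e-9, seed=0):
    """KNMF via the multiplicative updates (22)-(23), linear kernel K = X^T X."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    K = X.T @ X                                   # Gram matrix of frequency profiles
    W = rng.random((m, r))
    W /= W.sum(axis=0, keepdims=True)             # columns start as convex weights
    V = rng.random((m, r))
    for _ in range(n_iter):
        V *= (K @ W) / (V @ (W.T @ K @ W) + eps)  # Eq. (22)
        W *= (K @ V) / (K @ W @ (V.T @ V) + eps)  # Eq. (23)
    return W, V                                   # training features are U = X @ W
```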
Fig. 3. Exemplary EEG data (IDIAP dataset, the details of which are explained in Section 4.1) are shown in the time domain (upper panel) and in the time–frequency domain (lower panel). Waveforms of EEG in the time domain are shown in the upper panel, each measured at eight different channels (C3, Cz, C4, CP1, CP2, P3, Pz, P4). The corresponding time–frequency representations are shown in the lower panel, where frequency (horizontal axis in each plot) ranges over [8, 10, 12, ..., 28, 30] (i.e., the number of frequency bands is 12). In this case, the data matrix $X \in \mathbb{R}^{n \times 96}$ is constructed by collecting 12 frequency profiles at each channel ($96 = 12 \times 8$).

3.2. Sparsity of KNMF

Our KNMF with a linear kernel is a special case of convex-NMF [15]. Thus, it follows from the result in [15] that the factor matrices $W$ and $V$ in KNMF are naturally sparse. We briefly review this sparsity result [15]. Then we provide an illustrative example, showing a useful behavior of KNMF due to the sparsity property and confirming the sparsity of the factor matrices in KNMF when it is applied to spectral EEG data.

Suppose that the singular value decomposition (SVD) of the data matrix $X$ is given by $X = P \Sigma Q^\top$. Then, the objective value in the optimization problem (16) can be rewritten as

$$\| X - X W V^\top \|^2 = \mathrm{tr}\{ (I - V W^\top) X^\top X (I - W V^\top) \} = \sum_{j=1}^{m} \sigma_j^2 \, \| q_j^\top (I - W V^\top) \|^2,$$   (26)

where $\sigma_j$ is the $j$th diagonal entry of $\Sigma$, $q_j$ is the $j$th column vector of $Q$, and $X^\top X = \sum_{j=1}^{m} \sigma_j^2 q_j q_j^\top$ for $m < n$.

The term $q_j^\top (I - W V^\top)$ in (26) is the projection of $(I - W V^\top)$ onto a principal direction, so the optimization problem (16) can be viewed as minimizing the sum of the differences between the data and its projected values. One can easily see that in (26) the projection of $(I - W V^\top)$ onto principal directions carries larger weights, while the projection onto nonprincipal directions carries smaller weights. It was pointed out that sparsity is therefore enforced heavily in the principal subspace and lightly in the nonprincipal subspace. It was shown in [15] that KNMF with a linear kernel tends to provide sparse solutions.

Fig. 5 shows bases computed by NMF and KNMF from the Graz data (described in Section 4.2), where the C3 and C4 channels were used in motor imagery EEG data. The bases are associated with the distributions of frequency components in the range 4–30 Hz. One can easily see that the bases of KNMF are sparser than those of NMF.

Now we provide an illustrative example, showing that the sparsity of the factor matrices in KNMF is useful for determining discriminative basis vectors, compared to standard NMF. To this end, we first generate a data matrix $X$:

$$X = \begin{bmatrix} 0.3692 & 0.1320 & 0.8212 & 0.4509 & 1.3685 \\ 0.1112 & 0.9421 & 0.0154 & 0.5470 & 1.6256 \\ 0.7803 & 0.9561 & 0.0430 & 0.2963 & 1.7802 \\ 0.3897 & 1.5752 & 1.1690 & 0.7447 & 0.0811 \\ 0.2417 & 1.0598 & 1.6491 & 0.1890 & 0.9294 \\ 0.4039 & 1.2348 & 1.7317 & 0.6868 & 0.7757 \\ 0.0965 & 1.3532 & 1.6477 & 0.1835 & 0.4868 \end{bmatrix}.$$

The first three rows were generated by adding uniformly distributed ($\mathcal{U}[0,1]$) random numbers to $[0, 0, 0, 0, 1]$. The last four rows were constructed by adding uniformly distributed ($\mathcal{U}[0,1]$) random numbers to $[0, 1, 1, 0, 0]$.

In an ideal case, it is expected that NMF or KNMF computes two representative basis vectors, $[0, 0, 0, 0, 1]^\top$ and $[0, 1, 1, 0, 0]^\top$, and identifies the two clusters (the first three rows and the last four rows of $X$). The basis matrices computed by NMF and KNMF are

$$V_{\mathrm{NMF}} = \begin{bmatrix} 0.3402 & 0.1865 \\ 0.4411 & 1.4140 \\ 0.0000 & 2.0089 \\ 0.2842 & 0.4462 \\ 1.3386 & 0.0441 \end{bmatrix}, \qquad V_{\mathrm{KNMF}} = \begin{bmatrix} 0.3435 & 0.0442 \\ 0.2616 & 0.7895 \\ 0.0000 & 1.0587 \\ 0.2274 & 0.2324 \\ 1.2439 & 0.0000 \end{bmatrix}.$$

Each column vector of $V_{\mathrm{NMF}}$ and $V_{\mathrm{KNMF}}$ resembles the ideal basis vectors, since the last entry of the first column vector is much larger than the remaining entries, and the second and third entries of the second column vector are much larger than the remaining entries. We compute the sparseness of each column vector of $V_{\mathrm{NMF}}$ and $V_{\mathrm{KNMF}}$ using the measure in [6], defined by

$$\xi(x) = \frac{\sqrt{m} - \left( \sum_i |x_i| \right) \big/ \sqrt{\sum_i x_i^2}}{\sqrt{m} - 1},$$   (27)

where $x_i$ is the $i$th element of the $m$-dimensional vector $x$. The sparseness is computed as $(0.4846, 0.4926)$ for $V_{\mathrm{NMF}}$ and $(0.5518, 0.5278)$ for $V_{\mathrm{KNMF}}$. The difference in sparseness between $V_{\mathrm{NMF}}$ and $V_{\mathrm{KNMF}}$ is not large; in both cases, the basis vectors are close to the ideal ones.
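The measure (27) ranges from 0 for a maximally dense vector to 1 for a vector with a single nonzero entry. A small sketch (the eps guard for an all-zero input is our addition; (27) itself is undefined there):

```python
import numpy as np

def sparseness(x, eps=1e-12):
    """Hoyer's sparseness of Eq. (27) for a 1-D array x."""
    m = x.size
    ratio = np.abs(x).sum() / (np.sqrt((x**2).sum()) + eps)
    return (np.sqrt(m) - ratio) / (np.sqrt(m) - 1.0)

# e.g., per-column sparseness of a basis matrix V:
# [sparseness(V[:, j]) for j in range(V.shape[1])]
```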
We compute the encoding matrices by NMF and KNMF, which are given by

$$U_{\mathrm{NMF}} = \begin{bmatrix} 0.8937 & 0.2211 \\ 1.2362 & 0.0989 \\ 1.4090 & 0.0900 \\ 0.2201 & 0.7642 \\ 0.5598 & 0.7165 \\ 0.5484 & 0.8232 \\ 0.2859 & 0.8125 \end{bmatrix}, \qquad U_{\mathrm{KNMF}} = X W = X \begin{bmatrix} 0.2019 & 0.0000 \\ 0.0230 & 0.3730 \\ 0.0000 & 0.6144 \\ 0.0708 & 0.0126 \\ 0.7043 & 0.0000 \end{bmatrix} = \begin{bmatrix} 1.0733 & 0.5595 \\ 1.2277 & 0.3677 \\ 1.4543 & 0.3868 \\ 0.2248 & 1.3152 \\ 0.7411 & 1.4109 \\ 0.7049 & 1.5332 \\ 0.4064 & 1.5194 \end{bmatrix}.$$

Each row vector of the encoding matrices $U_{\mathrm{NMF}}$ and $U_{\mathrm{KNMF}}$ serves as an indicator variable for the clusters. For example, the first row of $U_{\mathrm{NMF}}$ has two entries, 0.8937 and 0.2211; the first element is larger than the second, which implies that the associated data point (the first row vector of $X$) belongs to cluster 1. Investigating $U_{\mathrm{NMF}}$ and $U_{\mathrm{KNMF}}$ in this way, one sees that both encoding matrices identify the first three row vectors of $X$ as cluster 1 and the last four row vectors of $X$ as cluster 2.

We provide another example with a data matrix that has a lower signal-to-noise ratio. We consider the data matrix $X$ given by

$$X = \begin{bmatrix} 0.3371 & 0.2630 & 0.9133 & 0.1067 & 0.8998 \\ 0.1622 & 0.6541 & 0.1524 & 0.9619 & 0.7599 \\ 0.7943 & 0.6892 & 0.8258 & 0.0046 & 1.3001 \\ 0.3112 & 1.2482 & 1.0383 & 0.7749 & 0.4314 \\ 0.5285 & 0.9505 & 1.4961 & 0.8173 & 0.9106 \\ 0.1656 & 0.5838 & 0.5782 & 0.8687 & 0.1818 \\ 0.6020 & 0.7290 & 0.9427 & 0.0844 & 0.2638 \end{bmatrix}.$$

The first three rows were generated by adding uniformly distributed ($\mathcal{U}[0,1]$) random numbers to $[0, 0, 0, 0, 0.5]$, and the last four rows by adding uniformly distributed ($\mathcal{U}[0,1]$) random numbers to $[0, 0.5, 0.5, 0, 0]$. The data matrix was generated in a way similar to the previous example; the distinction is the lower signal-to-noise ratio (0.5 is used instead of 1) with the same pattern of zeros and nonzeros. The basis matrices computed by NMF and KNMF are

$$V_{\mathrm{NMF}} = \begin{bmatrix} 0.7356 & 0.0534 \\ 0.6297 & 1.0885 \\ 1.1661 & 0.6776 \\ 0.0000 & 1.3856 \\ 1.1416 & 0.1694 \end{bmatrix}, \qquad V_{\mathrm{KNMF}} = \begin{bmatrix} 0.4694 & 0.1771 \\ 0.0000 & 1.0268 \\ 0.0020 & 1.1943 \\ 0.0000 & 0.7247 \\ 1.1252 & 0.0000 \end{bmatrix}.$$

We used (27) to compute the sparseness of each column vector of $V_{\mathrm{NMF}}$ and $V_{\mathrm{KNMF}}$, which is given by $(0.2430, 0.3693)$ and $(0.7495, 0.3593)$, respectively. The sparseness difference is larger than in the previous example. Moreover, the basis vectors of KNMF still look close to $[0, 0, 0, 0, 0.5]^\top$ and $[0, 0.5, 0.5, 0, 0]^\top$, since the last entry of the first column vector is much larger than the remaining entries and the second and third entries of the second column vector are
much larger than the remaining entries, while the second column vector of $V_{\mathrm{NMF}}$ does not show this pattern. The corresponding encoding matrices are computed as

$$U_{\mathrm{NMF}} = \begin{bmatrix} 0.6901 & 0.0124 \\ 0.1759 & 0.5521 \\ 0.9624 & 0.0000 \\ 0.4251 & 0.6951 \\ 0.7794 & 0.5652 \\ 0.1049 & 0.5817 \\ 0.5485 & 0.1917 \end{bmatrix}, \qquad U_{\mathrm{KNMF}} = X W = X \begin{bmatrix} 0.2539 & 0.0408 \\ 0.0001 & 0.3361 \\ 0.0016 & 0.3848 \\ 0.0000 & 0.2366 \\ 0.7445 & 0.0017 \end{bmatrix} = \begin{bmatrix} 0.7569 & 0.4804 \\ 0.6072 & 0.5139 \\ 1.1709 & 0.5852 \\ 0.4019 & 1.0158 \\ 0.8146 & 1.1117 \\ 0.1784 & 0.6313 \\ 0.3508 & 0.6528 \end{bmatrix},$$

where one can see that $U_{\mathrm{KNMF}}$ reflects the correct clusters (in the first three rows the left elements are larger than the right elements, and in the last four rows the right elements are larger than the left elements), while $U_{\mathrm{NMF}}$ fails to identify the clusters correctly.

3.3. KNMF for spectral feature extraction

Given a basis matrix $V$ learned by NMF, the feature matrix $U$ is computed by iterating the update (12) with $V$ fixed. Alternatively, LS projection is applied to compute $U = X [V^\top]^\dagger$, although $U$ is then not necessarily a nonnegative matrix.

In KNMF, on the other hand, the feature matrix is computed as $U = X W$. Thus, features associated with unseen data $x_{\mathrm{test}} \in \mathbb{R}^m$ are easily computed by

$$u_{\mathrm{test}} = x_{\mathrm{test}}^\top W.$$   (28)

We can use KNMF for data selection or feature selection. Recall that our data matrix has the form $X = [x_1, \ldots, x_n]^\top$, where $x_i$ is an $m$-dimensional data vector. We define the relative importance of the column vectors of $X$ as

$$R_j = \frac{\| W_{j,:} \|}{\| W \|},$$   (29)

where $W_{j,:}$ (for $j = 1, \ldots, m$) is the $j$th row vector of $W$. For data selection, we sort the column vectors of $X$ in descending order of $R_j$ and choose the column vectors associated with the $k \le m$ largest values of $R_j$. The same idea can be applied to $X^\top$, which yields feature selection.

4. Numerical experiments

Numerical experiments for EEG classification are conducted in three steps, where pre-processing and classification are common to the three spectral feature extraction methods considered for comparison:

(1) Pre-processing: Transformation of the time-domain EEG signals to the time–frequency domain by a wavelet transform or the short-time Fourier transform.
(2) Spectral feature extraction (a code sketch of steps (b) and (c) follows Fig. 4 below):
  (a) NMF: The basis matrix $V$ is learned by the multiplicative updates (12) and (13), and the feature matrix associated with the test data matrix $X_*$ is computed by $U_* = X_* [V^\top]^\dagger$.
  (b) KNMF: The factor matrix $W$ is learned by the multiplicative updates (22) and (23), and the feature matrix associated with the test data matrix $X_*$ is computed by $U_* = X_* W$.
  (c) Feature selection using KNMF, referred to as FS: Apply KNMF to $X^\top = [x_1, \ldots, x_n] \in \mathbb{R}^{m \times n}$ (instead of $X$), estimating the factor matrix $W$. Choose the row vectors of $X$ associated with the $k$ largest values of the relative importance $R_j$ computed by (29). The row vectors of the feature matrix correspond to these selected row vectors of $X$.
(3) Classification: Time-dependent linear discriminant analysis [13,11] or the Viterbi algorithm [12], depending on the task and data type.

Table 1. Classification accuracy of IDIAP data.

Methods            Sub1    Sub2    Sub3    Avg.
NMF                84.93   70.51   55.28   70.24
KNMF               85.16   75.58   58.26   73.00
FS                 83.56   76.73   54.13   71.47
BCI comp. winner   79.60   70.31   56.02   68.65
Fig. 4. Pictorial illustration of feature extraction using NMF in the case of the Graz dataset (the details of the dataset are explained in Section 4.2). The data matrix $X$ is constructed by collecting frequency profiles (from the two channels C3 and C4) in its columns. NMF is applied to determine the two factor matrices $U$ and $V$ with $r = 4$. Row vectors of $V^\top$ are basis vectors, each of which represents a discriminative frequency pattern. The first two rows reveal the characteristics of the $\mu$ rhythm (8–12 Hz) for the C3 and C4 channels. The last two rows represent the $\beta$ rhythm around 15 Hz for both channels. Columns of $U$ are the corresponding feature profiles.
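The code sketch announced in step (2) above: Eq. (28) makes KNMF test features a single matrix product, and Eq. (29) ranks the columns of $X$ for selection. $W$ and $k$ come from the training phase; the function names are ours.

```python
import numpy as np

def knmf_test_features(X_test, W):
    """Eq. (28): test features by one matrix multiplication."""
    return X_test @ W

def select_by_importance(W, k):
    """Indices of the k rows of W (i.e., columns of X) with largest R_j, Eq. (29)."""
    R = np.linalg.norm(W, axis=1) / np.linalg.norm(W)   # row norms over Frobenius norm
    return np.argsort(R)[::-1][:k]
```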
For our empirical study we used two datasets: one is dataset V of BCI competition III, provided by the IDIAP Research Institute [7]; the other is dataset III of BCI competition II, provided by the Laboratory of Brain–Computer Interfaces (BCI-Lab), Graz University of Technology [1,13].

4.1. IDIAP dataset

4.1.1. Data description

The IDIAP dataset contains EEG data recorded from three normal subjects and involves three tasks: the imagination of repetitive self-paced left/right hand movements and the generation of words beginning with the same random letter. In contrast to the Graz dataset, the EEG data are not split into trials, since the subjects are continuously performing one of the mental tasks (i.e., there is no trial structure).

We use the precomputed features obtained from the power spectral density (PSD) in the band 8–30 Hz, computed every 62.5 ms (i.e., 16 times per second) over the last second of data with a frequency resolution of 2 Hz, for the eight centro-parietal channels C3, Cz, C4, CP1, CP2, P3, Pz, and P4, after the raw EEG potentials were first spatially filtered by means of a surface Laplacian. As a result, an EEG sample is a 96-dimensional vector (eight channels times 12 frequency components).

4.1.2. Preprocessing

The data matrix $X_{\mathrm{train}} \in \mathbb{R}^{10528 \times 96}$ is constructed by normalizing the spectral components $P_{t,f}^{(k)}$ (the precomputed features), i.e.,

$$\bar{P}_{t,f}^{(k)} = \frac{P_{t,f}^{(k)}}{\sum_f P_{t,f}^{(k)}},$$   (30)

for $f \in \{8, 10, \ldots, 28, 30\}$ Hz, $k = 1, 2, \ldots, 8$ (corresponding to the eight channels C3, Cz, C4, CP1, CP2, P3, Pz, and P4), and $t = 1, \ldots, 10528$, where 10528 is the number of data points in the training set (note that there is no trial structure in this dataset). In the same way we construct the test data matrix, $X_{\mathrm{test}} \in \mathbb{R}^{3504 \times 96}$.

4.1.3. Feature extraction and classification

Feature matrices $U \in \mathbb{R}^{n \times r}$ computed by the three different methods, NMF, KNMF, and FS, are used as inputs to the Viterbi algorithm [5] for on-line classification of un-cued EEG signals. More details on the Viterbi classifier are found in [12].

Classification accuracy is summarized in Table 1. KNMF outperforms NMF across all subjects and shows the best performance in most cases. The FS method shows performance comparable to NMF, while providing easily interpretable features. Fig. 6 shows the relative importance of spatial (channel-related) and spectral (frequency-related) features for each subject. Each subject has his or her own spatial and spectral features.

4.2. Graz dataset

4.2.1. Data description

The Graz dataset involves left/right imagery hand movements and consists of 140 labelled trials for training and 140 unlabelled trials for testing. Each trial has a duration of 9 s: a visual cue (arrow) pointing to the left or the right is presented after a 3-s preparation period, and the imagination task is carried out for the remaining 6 s. The dataset contains EEG acquired from three different channels (with sampling frequency 128 Hz), C3, Cz, and C4. In our study we use only two channels, C3 and C4, because event-related desynchronization has contralateral dominance and the Cz channel contains little information for discriminant analysis. The requirement for result comparison is to provide a continuous classification accuracy for each time point of a trial during the imagination session.

4.2.2. Preprocessing

By filtering the data with complex Morlet wavelets [11], we obtain the time–frequency representation of the EEG data, $P_{it,f}^{(k)}$, for $f \in \{4, 5, \ldots, 30\}$ Hz, $k = 1, 2$ (C3 and C4 channels), $t = 1, \ldots, T$, and $i = 1, \ldots, N$, where $T$ is the number of data points in each trial and $N$ is the number of trials. The data matrices are reshaped as $X_{\mathrm{train}}, X_{\mathrm{test}} \in \mathbb{R}^{(27 \cdot 2) \times (760 \cdot 140)}$, where $T = 760$ and $N = 140$.
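Both preprocessing steps reduce to simple array manipulations. The sketch below shows the per-sample normalization (30) used for the IDIAP features and the reshape used for the Graz wavelet amplitudes; holding the Graz amplitudes in an (N, T, F·K) array is our layout choice, not the paper's.

```python
import numpy as np

def normalize_psd(P):
    """Eq. (30): P is (n, F) for one channel; each row is scaled to sum to one."""
    return P / (P.sum(axis=1, keepdims=True) + 1e-12)

def reshape_graz(P):
    """(N, T, F*K) per-trial wavelet amplitudes -> (F*K, T*N) data matrix."""
    N, T, m = P.shape
    return P.reshape(N * T, m).T
```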
4.2.3. Feature extraction and classification

The procedure for feature extraction is the same as above. For single-trial online classification of the Graz data (which has trial structure), we use a Gaussian probabilistic model-based classifier [13], where the Gaussian class-conditional probabilities for a single point in time $t$ are integrated temporally by taking the expectation of the class probabilities with respect to the discriminative power at each point in time. The classification results are shown in Fig. 1; the final accuracies are 89.29% (NMF), 90.00% (KNMF), and 88.57% (FS).

The basis vectors of NMF and KNMF are shown in the left and middle plots of Fig. 2, respectively. The averaged sparseness of the KNMF factors is 0.7104, while that of the NMF factors is 0.3227; thus, the KNMF factors are sparser than the NMF factors. The right plot of Fig. 2 shows the selected features. Using only the 10 selected features, we obtain a classification result similar to that in Fig. 1. Fig. 7 shows the 100 data points selected from the 106,400 data points (only 7000 randomly chosen data points are plotted in the figure). For visualization, we reduce the dimension from 54 to 2 using KNMF; the corresponding basis vectors are shown in the left panel of Fig. 7. Although the data are selected in an unsupervised framework, they are representative and discriminative data points of each class.

Fig. 5. Eight basis vectors computed by NMF (blue solid line) and by KNMF (red dotted line), using the Graz dataset. Basis vectors are sorted according to the mutual information between the NMF and KNMF factors. KNMF produces sparser factors. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

5. Conclusions

We have presented multiplicative updates for KNMF, a kernelized version of standard NMF. Assuming $U = XW$, KNMF was easily derived in the framework of LS optimization
5. Conclusions

We have presented multiplicative updates for KNMF, a kernelized version of the standard NMF. Assuming $U = XW$, KNMF was easily derived in the framework of LS optimization with nonnegativity constraints. We have successfully applied KNMF to the task of learning discriminative spectral features from EEG data for classification. Experiments on two benchmark EEG datasets confirmed the performance gain over standard NMF. Although KNMF was first proposed in [4], we further elaborated it here and confirmed its useful behavior (ease of feature extraction and sparsity) on real-world EEG data.
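As a companion to this summary, here is a minimal sketch of multiplicative updates consistent with the LS derivation: substituting $U = XW$ into $\|X - UV^T\|_F^2$ gives an objective that depends on the data only through the kernel matrix $K = X^T X$, and multiplicative updates follow from the positive and negative parts of the gradient. These updates are reconstructed from that derivation rather than copied from the paper, so details may differ from the authors' exact rules.

```python
import numpy as np

def knmf(X, r, n_iter=500, eps=1e-9, seed=0):
    """Multiplicative updates for the LS objective ||X - X W V^T||_F^2
    with W, V >= 0. The data enter only through K = X^T X, so a
    nonlinear kernel matrix could be substituted for K."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    K = X.T @ X                           # linear kernel; nonnegative for X >= 0
    W = rng.random((m, r))
    V = rng.random((m, r))
    for _ in range(n_iter):
        W *= (K @ V) / (K @ W @ (V.T @ V) + eps)   # from dF/dW = -2KV + 2KWV^TV
        V *= (K @ W) / (V @ (W.T @ K @ W) + eps)   # from dF/dV = -2KW + 2VW^TKW
    return W, V                           # basis U = X @ W, encodings V
```

With a linear kernel the basis is recovered as U = X @ W and, as exploited in Section 4, features for new data are obtained by a single multiplication with W.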
Acknowledgments

This work was supported by the KOSEF Basic Research Program (Grant R01-2006-000-11142-0) and the National Core Research Center for Systems Bio-Dynamics.

Fig. 6. The relative importance of spatial and spectral features is shown for three different subjects, in the left (channels) and right (frequency components) panels, respectively. That is, the left panel shows which channels are more important, while the right panel shows which frequency components are more important in a given task.

Fig. 7. The left panel shows two basis vectors determined by KNMF with r = 2; each reveals the μ rhythm in the C3 and C4 channels, respectively. The right panel shows the associated encoding variables (linear low-dimensional embeddings into two-dimensional Euclidean space), where red circles represent the "LEFT" class and blue crosses represent the "RIGHT" class. The green solid line is a hyperplane which bisects the line connecting the two centers determined by k-means clustering, taking the scatter of each cluster into account. Among 10 000 data points, we selected 100 data points by our data selection scheme. Those selected data points were clustered along either the horizontal axis (cyan) or the vertical axis (magenta), implying that the selected data points are representative and discriminative. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

References

[1] B. Blankertz, K.-R. Müller, G. Curio, T.M. Vaughan, G. Schalk, J.R. Wolpaw, A. Schlögl, C. Neuper, G. Pfurtscheller, T. Hinterberger, M. Schröder, N. Birbaumer, The BCI competition 2003: progress and perspectives in detection and discrimination of EEG single trials, IEEE Transactions on Biomedical Engineering 51 (6) (2004).
[2] S. Choi, Algorithms for orthogonal nonnegative matrix factorization, in: Proceedings of the International Joint Conference on Neural Networks, Hong Kong, 2008.
[3] A. Cichocki, H. Lee, Y.-D. Kim, S. Choi, Nonnegative matrix factorization with α-divergence, Pattern Recognition Letters 29 (9) (2008) 1433–1440.
[4] C. Ding, T. Li, M.I. Jordan, Convex and semi-nonnegative matrix factorizations, Technical Report 60428, Lawrence Berkeley National Laboratory, 2006.
[5] G.D. Forney, The Viterbi algorithm, Proceedings of the IEEE 61 (1973) 268–278.
[6] P.O. Hoyer, Non-negative matrix factorization with sparseness constraints, Journal of Machine Learning Research 5 (2004) 1457–1469.
[7] J. del R. Millán, On the need for on-line learning in brain–computer interfaces, in: Proceedings of the International Joint Conference on Neural Networks, Budapest, Hungary, 2004.
[8] T.N. Lal, M. Schröder, T. Hinterberger, J. Weston, M. Bogdan, N. Birbaumer, B. Schölkopf, Support vector channel selection in BCI, Technical Report 120, Max Planck Institute for Biological Cybernetics, 2003.
[9] D.D. Lee, H.S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401 (1999) 788–791.
[10] D.D. Lee, H.S. Seung, Algorithms for non-negative matrix factorization, in: Advances in Neural Information Processing Systems, 13, MIT Press, Cambridge, 2001.
[11] H. Lee, A. Cichocki, S. Choi, Nonnegative matrix factorization for motor imagery EEG classification, in: Proceedings of the International Conference on Artificial Neural Networks, Athens, Greece, Springer, Berlin, 2006.
[12] H. Lee, Y.-D. Kim, A. Cichocki, S. Choi, Nonnegative tensor factorization for continuous EEG classification, International Journal of Neural Systems 17 (4) (2007) 305–317.
[13] S. Lemm, C. Schäfer, G. Curio, BCI competition 2003-data set III: probabilistic modeling of sensorimotor μ rhythms for classification of imaginary hand movements, IEEE Transactions on Biomedical Engineering 51 (6) (2004).
[14] S.Z. Li, X.W. Hou, H.J. Zhang, Q.S. Cheng, Learning spatially localized parts-based representation, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii, 2001, pp. 207–212.
[15] T. Li, C. Ding, The relationships among various nonnegative matrix factorization methods for clustering, in: Proceedings of the IEEE International Conference on Data Mining, Hong Kong, 2006.
[16] P. Paatero, U. Tapper, Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values, Environmetrics 5 (1994) 111–126.
[17] J.R. Wolpaw, N. Birbaumer, D.J. McFarland, G. Pfurtscheller, T.M. Vaughan, Brain–computer interfaces for communication and control, Clinical Neurophysiology 113 (2002) 767–791.

Hyekyoung Lee received the B.S. degree in electrical engineering in 2000, and the M.S. and Ph.D. degrees in computer science in 2003 and 2009, from Pohang University of Science and Technology, Pohang, Korea. Her research interests include statistical machine learning and EEG-based brain computer interface.

Andrzej Cichocki received the M.Sc. (with honors), Ph.D., and Dr.Sc. (Habilitation) degrees, all in electrical engineering, from Warsaw University of Technology (Poland). Since 1972, he has been with the Institute of Theory of Electrical Engineering, Measurement and Information Systems, Faculty of Electrical Engineering at the Warsaw University of Technology, where he obtained the title of full Professor in 1995. He spent several years at the University of Erlangen-Nürnberg (Germany), at the Chair of Applied and Theoretical Electrical Engineering directed by Professor Rolf Unbehauen, as an Alexander-von-Humboldt Research Fellow and Guest Professor. In 1995–1997 he was a team leader of the laboratory for Artificial Brain Systems at the Frontier Research Program RIKEN (Japan), in the Brain Information Processing Group. He is currently the head of the Laboratory for Advanced Brain Signal Processing at the RIKEN Brain Science Institute (Japan), in the Brain-Style Computing Group directed by Professor Shun-ichi Amari. He is a co-author of more than 250 scientific papers and three internationally recognized monographs (two of them translated to Chinese): Adaptive Blind Signal and Image Processing (Wiley, April 2003, revised edition), CMOS Switched-Capacitor and Continuous-Time Integrated Circuits and Systems (Springer, 1989), and Neural Networks for Optimizations and Signal Processing (Teubner-Wiley, 1994). He is Editor-in-Chief of the International Journal of Computational Intelligence and Neuroscience.
Seungjin Choi received the B.S. and M.S. degrees in Electrical Engineering from Seoul National University, Korea, in 1987 and 1989, respectively, and the Ph.D. degree in Electrical Engineering from the University of Notre Dame, Indiana, in 1996. He was a Visiting Assistant Professor in the Department of Electrical Engineering at the University of Notre Dame, Indiana, during the Fall semester of 1996. He was with the Laboratory for Artificial Brain Systems, RIKEN, Japan, in 1997, and was an Assistant Professor in the School of Electrical and Electronics Engineering, Chungbuk National University, from 1997 to 2000. He is currently a Professor of Computer Science at Pohang University of Science and Technology, Korea. His primary recent research has focused on statistical machine learning and probabilistic models, including manifold learning, matrix factorization, semi-supervised learning, kernel machines, and independent component analysis, with applications to brain computer interface, bioinformatics, information retrieval, and pattern recognition.