Relation Regularized Matrix Factorization

Wu-Jun Li & Dit-Yan Yeung
Department of Computer Science and Engineering
Hong Kong University of Science and Technology, Hong Kong, China
{liwujun, dyyeung}@cse.ust.hk

Abstract

In many applications, the data, such as web pages and research papers, contain relation (link) structure among entities in addition to textual content information. Matrix factorization (MF) methods, such as latent semantic indexing (LSI), have been successfully used to map either content information or relation information into a lower-dimensional latent space for subsequent processing. However, how to simultaneously model both the relation information and the content information effectively with an MF framework is still an open research problem. In this paper, we propose a novel MF method called relation regularized matrix factorization (RRMF) for relational data analysis. By using relation information to regularize the content MF procedure, RRMF seamlessly integrates both the relation information and the content information into a principled framework. We propose a linear-time learning algorithm with convergence guarantee to learn the parameters of RRMF. Extensive experiments on real data sets show that RRMF can achieve state-of-the-art performance.

1 Introduction

Matrix factorization (MF) methods [Singh and Gordon, 2008], which try to project objects (entities) into a lower-dimensional latent space, have been widely used for various data analysis applications.[1] One of the most popular MF methods is latent semantic indexing (LSI) [Deerwester et al., 1990], which uses singular value decomposition (SVD) to map the content of documents into a lower-dimensional latent semantic space. The objective is to retain as much information of the documents as possible while simultaneously removing the noise. Subsequent analysis, such as clustering or classification, can be performed based on this latent space representation. Maximum margin matrix factorization (MMMF) [Srebro et al., 2004; Rennie and Srebro, 2005], which can be seen as a regularized version of SVD obtained by controlling the complexity of the factors,[2] has been successfully used for many applications, such as collaborative filtering [Rennie and Srebro, 2005]. Many other variants of MF methods and their applications can be found in [Singh and Gordon, 2008].

[1] Due to the page limit constraint, many related MF references are not cited in this paper. We refer the readers to [Singh and Gordon, 2008] for many such references.

[2] MMMF can make use of different loss functions. The original MMMF [Srebro et al., 2004] only used the hinge loss, but many subsequent works also used other loss functions such as the smooth hinge [Rennie and Srebro, 2005] and the squared loss. Here we refer in particular to the squared loss function.

Although MF methods have achieved very promising performance in many applications, most of them are designed to handle only one matrix at a time. The matrix can be either the feature representation (content information) of a set of objects (entities), or the link structure (relation information) among a set of objects. For example, LSI was originally proposed for content-based information analysis by performing SVD on the bag-of-words representation of a set of documents, and MMMF has been successfully applied to collaborative filtering, which employs only the relationship among a set of entities [Rennie and Srebro, 2005].

However, the data in many applications, such as web pages and research papers, contain both textual content information and relation structure. These two kinds of information often complement each other because they are typically collected at different semantic levels.
For example, a citation/reference relation between two papers provides very strong evidence that they belong to the same topic, although sometimes they bear low similarity in their content due to the sparse nature of the bag-of-words representation. Similarly, the content information can also provide additional insights about the relation among entities. Hence, naively discarding either kind of information would not allow us to fully utilize all the relevant information available. Ideally, we should strive to integrate both kinds of information seamlessly into a common framework for data analysis.

In [Zhu et al., 2007], a joint link-content MF method, abbreviated as LCMF here, was proposed to seamlessly integrate content and relation information into an MF framework through a set of shared factors for the factorization of both the content and the link structures. The authors of LCMF argue that their method is more reasonable than some other simple methods for combining content and link information, such as that in [Kurland and Lee, 2005], which seek to convert
one type of information into the other. Experimental results in [Zhu et al., 2007] show that LCMF can outperform other state-of-the-art methods.

However, the links in different applications might capture different semantics. As a result, the model assumption adopted by LCMF might not be satisfied for some applications, which would cause LCMF to fail in such cases. Let us take a synthetic link structure from [Zhu et al., 2007] as an example to illustrate this point. This example is shown in Figure 1(a), in which there are eight nodes, corresponding to eight entities, and eight directed links. After performing link MF based on Zhu et al.'s method, the entities will be automatically grouped into five clusters, each corresponding to one ellipse in Figure 1(a). The learned latent factors of the entities [Zhu et al., 2007] are shown in Figure 1(b), in which the latent factors for V1 to V8 are listed from the first row to the last row in order. We can see that the entities in each cluster, say V2 and V3, have the same latent factor representation. From this example, it is not difficult to reveal the model assumption behind LCMF: if two entities link to or are linked by one common entity, the two entities will have similar latent factor representations. This is a very promising property if the application scenario indeed satisfies this assumption. One such application is the web page classification task on WebKB [Craven et al., 1998]. For example, the homepages of faculty members always link to or are linked by the homepage of the department, but the homepages of two faculty members seldom link to each other. The advantage of this property has been verified by the promising accuracy of LCMF on the WebKB data set [Zhu et al., 2007].

[Figure 1: A synthetic example of link structure from [Zhu et al., 2007] for illustration.
 (a) Synthetic link structure: eight nodes V1-V8 connected by eight directed links; ellipses indicate the five clusters found by link MF.
 (b) Result of link MF (learned latent factors, rows correspond to V1-V8 in order):
     V1: -.8 -.5  .3 -.1 -.0
     V2: -.0  .4  .6 -.1 -.4
     V3: -.0  .4  .6 -.1 -.4
     V4:  .3 -.2  .3 -.4  .3
     V5:  .3 -.2  .3 -.4  .3
     V6: -.4  .5  .0 -.2  .6
     V7: -.4  .5  .0 -.2  .6
     V8: -.1  .1 -.4 -.8 -.4]

However, one problem with LCMF is that the factor representations of two linked entities are not guaranteed to be similar. For example, the latent factor representation of V8 is relatively far away from that of V6 or V7 after LCMF learning. This property is undesirable for many other applications. For example, to classify research papers according to their topic, the citation (link) information is usually very important. More specifically, two papers with a citation relation between them are most likely about the same topic, and consequently, the learned latent factors of two linked papers should be similar. Unfortunately, this cannot be handled well by LCMF, which is also verified by the relatively poor performance of LCMF on the Cora data set [McCallum et al., 2000] for paper classification.

From the analysis above, we can see that there exist at least two types of links with different semantics. In this paper, we call the first type of links, such as those in the WebKB data set, Type I links, and the other type of links, such as those in the Cora data set, Type II links. As discussed above, LCMF can handle well applications with Type I links but not those with Type II links. It is this limitation that has motivated us to pursue the research reported in this paper.

We propose in this paper a novel MF method. The basic idea is to make the latent factors of two entities as close as possible if there exists a link between them.
More specifically, we utilize relation information to regularize the content MF procedure, resulting in a principled framework which seamlessly integrates content and relation information. We refer to our method as relation regularized matrix factorization, which will be abbreviated as RRMF in the sequel for convenience. To learn the parameters of RRMF, we propose a linear-time algorithm, which makes RRMF suitable for large-scale problems. Furthermore, the learning procedure of RRMF can be proved to be convergent. Experimental results on a data set with Type II links demonstrate that RRMF can dramatically outperform LCMF and other state-of-the-art methods.

Although RRMF is motivated by the need to model Type II links, it can also be used for applications with Type I links by adding a simple step to preprocess the link structure of the data. This will be discussed in detail in the following sections.

2 Relation Regularized Matrix Factorization

2.1 Notations

We use boldface uppercase letters, such as K, to denote matrices, and boldface lowercase letters, such as z, to denote vectors. The ith row and the jth column of a matrix K are denoted as Ki∗ and K∗j, respectively. Kij denotes the element at the ith row and jth column of K, and zi denotes the ith element of z. X denotes the content matrix of size n × m, with n being the number of entities and m the number of features. A is the adjacency matrix of the n entities. D denotes the number of latent factors. I denotes the identity matrix, whose dimensionality depends on the context. For a matrix K, K ⪰ 0 means that K is positive semi-definite (psd) and K ≻ 0 means that K is positive definite (pd).

2.2 Model Formulation

As in LSI [Deerwester et al., 1990], we adopt a similar MF method to approximate the content matrix:
$$
\min_{\mathbf{U},\mathbf{V}} \; \frac{1}{2}\|\mathbf{X}-\mathbf{U}\mathbf{V}^T\|^2 + \frac{\alpha}{2}\left(\|\mathbf{U}\|^2+\|\mathbf{V}\|^2\right), \qquad (1)
$$
where we use an n × D matrix U and an m × D matrix V to denote the latent D-dimensional representations of all the documents (or entities in a more general case) and words (or features in a more general case), respectively, with Ui∗ for document i and Vj∗ for word j. Here α is a hyperparameter.
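To make the objective in (1) concrete, the following minimal NumPy sketch (our own illustration, not code from the paper) evaluates the content reconstruction loss and its gradients with respect to U and V; matrix shapes follow the notation above.

```python
import numpy as np

def content_mf_loss(X, U, V, alpha):
    """Objective (1): 0.5*||X - U V^T||_F^2 + 0.5*alpha*(||U||_F^2 + ||V||_F^2)."""
    R = X - U @ V.T                       # n x m residual
    loss = 0.5 * np.sum(R ** 2) + 0.5 * alpha * (np.sum(U ** 2) + np.sum(V ** 2))
    grad_U = -R @ V + alpha * U           # d loss / d U  (n x D)
    grad_V = -R.T @ U + alpha * V         # d loss / d V  (m x D)
    return loss, grad_U, grad_V

# Tiny example: 5 documents, 8 words, 3 latent factors.
rng = np.random.default_rng(0)
X = rng.random((5, 8))
U, V = rng.standard_normal((5, 3)), rng.standard_normal((8, 3))
loss, gU, gV = content_mf_loss(X, U, V, alpha=1.0)
```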
To integrate the relation (link) information into the MF procedure, we use the relations among entities to regularize the latent factors. The basic idea of our method is to make the latent representations of two entities as close as possible if there exists a relation between them. We can achieve this goal by minimizing the following objective function:
$$
l = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} A_{ij}\|\mathbf{U}_{i*}-\mathbf{U}_{j*}\|^2
  = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\Big[A_{ij}\sum_{d=1}^{D}(U_{id}-U_{jd})^2\Big]
  = \frac{1}{2}\sum_{d=1}^{D}\sum_{i=1}^{n}\sum_{j=1}^{n} A_{ij}(U_{id}-U_{jd})^2
  = \sum_{d=1}^{D}\mathbf{U}_{*d}^{T}\mathbf{L}\mathbf{U}_{*d}
  = \mathrm{tr}(\mathbf{U}^{T}\mathbf{L}\mathbf{U}), \qquad (2)
$$
where Aij = 1 if there is a relation between entities i and j, and Aij = 0 otherwise, and L = D − A is known as the Laplacian matrix [Chung, 1997], with D being a diagonal matrix whose diagonal elements are Dii = Σj Aij, and tr(·) denotes the trace of a matrix.

Combining (1) and (2), we obtain the following objective function which we seek to minimize during learning:
$$
f = \frac{1}{2}\|\mathbf{X}-\mathbf{U}\mathbf{V}^T\|^2 + \frac{\alpha}{2}\left(\|\mathbf{U}\|^2+\|\mathbf{V}\|^2\right) + \frac{\beta}{2}\mathrm{tr}(\mathbf{U}^T\mathbf{L}\mathbf{U})
  = \frac{1}{2}\|\mathbf{X}-\mathbf{U}\mathbf{V}^T\|^2 + \frac{1}{2}\mathrm{tr}\big[\mathbf{U}^T(\alpha\mathbf{I}+\beta\mathbf{L})\mathbf{U}\big] + \frac{\alpha}{2}\mathrm{tr}(\mathbf{V}\mathbf{V}^T), \qquad (3)
$$
which seamlessly integrates the content information and the relation information into a principled framework. Because we use relation information to regularize the content MF procedure, we call the model in (3) relation regularized matrix factorization, abbreviated as RRMF.

Remark 1 The normalized Laplacian matrix [Chung, 1997], given by L̃ = D^{-1/2} L D^{-1/2} = I − D^{-1/2} A D^{-1/2}, can be used to substitute L in (2) and (3) for regularization. If L̃ is adopted, the objective function in (2) becomes
$$
\tilde{l} = \frac{1}{2}\sum_{d=1}^{D}\Bigg[\sum_{i=1}^{n}\sum_{j=1}^{n} A_{ij}\Big(\frac{U_{id}}{\sqrt{D_{ii}}}-\frac{U_{jd}}{\sqrt{D_{jj}}}\Big)^2\Bigg].
$$
The choice between L and L̃ depends on the application. In this paper, the lemmas and theorems are derived based on L, but they also hold for L̃ with slight changes in the derivations.

From (3), it is easy to see that the model assumption behind RRMF is in line with the semantics of Type II links. Hence, RRMF can be expected to achieve good performance for applications with Type II links, such as research paper classification, which will be verified by experiments in Section 3.

On the other hand, RRMF can also be used for applications with Type I links. All we need to do is preprocess the link structure in the data with a very simple strategy; the model and learning algorithms need not be changed. According to the semantics of Type I links, two entities linking to or linked by one common entity will likely be from the same class. One simple way to preprocess the link structure is therefore to artificially add a link between two entities if they link to or are linked by a common entity. The added links will then satisfy the semantics of Type II links. For example, in Figure 1(a), after preprocessing, V2 and V3 will be connected and V6 and V7 will also be connected. Learning RRMF based on these added links is now in line with the underlying semantics. From our experiments on the WebKB data set, we find that RRMF can still achieve performance comparable with LCMF just by adopting the simple link preprocessing strategy described above.
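As a sanity check on (2) and (3), the short NumPy sketch below (our own illustration, not code from the paper) builds the Laplacian L = D - A from a symmetric 0/1 adjacency matrix, verifies numerically that tr(UᵀLU) equals the pairwise penalty in (2), and evaluates the full RRMF objective.

```python
import numpy as np

def rrmf_objective(X, A, U, V, alpha, beta):
    """Objective (3): content loss + (alpha I + beta L) regularizer on U + alpha/2 ||V||^2."""
    L = np.diag(A.sum(axis=1)) - A                     # Laplacian L = D - A
    content = 0.5 * np.sum((X - U @ V.T) ** 2)
    reg_U = 0.5 * np.trace(U.T @ (alpha * np.eye(len(A)) + beta * L) @ U)
    reg_V = 0.5 * alpha * np.sum(V ** 2)
    return content + reg_U + reg_V

rng = np.random.default_rng(0)
n, m, D = 6, 10, 3
X = rng.random((n, m))
A = np.zeros((n, n))
A[0, 1] = A[1, 0] = A[2, 3] = A[3, 2] = A[1, 4] = A[4, 1] = 1.0   # toy symmetric links
U, V = rng.standard_normal((n, D)), rng.standard_normal((m, D))

# Identity in (2): tr(U^T L U) == 0.5 * sum_ij A_ij ||U_i* - U_j*||^2.
L = np.diag(A.sum(axis=1)) - A
lhs = np.trace(U.T @ L @ U)
rhs = 0.5 * sum(A[i, j] * np.sum((U[i] - U[j]) ** 2) for i in range(n) for j in range(n))
assert np.isclose(lhs, rhs)

print(rrmf_objective(X, A, U, V, alpha=1.0, beta=10.0))
```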
2.3 Learning

In this subsection, we first prove that the objective function in (3) is convex with respect to (w.r.t.) any one of its parameters, U and V. Then, we propose an alternating projection algorithm to learn the parameters in linear time and show that it is guaranteed to converge to a local optimum.

Convexity of the Objective Function

Lemma 1 The Laplacian matrix L is psd, i.e., L ⪰ 0.

Proof: For any n × 1 vector x ≠ 0,
$$
\mathbf{x}^T\mathbf{L}\mathbf{x} = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} A_{ij}(x_i - x_j)^2 \geq 0. \qquad \blacksquare
$$

Lemma 2 $\sum_{j=1}^{m}\mathbf{V}_{j*}^T\mathbf{V}_{j*} \succeq 0$.

Proof: For any D × 1 vector z ≠ 0,
$$
\mathbf{z}^T\Big(\sum_{j=1}^{m}\mathbf{V}_{j*}^T\mathbf{V}_{j*}\Big)\mathbf{z} = \sum_{j=1}^{m}\mathbf{z}^T\mathbf{V}_{j*}^T\mathbf{V}_{j*}\mathbf{z} = \sum_{j=1}^{m}(\mathbf{V}_{j*}\mathbf{z})^2 \geq 0. \qquad \blacksquare
$$

Theorem 1 f is convex w.r.t. U.

Proof: We first rewrite f as follows:
$$
f = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{m}\big(X_{ij}-\mathbf{U}_{i*}\mathbf{V}_{j*}^T\big)^2 + \frac{1}{2}\sum_{d=1}^{D}\mathbf{U}_{*d}^T(\alpha\mathbf{I}+\beta\mathbf{L})\mathbf{U}_{*d} + C_1,
$$
where $C_1$ is a constant independent of U. Let
$$
g = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{m}\big(X_{ij}-\mathbf{U}_{i*}\mathbf{V}_{j*}^T\big)^2
  = C_2 + \frac{1}{2}\sum_{i=1}^{n}\Big[\mathbf{U}_{i*}\Big(\sum_{j=1}^{m}\mathbf{V}_{j*}^T\mathbf{V}_{j*}\Big)\mathbf{U}_{i*}^T - 2\Big(\sum_{j=1}^{m}X_{ij}\mathbf{V}_{j*}\Big)\mathbf{U}_{i*}^T\Big],
$$
where $C_2$ is a constant independent of U. It is easy to see that $\mathbf{G}_i \triangleq \frac{\partial^2 g}{\partial \mathbf{U}_{i*}^T\partial \mathbf{U}_{i*}} = \sum_{j=1}^{m}\mathbf{V}_{j*}^T\mathbf{V}_{j*}$, and $\frac{\partial^2 g}{\partial \mathbf{U}_{i*}^T\partial \mathbf{U}_{k*}} = \mathbf{0}$ if $i \neq k$. If we use $\mathbf{u} = (\mathbf{U}_{1*}, \mathbf{U}_{2*}, \ldots, \mathbf{U}_{n*})^T$ to denote the vector representation of U, the second-order derivative (Hessian) of g w.r.t. $\mathbf{u}$ will be a block-diagonal matrix: $\mathbf{G}_u \triangleq \frac{\partial^2 g}{\partial\mathbf{u}\partial\mathbf{u}^T} = \mathrm{diag}[\mathbf{G}_1, \mathbf{G}_2, \ldots, \mathbf{G}_n]$. According to Lemma 2, $\det(\mathbf{G}_i) \geq 0$, where det(·) denotes the determinant of a matrix. Because $\det(\mathbf{G}_u) = \prod_{i=1}^{n}\det(\mathbf{G}_i) \geq 0$, we can conclude that g is convex w.r.t. U.

Let $h = \frac{1}{2}\sum_{d=1}^{D}\mathbf{U}_{*d}^T(\alpha\mathbf{I}+\beta\mathbf{L})\mathbf{U}_{*d}$. If we use $\mathbf{a} = (\mathbf{U}_{*1}^T, \mathbf{U}_{*2}^T, \ldots, \mathbf{U}_{*D}^T)^T$ to denote another vector representation of U, the Hessian of h w.r.t. $\mathbf{a}$ will also be a block-diagonal matrix: $\mathbf{H}_a \triangleq \frac{\partial^2 h}{\partial\mathbf{a}\partial\mathbf{a}^T} = \mathrm{diag}[\mathbf{H}_1, \mathbf{H}_2, \ldots, \mathbf{H}_D]$, where $\mathbf{H}_d \triangleq \frac{\partial^2 h}{\partial\mathbf{U}_{*d}\partial\mathbf{U}_{*d}^T} = \alpha\mathbf{I} + \beta\mathbf{L}$. According to Lemma 1, we can conclude that $\det(\mathbf{H}_a) > 0$, and hence h is convex w.r.t. U.

Hence, $f = g + h + C_1$ is convex w.r.t. U. ∎

Theorem 2 f is convex w.r.t. V.

Proof: We first rewrite f as follows:
$$
f = C_3 + \frac{1}{2}\sum_{j=1}^{m}\Big[\mathbf{V}_{j*}\Big(\sum_{i=1}^{n}\mathbf{U}_{i*}^T\mathbf{U}_{i*} + \alpha\mathbf{I}\Big)\mathbf{V}_{j*}^T - 2\Big(\sum_{i=1}^{n}X_{ij}\mathbf{U}_{i*}\Big)\mathbf{V}_{j*}^T\Big],
$$
where $C_3$ is a constant independent of V. If we use $\mathbf{v} = (\mathbf{V}_{1*}, \mathbf{V}_{2*}, \ldots, \mathbf{V}_{m*})^T$
to denote the vector representation of V, the Hessian of f w.r.t. $\mathbf{v}$ will be a block-diagonal matrix: $\mathbf{K}_v \triangleq \frac{\partial^2 f}{\partial\mathbf{v}\partial\mathbf{v}^T} = \mathrm{diag}[\mathbf{K}_1, \mathbf{K}_2, \ldots, \mathbf{K}_m]$, where $\mathbf{K}_j = \sum_{i=1}^{n}\mathbf{U}_{i*}^T\mathbf{U}_{i*} + \alpha\mathbf{I}$. It is easy to verify that $\mathbf{K}_j \succ 0$ and $\mathbf{K}_v \succ 0$. Hence, f is convex w.r.t. V. ∎

Learning Strategy

We adopt an alternating projection method to learn the parameters U and V. More specifically, each time we fix one parameter and then update the other one. This procedure is repeated for several iterations until some termination condition is satisfied. We will prove that the learning algorithm is convergent.

Learning U
According to Theorem 1, one straightforward way to learn U is to set the gradient of f w.r.t. U to 0 and solve the corresponding linear system. However, this is computationally demanding if n is large because we would have to invert a Hessian matrix of size as large as nD × nD. In this paper, we adopt an alternative strategy to perform the optimization of U, which is to optimize one column U∗d at a time with the other columns fixed. Because f is convex w.r.t. U, this strategy is guaranteed to converge to the optimal solution; in particular, f is also convex w.r.t. U∗d with all other variables fixed. Hence, computing the gradient of f w.r.t. U∗d and setting it to 0, we can optimize U∗d by solving the following linear system:
$$
\mathbf{F}^{(d)}\mathbf{U}_{*d} = \mathbf{e}^{(d)}, \qquad (4)
$$
where $\mathbf{F}^{(d)} = \mathbf{E}^{(d)} + \alpha\mathbf{I} + \beta\mathbf{L}$, $\mathbf{E}^{(d)} = \mathrm{diag}\big(\frac{\partial^2 g}{\partial U_{1d}\partial U_{1d}}, \frac{\partial^2 g}{\partial U_{2d}\partial U_{2d}}, \ldots, \frac{\partial^2 g}{\partial U_{nd}\partial U_{nd}}\big)$ with $\frac{\partial^2 g}{\partial U_{id}\partial U_{id}} = \sum_{j=1}^{m} V_{jd}^2$, and $\mathbf{e}^{(d)} = (e^{(d)}_1, e^{(d)}_2, \ldots, e^{(d)}_n)^T$ with $e^{(d)}_i = \sum_{j=1}^{m} V_{jd}\big(X_{ij} - \mathbf{U}_{i*}\mathbf{V}_{j*}^T + U_{id}V_{jd}\big)$.

One direct way to solve the linear system in (4) is to set $\mathbf{U}_{*d} = [\mathbf{F}^{(d)}]^{-1}\mathbf{e}^{(d)}$. However, the computation cost is O(n³), which is computationally prohibitive for general text classification applications because n is always very large. Here, we propose to use the steepest descent method [Shewchuk, 1994] to iteratively update U∗d:
$$
\mathbf{r}(t) = \mathbf{e}^{(d)} - \mathbf{F}^{(d)}\mathbf{U}_{*d}(t), \qquad
\delta(t) = \frac{\mathbf{r}(t)^T\mathbf{r}(t)}{\mathbf{r}(t)^T\mathbf{F}^{(d)}\mathbf{r}(t)}, \qquad
\mathbf{U}_{*d}(t+1) = \mathbf{U}_{*d}(t) + \delta(t)\mathbf{r}(t).
$$
Steepest descent guarantees convergence to the global minimum of this convex subproblem, with the objective function value decreasing in each iteration. Suppose the number of nonzero elements in L is M. The computation cost of each iteration is then O(n + M). If the number of iterations is K, the time complexity will be O(K(n + M)). From our experiments, good performance can be achieved with a small value of K, such as K = 10 in our following experiments. Furthermore, M is typically a constant multiple of n. Hence, the overall complexity is roughly O(n), which is dramatically less than O(n³).

Learning V
One very nice property of V is that the Hessian $\mathbf{K}_v$ is block-diagonal, where each nonzero block corresponds to one row of V. Since the inverse of a block-diagonal matrix is itself block-diagonal, with each block being the inverse of the corresponding block, the update of the whole matrix V can naturally be decomposed into the update of each row Vj∗. Unlike in the learning of U, the Hessian matrices for different rows of V are all equal, i.e., $\mathbf{K}_j = \mathbf{K}_k = \mathbf{K} = \sum_{i=1}^{n}\mathbf{U}_{i*}^T\mathbf{U}_{i*} + \alpha\mathbf{I}$. Furthermore, K is a small matrix of size D × D, where D is typically small and is no larger than 50 in our experiments. Hence, we can directly update Vj∗ as follows: $\mathbf{V}_{j*} = \big(\sum_{i=1}^{n} X_{ij}\mathbf{U}_{i*}\big)\mathbf{K}^{-1}$.
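The overall procedure can be summarized in a short NumPy sketch (our own illustration of the updates above, not the authors' code). For clarity it uses dense matrices and the unnormalized Laplacian, whereas the linear-time complexity claimed by the paper relies on exploiting the sparsity of L.

```python
import numpy as np

def rrmf_fit(X, A, D=10, alpha=1.0, beta=10.0, T=5, K=10, seed=0):
    """Alternating RRMF updates: steepest descent on each column U[:, d], closed-form rows of V."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    L = np.diag(A.sum(axis=1)) - A                        # graph Laplacian L = D - A
    U = 0.1 * rng.standard_normal((n, D))
    V = 0.1 * rng.standard_normal((m, D))
    for _ in range(T):                                    # outer alternating iterations
        # Update U one column at a time by solving F_d U[:, d] = e_d with steepest descent.
        for d in range(D):
            s_d = V[:, d] @ V[:, d]                       # sum_j V_{jd}^2
            F_d = (s_d + alpha) * np.eye(n) + beta * L
            R = X - U @ V.T                               # current residual
            e_d = R @ V[:, d] + U[:, d] * s_d
            u = U[:, d]
            for _ in range(K):                            # inner steepest-descent iterations
                r = e_d - F_d @ u
                denom = r @ F_d @ r
                if denom <= 1e-12:
                    break
                u = u + (r @ r) / denom * r
            U[:, d] = u
        # Update V: all rows share the same D x D Hessian K = U^T U + alpha I.
        K_mat = U.T @ U + alpha * np.eye(D)
        V = np.linalg.solve(K_mat, (X.T @ U).T).T         # V_{j*} = (sum_i X_{ij} U_{i*}) K^{-1}
    return U, V
```

The default values T = 5 and K = 10 follow the iteration counts reported in the paper; all other defaults are placeholders.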
2.4 Convergence and Complexity Analysis

Theorem 3 The learning algorithm will converge.

Proof: (Sketch) In each iteration, the learning algorithm ensures that the objective function value in (3) always decreases. Furthermore, the objective function is bounded below by 0. Hence, the learning algorithm will converge. Because f is not jointly convex w.r.t. U and V, the solution is a local optimum. ∎

We have seen that in each iteration, the time required for updating U is O(n), and it is easy to see that the time complexity for updating V is also O(n). In our experiments, the algorithm typically converges in fewer than 50 iterations; in fact, we find that 5 iterations are sufficient to achieve good performance. Hence, the overall time complexity of the learning algorithm is O(n).

3 Experiments

3.1 Data Sets and Evaluation Scheme

We use the same data sets, WebKB [Craven et al., 1998] and Cora [McCallum et al., 2000], with the same bag-of-words representation and the same original link structure as in [Zhu et al., 2007] to evaluate our method. The WebKB data set contains about 6,000 web pages collected from the web sites of the computer science departments of four universities (Cornell, Texas, Washington, and Wisconsin). Each web page is labeled with one of seven categories: student, professor, course, project, staff, department, and "other". It should be noted that to train our RRMF method we adopt the simple preprocessing strategy for the link structure in this data set: if two web pages are co-linked by another common web page, we add a link between these two pages. After preprocessing, all the directed links are converted into undirected links. The characteristics of the WebKB data set are briefly summarized in Table 1.

Table 1: Characteristics of the WebKB data set.

                #classes   #entities   #terms
  Cornell           7          827      4,134
  Texas             7          814      4,029
  Washington        7        1,166      4,165
  Wisconsin         6        1,210      4,189
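To make the Type I preprocessing step concrete, here is a small NumPy sketch based on our own reading of the description above (it is not the authors' code): starting from a directed 0/1 adjacency matrix, it connects any two entities that link to or are linked by a common entity, and returns a symmetric adjacency matrix. Whether the original directed links are also kept (symmetrized) is our assumption, controlled by a flag.

```python
import numpy as np

def preprocess_links(A_directed, keep_original=True):
    """Type I -> Type II link preprocessing sketch.

    A_directed[i, j] = 1 means entity i links to entity j. Entities sharing a common
    out-neighbor (A A^T) or a common in-neighbor (A^T A) get connected; the result is
    a symmetric 0/1 adjacency matrix with an empty diagonal.
    """
    A = np.asarray(A_directed, dtype=float)
    co = A @ A.T + A.T @ A                  # shared out-neighbors + shared in-neighbors
    if keep_original:
        co = co + A + A.T                   # assumption: also keep original links, made undirected
    A_new = (co > 0).astype(float)
    np.fill_diagonal(A_new, 0.0)            # no self-links
    return A_new

# e.g., two pages that are co-linked by a common page become directly connected.
```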
The Cora data set contains the abstracts and references of about 34,000 research papers from the computer science community. For fair comparison, we adopt the same subset of the data as in [Zhu et al., 2007] to test our method. The task is to classify each paper into one of the subfields of data structure (DS), hardware and architecture (HA), machine learning (ML), and programming language (PL). The characteristics of the Cora data set are summarized in Table 2.

Table 2: Characteristics of the Cora data set.

        #classes   #entities   #terms
  DS        9          751      6,234
  HA        7          400      3,989
  ML        7        1,617      8,329
  PL        9        1,575      7,949

As in [Zhu et al., 2007], we adopt 5-fold cross validation to evaluate our method. More specifically, we randomly split the data into five folds (subsets) and then repeat the test five times, each time using one fold for testing and the other four folds for training. As in [Zhu et al., 2007], the number of factors D is set to 50 for RRMF. The α in (3) is fixed to 1, and β is specified by cross-validation on the training data. After obtaining the factors, a linear support vector machine (SVM) is trained for classification based on the low-dimensional representation, which is the same test procedure as for the methods in [Zhu et al., 2007]. The average classification accuracies and the standard deviations over the five repeats are adopted as the performance metric.

3.2 Baselines

The methods adopted for our comparative study belong to two classes. The first class contains the baselines used in [Zhu et al., 2007]:

• SVM on content: This method ignores the link structure in the data, and applies SVM only on the content information in the original bag-of-words representation.
• SVM on links: This method ignores the content information, and treats links as the features, i.e., the ith feature is link-to-page_i.
• SVM on link-content: The content features and link features of the two methods above are combined to give the feature representation.
• Directed graph regularization: This is the method introduced in [Zhou et al., 2005], which is based only on the link structure.
• PLSI+PHITS: This method, described in [Cohn and Hofmann, 2000], combines text content information and link structure for analysis.

The second class contains some methods that are most closely related to RRMF:

• PCA: This method first applies principal component analysis (PCA) on the content information to get a low-dimensional representation, based on which an SVM is trained for classification. Because we use PCA to initialize U and V in RRMF, this method serves to demonstrate whether the good performance of RRMF comes from a good initialization value or from the learning procedure of RRMF.
• MMMF: This method applies MF only on the content information, which is a special case of (3) obtained by setting β = 0. It is used to show that the relation information does help a lot in relational data analysis.
• Link-content MF: This is the joint link-content MF method in [Zhu et al., 2007].
• Link-content sup. MF: This is the supervised counterpart of link-content MF, introduced in [Zhu et al., 2007], which uses the data labels to guide the MF procedure.

3.3 Convergence Speed

We use the HA data set to illustrate the convergence speed of RRMF. The objective function values against the iteration number T are plotted in Figure 2(a), from which we can see that RRMF converges very fast. The average classification accuracy of the 5-fold cross validation against T is shown in Figure 2(b). We can see that even though the initial value (corresponding to T = 0) is not satisfactory (accuracy = 65.5%), RRMF can still achieve very promising and stable performance without requiring many iterations. Because we can achieve promising performance when T ≥ 5, we set T = 5 in all our following experiments.
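For concreteness, the evaluation protocol of Section 3.1 can be sketched as follows. This is our own illustration, not the authors' code: `rrmf_fit` refers to the learning sketch given earlier, the scikit-learn components are assumptions rather than part of the paper, and the nested cross-validation used to choose β is omitted.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

def evaluate_rrmf(X, A, y, D=50, alpha=1.0, beta=10.0):
    """5-fold CV: learn factors U for all entities, then train a linear SVM on the training folds."""
    U, _ = rrmf_fit(X, A, D=D, alpha=alpha, beta=beta)   # low-dimensional entity representation
    accs = []
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in cv.split(U, y):
        clf = LinearSVC().fit(U[train_idx], y[train_idx])
        accs.append(clf.score(U[test_idx], y[test_idx]))
    return np.mean(accs), np.std(accs)
```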
[Figure 2: Convergence properties of RRMF on the HA data set. (a) Objective function value against the iteration number T. (b) Classification accuracy against T.]

3.4 Performance

The Cora data set satisfies the model assumption of RRMF but it does not satisfy the model assumption of link-content MF. We first test RRMF on Cora to verify that when the model assumption is satisfied, RRMF can dramatically outperform link-content MF and other methods. The average classification accuracies with standard deviations are shown in Figure 3, from which we can see that RRMF does outperform the other methods dramatically. Even though the link-content sup. MF method uses label information for MF, RRMF can still give a much better result, showing that it is indeed very effective. Comparing RRMF to PCA, we can see that the good performance of RRMF does not come from a good initialization but from the learning algorithm itself. Comparing RRMF to MMMF, we can see that the relational information is very useful. Comparing RRMF to directed graph regularization, we can see that the content information also helps classification greatly.

We also test RRMF on the WebKB data set. Here, we adopt the normalized Laplacian. The performance is shown in Figure 4. It should be noted that the WebKB data set does not satisfy the model assumption of RRMF. However, with simple preprocessing, RRMF can still achieve performance comparable to the link-content MF method and its supervised counterpart, and outperform the other methods.
[Figure 3: Average classification accuracies with standard deviations on the Cora data set (DS, HA, ML, PL), comparing SVM on content, SVM on links, SVM on link-content, directed graph regularization, PLSI+PHITS, PCA, MMMF, link-content MF, link-content sup. MF, and RRMF.]

[Figure 4: Average classification accuracies with standard deviations on the WebKB data set (Cornell, Texas, Washington, Wisconsin), comparing the same ten methods.]

3.5 Sensitivity to Parameters

We have illustrated the effect of the number of iterations in Section 3.3. Here, we examine the sensitivity of RRMF to β and the number of factors D. We again use the HA data set for illustration. Figure 5 shows the accuracy of RRMF when β and D take different values. From Figure 5(a), we can see that the smaller the β, the worse the performance. A larger β means that the relational information plays a more significant role in the learning procedure. Hence, we can conclude that the relational information is very important. When β exceeds some value (around 30), the performance becomes very stable, which means that RRMF is not sensitive to β. From Figure 5(b), we can see that as D increases, the accuracy also increases. But a larger D will incur higher computation cost. Hence, in real applications, we need to consider the tradeoff between computation cost and accuracy.

[Figure 5: Sensitivity to parameters of RRMF on the HA data set. (a) Effect of β on accuracy. (b) Effect of D on accuracy.]

4 Conclusion and Future Work

In this paper, we propose a novel MF framework, called RRMF, to model relational data with Type II links. One promising property of RRMF is that it can also be used to model data with Type I links just by the inclusion of a very simple preprocessing step. Experimental results verify that RRMF can dramatically outperform other methods on data with Type II links, and can achieve performance comparable with state-of-the-art methods on data with Type I links. Another attractive property of RRMF is that its training time is linear in the number of training entities, which makes it scalable to large-scale problems. Moreover, the learning algorithm of RRMF is guaranteed to be convergent and is very stable.

Incorporating label information into RRMF will be pursued in our future work. Furthermore, although the collective classification methods [Sen et al., 2008], latent Wishart processes (LWP) [Li et al., 2009] and the relational topic model (RTM) [Chang and Blei, 2009] are not closely related to RRMF, performing an experimental comparison with them will be very interesting and will help to reveal the relationships between several currently disparate approaches to relational learning.

Acknowledgements

We thank Shenghuo Zhu for the preprocessed data sets. This research has been supported by General Research Fund 621407 from the Research Grants Council of Hong Kong.

References

[Chang and Blei, 2009] Jonathan Chang and David M. Blei. Relational topic models for document networks. In AISTATS, 2009.

[Chung, 1997] Fan Chung. Spectral Graph Theory. Number 92 in Regional Conference Series in Mathematics. American Mathematical Society, 1997.

[Cohn and Hofmann, 2000] David A. Cohn and Thomas Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In NIPS, 2000.
[Craven et al., 1998] Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew McCallum, Tom M. Mitchell, Kamal Nigam, and Sean Slattery. Learning to extract symbolic knowledge from the world wide web. In AAAI/IAAI, pages 509–516, 1998.

[Deerwester et al., 1990] Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. Indexing by latent semantic analysis. JASIS, 41(6), 1990.

[Kurland and Lee, 2005] Oren Kurland and Lillian Lee. Pagerank without hyperlinks: structural re-ranking using links induced by language models. In SIGIR, 2005.

[Li et al., 2009] Wu-Jun Li, Zhihua Zhang, and Dit-Yan Yeung. Latent Wishart processes for relational kernel learning. In AISTATS, 2009.

[McCallum et al., 2000] Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction of internet portals with machine learning. Inf. Retr., 3(2):127–163, 2000.

[Rennie and Srebro, 2005] Jason D. M. Rennie and Nathan Srebro. Fast maximum margin matrix factorization for collaborative prediction. In ICML, 2005.

[Sen et al., 2008] Prithviraj Sen, Galileo Mark Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3), 2008.

[Shewchuk, 1994] Jonathan R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain. Technical report, Pittsburgh, PA, USA, 1994.

[Singh and Gordon, 2008] Ajit Paul Singh and Geoffrey J. Gordon. A unified view of matrix factorization models. In ECML/PKDD (2), 2008.

[Srebro et al., 2004] Nathan Srebro, Jason D. M. Rennie, and Tommi Jaakkola. Maximum-margin matrix factorization. In NIPS, 2004.

[Zhou et al., 2005] Dengyong Zhou, Jiayuan Huang, and Bernhard Schölkopf. Learning from labeled and unlabeled data on a directed graph. In ICML, 2005.

[Zhu et al., 2007] Shenghuo Zhu, Kai Yu, Yun Chi, and Yihong Gong. Combining content and link for classification using matrix factorization. In SIGIR, 2007.