Sparse Probabilistic Relational Projection

Wu-Jun Li† and Dit-Yan Yeung‡
† Shanghai Key Laboratory of Scalable Computing and Systems, Department of Computer Science and Engineering, Shanghai Jiao Tong University, China
‡ Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China
liwujun@cs.sjtu.edu.cn, dyyeung@cse.ust.hk

Abstract

Probabilistic relational PCA (PRPCA) can learn a projection matrix to perform dimensionality reduction for relational data. However, the results learned by PRPCA lack interpretability because each principal component is a linear combination of all the original variables. In this paper, we propose a novel model, called sparse probabilistic relational projection (SPRP), to learn a sparse projection matrix for relational dimensionality reduction. The sparsity in SPRP is achieved by imposing on the projection matrix a sparsity-inducing prior such as the Laplace prior or the Jeffreys prior. We propose an expectation-maximization (EM) algorithm to learn the parameters of SPRP. Compared with PRPCA, the sparsity in SPRP not only makes the results more interpretable but also makes the projection operation much more efficient without compromising its accuracy. All these claims are verified by experiments conducted on several real applications.

Introduction

Principal component analysis (PCA) (Jolliffe 2002) and probabilistic PCA (PPCA) (Tipping and Bishop 1999) are very popular dimensionality reduction methods which have been widely used to explore the structure of a high-dimensional data set by mapping the data set into a low-dimensional space via a projection (or transformation) matrix. However, it is difficult to interpret the results of PCA and PPCA because each principal component is a linear combination of all the original variables.

To achieve interpretability, some sparse versions of PCA or PPCA have been proposed by forcing many entries of the projection matrix to zero. Sparse PCA (SPCA) (Zou, Hastie, and Tibshirani 2006) first reformulates PCA as a regression-type optimization problem and then applies the elastic net (Zou and Hastie 2005) constraint on the regression coefficients to get a sparse projection matrix. In (Sigg and Buhmann 2008), sparsity is achieved by putting a 1-norm (L1) constraint on the projection matrix during the expectation-maximization (EM) (Dempster, Laird, and Rubin 1977) learning procedure of PPCA. In (Archambeau and Bach 2008) and (Guan and Dy 2009), sparse versions of PPCA are proposed by putting sparsity-inducing priors, such as the Jeffreys prior, on the projection matrix.

All the variants of PCA and sparse PCA mentioned above assume that the instances are independent and identically distributed (i.i.d.). Hence, they are not suitable for modeling relational data (Getoor and Taskar 2007; Li and Yeung 2009; Li, Zhang, and Yeung 2009; Li 2010; Li and Yeung 2011; Li, Yeung, and Zhang 2011), in which the instances are not i.i.d. In relational data, besides the content information (by which, as in (Li, Yeung, and Zhang 2009), we refer to the feature vectors describing the instances), there also exist links (i.e., relations) between the instances in the data. The attributes of the linked instances are often correlated rather than i.i.d. (Li, Yeung, and Zhang 2009). One typical example of relational data is a collection of research papers which contain both paper content and citations between the papers.
The existence of a citation relation between two papers often implies that they are about the same topic. In (Li, Yeung, and Zhang 2009), probabilistic relational PCA (PRPCA), which extends PPCA by eliminating the i.i.d. assumption, is proposed to perform dimensionality reduction for relational data. By explicitly modeling the covariance between instances, PRPCA dramatically outperforms PCA and PPCA. However, as in PCA and PPCA, the results learned by PRPCA also lack interpretability.

In this paper, we propose a novel model, called sparse probabilistic relational projection (SPRP), to learn a sparse projection matrix for relational dimensionality reduction. Compared with PRPCA, the sparsity in SPRP not only makes the results more interpretable but also makes the projection operation much more efficient without compromising its accuracy.

Notation

For the convenience of presentation and comparison, we adopt the same notation as that in (Li, Yeung, and Zhang 2009). More specifically, we use boldface lowercase letters, such as v, to denote vectors and v_i to denote the ith element of v. Boldface uppercase letters, such as F, are used to denote matrices, with the ith row and the jth column of F denoted by F_{i*} and F_{*j}, respectively. F_{ij} is the element at the ith row and jth column of F. We use |F| to denote the determinant of a matrix F, tr(F) to denote its trace, F^T for its transpose, and F^{-1} for its inverse.
F ⪰ 0 means that F is positive semi-definite (psd) and F ≻ 0 means that F is positive definite (pd). P ⊗ Q denotes the Kronecker product (Gupta and Nagar 2000) of P and Q. I_n is the identity matrix of size n × n and e is a vector of 1's whose dimensionality depends on the context. N(·) is overloaded for both multivariate normal distributions and matrix variate normal distributions (Gupta and Nagar 2000). We use cov(·) to denote the covariance operation and ⟨·⟩ to denote the expectation operation. The operation diag(v) converts the vector v into a diagonal matrix in which the ith diagonal entry is v_i.

As in (Tipping and Bishop 1999) and (Li, Yeung, and Zhang 2009), {t_n}_{n=1}^N denotes a set of observed d-dimensional data (content) vectors, µ denotes the data sample mean, the d × q matrix W denotes the q principal axes (often called factor loadings or the projection matrix), and x_n = W^T(t_n − µ) denotes the corresponding q principal components (or latent variables) of t_n. We further use the d × N matrix T to denote the content matrix with T_{*n} = t_n and the q × N matrix X to denote the latent variables of T with X_{*n} = W^T(t_n − µ). As in (Li, Yeung, and Zhang 2009), we assume that the links are undirected. For data with directed links, we first convert the directed links into undirected links, which can keep the original physical meaning of the links (Li, Yeung, and Zhang 2009). The adjacency (link) matrix of the N instances is denoted by A: A_{ij} = 1 if there exists a link between instances i and j, and A_{ij} = 0 otherwise. Moreover, A_{ii} = 0, which means that there exist no self-links.

Probabilistic Relational PCA

With matrix variate normal distributions (Gupta and Nagar 2000), the generative model of PRPCA (Li, Yeung, and Zhang 2009) is defined as:

  Υ | Θ ∼ N_{d,N}(0, σ² I_d ⊗ Φ),
  X | Θ ∼ N_{q,N}(0, I_q ⊗ Φ),
  T = WX + µe^T + Υ,

where Θ = {µ, W, σ²} denotes the set of parameters, Φ = ∆^{-1} and ∆ ≜ γI_N + (I_N + A)^T(I_N + A), with γ being typically a very small positive number to make ∆ ≻ 0. In (Li, Yeung, and Zhang 2009), Φ is called the relational covariance, which reflects the covariance between the instances. Then, we can get the following results:

  T | X, Θ ∼ N_{d,N}(WX + µe^T, σ² I_d ⊗ Φ),
  T | Θ ∼ N_{d,N}(µe^T, (WW^T + σ² I_d) ⊗ Φ).

Based on the generative process, Figure 1(a) shows the graphical model of PRPCA.

Figure 1: Graphical models of PRPCA and SPRP, in which T is the observation matrix, X and Z are the latent variable matrices, µ, W and σ² are the parameters to learn, λ is the hyperparameter, and the other quantities are kept constant.

If we set C = WW^T + σ² I_d, the log-likelihood of the observation matrix T in PRPCA is

  L = ln p(T | Θ) = −(N/2) [ln |C| + tr(C^{-1}H)] + c,

where c is a constant independent of the parameters Θ and H is defined as follows:

  H = (T − µe^T) ∆ (T − µe^T)^T / N.    (1)

In (Li, Yeung, and Zhang 2009), two maximum likelihood estimation (MLE) methods, one based on a closed-form solution and another based on EM, are proposed to learn the parameters of PRPCA.
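To make the relational covariance concrete, the following minimal NumPy sketch builds ∆ and Φ from an undirected adjacency matrix and evaluates H and the log-likelihood L up to the constant c. This is our own illustration of the formulas above, not the authors' implementation; all function and variable names (relational_covariance, prpca_loglik, gamma) are ours.

```python
import numpy as np

def relational_covariance(A, gamma=1e-6):
    """Delta = gamma*I_N + (I_N + A)^T (I_N + A) and Phi = Delta^{-1}."""
    N = A.shape[0]
    B = np.eye(N) + A                      # I_N + A (A symmetric, zero diagonal)
    Delta = gamma * np.eye(N) + B.T @ B    # positive definite for gamma > 0
    return Delta, np.linalg.inv(Delta)

def prpca_loglik(T, W, mu, sigma2, Delta):
    """ln p(T | Theta) for PRPCA, up to the additive constant c."""
    d, N = T.shape
    Tc = T - mu[:, None]                   # T - mu e^T
    H = Tc @ Delta @ Tc.T / N              # Eq. (1)
    C = W @ W.T + sigma2 * np.eye(d)       # C = W W^T + sigma^2 I_d
    _, logdetC = np.linalg.slogdet(C)
    return -0.5 * N * (logdetC + np.trace(np.linalg.solve(C, H)))
```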
Sparse Probabilistic Relational Projection

In this paper, we propose to put a sparsity-inducing prior on W to encourage many of its entries to go to zero. The resulting model is called sparse probabilistic relational projection (SPRP) due to its sparsity property. Although there exist many sparsity-inducing priors in the literature, e.g., (Figueiredo 2003; Caron and Doucet 2008; Archambeau and Bach 2008; Guan and Dy 2009), here we consider only two of them, the Laplace prior and the Jeffreys prior. Using other sparsity-inducing priors in SPRP is expected to follow the same principle and is left to future work.

SPRP with Laplace Prior

In SPCA (Zou, Hastie, and Tibshirani 2006), sparsity is achieved by putting an L1 regularization term on the projection matrix. Here we learn a sparse W for SPRP in a similar way. However, unlike SPCA, which is formulated from a non-probabilistic view, SPRP is based on a probabilistic formulation which can automatically learn the hyperparameters, while the non-probabilistic formulation cannot.

We adopt the Laplace (i.e., double-exponential) prior (Park and Casella 2008; Guan and Dy 2009) for W:

  p(W_{ij} | λ) = (√λ/2) exp{−√λ ||W_{ij}||_1},
  p(W | λ) = ∏_{i=1}^{d} ∏_{j=1}^{q} p(W_{ij} | λ) = (√λ/2)^{dq} exp{−√λ ||W||_1},

where ||·||_1 denotes the absolute value for a scalar and the 1-norm for a matrix.

Using Bayes' rule, the log-posterior of Θ can be computed as follows:

  ln p(Θ | T) = ln p(T | Θ) + ln p(Θ) + c_0
              = −(N/2) [ln |C| + tr(C^{-1}H)] − √λ ||W||_1 + ln p(µ) + ln p(σ²) + c_1,    (2)

where c_0 and c_1 are constants independent of Θ.
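When p(µ) and p(σ²) are taken to be uniform (as assumed later in the paper), maximizing (2) simply adds the L1 penalty to the PRPCA log-likelihood. A hedged one-line sketch, reusing the hypothetical prpca_loglik helper from the previous snippet:

```python
import numpy as np

def sprp_map_objective(T, W, mu, sigma2, Delta, lam):
    """Log-posterior (2) up to constants: ln p(T|Theta) - sqrt(lam) * ||W||_1."""
    # prpca_loglik is the helper sketched in the PRPCA section above
    return prpca_loglik(T, W, mu, sigma2, Delta) - np.sqrt(lam) * np.abs(W).sum()
```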
For simplicity, here we adopt the maximum a posteriori (MAP) strategy to estimate the parameters. Due to the L1 constraint on W, the MAP estimation of W will naturally induce sparsity, which means that many entries of W will be automatically driven to zero during the learning procedure. In this paper, we assume that p(µ) and p(σ²) are uniform.

Remark 1. Of course, we may also put non-uniform priors, such as conjugate priors, on µ and σ². Here, uniform priors for µ and σ² are adopted mainly for fair comparison because they are also adopted in PRPCA. Alternatively, we may treat all the parameters as random variables and resort to fully Bayesian methods, such as variational methods (Jordan et al. 1999), for learning and inference. However, since the focus of this paper is on demonstrating the promising advantages of sparsity under the PRPCA framework, all these possible variants are left to future extensions.

It is not easy to directly optimize the objective function in (2), though. As in (Park and Casella 2008; Guan and Dy 2009), we adopt a hierarchical interpretation of the Laplace prior:

  p(Z_{ij} | λ) = (λ/2) exp{−(λ/2) Z_{ij}}, for Z_{ij} ≥ 0,    (3)
  W_{ij} | Z_{ij} ∼ N(0, Z_{ij}).    (4)

It is easy to show that this hierarchical reformulation is equivalent to the original Laplace prior, because

  p(W_{ij} | λ) = ∫ p(W_{ij} | Z_{ij}) p(Z_{ij} | λ) dZ_{ij} = (√λ/2) exp{−√λ ||W_{ij}||_1}.    (5)

Figure 1(b) depicts the graphical model of SPRP as compared with that of PRPCA in Figure 1(a).

Learning

By setting the gradient of ln p(Θ | T) with respect to µ to zero, we get the (closed-form) MAP estimate of µ:

  µ̂ = T∆e / (e^T ∆e).    (6)

For the other parameters (W and σ²) of SPRP, we derive an (iterative) EM (Dempster, Laird, and Rubin 1977) algorithm to learn them. For the rest of this paper, we still use Θ to refer to the parameters, but it now contains only W and σ² because µ can be directly computed by (6). During the EM learning procedure, we treat {Z, X} as missing data and {T, Z, X} as complete data. The EM algorithm for MAP estimation operates by alternating between the following two steps:

• E-step: The expectation of the complete-data log-posterior with respect to the distribution of the missing variables {Z, X} is computed. This expected value, often called the Q-function, is defined as follows:

  Q(Θ | Θ(t)) = ∫∫ p(Z, X | Θ(t), T) ln p(Θ | T, Z, X) dZ dX.

• M-step: The Q-function is maximized to update the parameters:

  Θ(t+1) = argmax_Θ Q(Θ | Θ(t)).

The whole EM learning procedure is summarized in Algorithm 1, and the detailed derivation can be found in (Li 2010). Note that as in (Li, Yeung, and Zhang 2009), we use W and σ² to denote the old values and W̃ and σ̃² to denote the updated ones.

Algorithm 1 EM algorithm for SPRP
  Initialize W and σ².
  for t = 1 to T
    E-step: Compute the sufficient statistics
      M = W^T W + σ² I_q,
      ⟨X⟩ = M^{-1} W^T (T − µe^T),
      ⟨X∆X^T⟩ = Nσ² M^{-1} + ⟨X⟩ ∆ ⟨X⟩^T,
      ⟨1/Z_{ij}⟩ = √λ / ||W_{ij}||_1.
    M-step: Update the parameters
      for i = 1 to d
        Σ_i = diag(||W_{i1}||_1/√λ, ..., ||W_{iq}||_1/√λ),
        W̃_{i*} = H_{i*} W M^{-1} Σ_i [(σ² I_q + M^{-1} W^T H W) M^{-1} Σ_i + (σ²/N) I_q]^{-1}.
      end for
      σ̃² = tr[H − H W M^{-1} W̃^T] / d.
  end for

To learn the hyperparameter λ, we may resort to the cross-validation strategy. Alternatively, we may treat λ as one of the parameters, just like W, to get the following EM updating equation: λ = 2dq / (Σ_{i=1}^{d} Σ_{j=1}^{q} ⟨Z_{ij}⟩).
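The following NumPy sketch follows Algorithm 1 for the Laplace prior as literally as possible. It is a minimal illustration under several assumptions: W is initialized randomly here (the paper initializes it with PCA), entries are simply thresholded to zero after each iteration, no convergence check is performed, and all names are ours.

```python
import numpy as np

def sprp_em(T, A, q, lam, n_iter=30, gamma=1e-6, sigma2=1e-6, tol=1e-12):
    """Minimal sketch of Algorithm 1 (SPRP with the Laplace prior)."""
    d, N = T.shape
    B = np.eye(N) + A
    Delta = gamma * np.eye(N) + B.T @ B
    ones = np.ones(N)
    mu = (T @ Delta @ ones) / (ones @ Delta @ ones)               # Eq. (6)
    Tc = T - mu[:, None]
    H = Tc @ Delta @ Tc.T / N                                     # Eq. (1)
    W = 0.01 * np.random.default_rng(0).standard_normal((d, q))   # paper uses PCA init
    for _ in range(n_iter):
        # E-step quantities needed by the M-step
        M = W.T @ W + sigma2 * np.eye(q)
        Minv = np.linalg.inv(M)
        G = Minv @ W.T @ H @ W                                    # M^{-1} W^T H W (q x q)
        # M-step: row-wise update of W, then sigma^2
        W_new = np.empty_like(W)
        for i in range(d):
            Sigma_i = np.diag(np.abs(W[i]) / np.sqrt(lam))
            lhs = H[i] @ W @ Minv @ Sigma_i                       # 1 x q row vector
            rhs = (sigma2 * np.eye(q) + G) @ Minv @ Sigma_i + (sigma2 / N) * np.eye(q)
            W_new[i] = np.linalg.solve(rhs.T, lhs)                # = lhs @ rhs^{-1}
        sigma2 = np.trace(H - H @ W @ Minv @ W_new.T) / d
        W = W_new
        W[np.abs(W) < tol] = 0.0                                  # entries driven to zero
    return W, mu, sigma2
```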
SPRP with Jeffreys Prior

The Jeffreys prior (Figueiredo 2003; Guan and Dy 2009) is a noninformative prior. To use the Jeffreys prior, we only need to change the density function of Z_{ij} in (3) to p(Z_{ij}) ∝ 1/Z_{ij}. We can see that there exist no hyperparameters in the Jeffreys prior, but the Laplace prior does have the hyperparameter λ, which needs to be learned.

With the Jeffreys prior, the log-posterior can be computed as follows:

  ln p(Θ | T) = ln p(T | Θ) + ln p(Θ) + c_2
              = −(N/2) [ln |C| + tr(C^{-1}H)] − ln ||W||_1 + ln p(µ) + ln p(σ²) + c_3,    (7)

where c_2 and c_3 are constants independent of the parameters. We can see that the only difference between (2) and (7) lies in the difference between the regularization terms √λ ||W||_1 and ln ||W||_1.

To learn the parameters for SPRP with the Jeffreys prior, we only need to change ⟨1/Z_{ij}⟩ and Σ_i in Algorithm 1 as follows: ⟨1/Z_{ij}⟩ = 1/W_{ij}², Σ_i = diag(W_{i1}², ..., W_{iq}²).
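In the sketch above, switching from the Laplace prior to the Jeffreys prior only changes the row-wise scaling matrix Σ_i (and removes the need for λ); a hedged helper showing the one-line difference (names are ours):

```python
import numpy as np

def row_scaling(w_row, lam=None, prior="jeffreys"):
    """Sigma_i used in the M-step of Algorithm 1, for either prior (sketch)."""
    if prior == "laplace":
        return np.diag(np.abs(w_row) / np.sqrt(lam))   # ||W_ij||_1 / sqrt(lambda)
    return np.diag(w_row ** 2)                          # Jeffreys: W_ij^2
```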
Time Complexity

To train the model, we need O(dN) time to compute H and O(Tqd² + Tdq³) for the T EM iterations. Hence, the total time complexity is O(dN + Tqd² + Tdq³). If d > q², Tqd² will be larger than Tdq³, and so the time complexity becomes O(dN + Tqd²), which is equal to that of PRPCA.

If we want to use the learned W to perform projection, the time complexity will depend on the number of nonzero entries in W. Generally speaking, SPRP has lower projection cost than PRPCA because the W in SPRP is more sparse than that in PRPCA.

Experimental Evaluation

As in (Li, Yeung, and Zhang 2009), we adopt PCA to initialize W, initialize σ² to 10⁻⁶, and set γ to 10⁻⁶. We set the number of EM iterations T to 30 because 30 iterations are sufficient for both PRPCA and SPRP to achieve good performance. The baseline methods for comparison include PCA, sparse probabilistic projection (SPP) (Archambeau and Bach 2008) and PRPCA. Through the experiments we want to verify the following claims: (1) PCA cannot effectively exploit the relational information in relational data; furthermore, it cannot learn interpretable results. (2) Due to its i.i.d. assumption, SPP cannot achieve satisfactory performance even though it can learn interpretable results. (3) PRPCA can effectively exploit the relational information, but it cannot learn interpretable results. (4) SPRP not only can effectively exploit the relational information, but it can also learn interpretable results.

Data Sets

Three data sets are used for our experimental evaluation. The first two are the preprocessed WebKB (Craven et al. 1998) and Cora (McCallum et al. 2000) data sets used in (Zhu et al. 2007; Li, Yeung, and Zhang 2009). The third data set is called Cora-IR, which contains the information retrieval papers from the original Cora data set (McCallum et al. 2000). All these data sets use the bag-of-words representation for the content information.

The WebKB data set contains 4,017 web pages from the computer science departments of four universities (Cornell, Texas, Washington, and Wisconsin). Each web page is labeled with one of seven categories: student, professor, course, project, staff, department, and "other". The original links are directed. We adopt the same strategy as that in (Li, Yeung, and Zhang 2009) to convert the directed links into undirected ones. Some characteristics of the WebKB data set are summarized in Table 1.

Table 1: Characteristics of the WebKB data set.
Data Set     #classes  #instances  #words
Cornell      7         827         4,134
Texas        7         814         4,029
Washington   7         1,166       4,165
Wisconsin    6         1,210       4,189

The Cora data set used in (Li, Yeung, and Zhang 2009) contains 4,343 research papers from the computer science community. The content information refers to the paper abstracts and the links refer to the citations. The task is to classify each paper into one of the subfields of data structure (DS), hardware and architecture (HA), machine learning (ML), and programming language (PL). Some characteristics of the Cora data set are summarized in Table 2.

Table 2: Characteristics of the Cora data set.
Data Set  #classes  #instances  #words
DS        9         751         6,234
HA        7         400         3,989
ML        7         1,617       8,329
PL        9         1,575       7,949

Because we do not have the dictionary for generating the bag-of-words representation in the preprocessed WebKB and Cora data sets, we collect another data set, called Cora-IR. The Cora-IR data set contains 350 information retrieval papers from the original Cora set (McCallum et al. 2000).
There are four subfields (classes) in Cora-IR: retrieval, filtering, extraction, and digital library. We use the title of each paper for the content information. After preprocessing, we get a dictionary of 787 words. For each word, there is at least one instance (paper) containing it. We will use this dictionary to demonstrate the interpretability of SPRP.

In (Li, Yeung, and Zhang 2009), only information about the words (bag-of-words) is used to represent the content information. We expand the original content features by adding some extra features extracted from the original directed links. The ith link feature is referred to as link-to-instance_i. For example, if instance k links to instance i, the ith link feature of instance k will be 1; otherwise, it will be 0. In fact, this kind of link feature can also be treated as content information. For example, given a paper, the link-to-instance_i feature actually reflects whether the reference part of that paper contains paper i. For a web page, the link-to-instance_i feature can also be directly extracted from the HTML file (content) of that page. Note that it is somewhat impractical to treat the linked-by-instance_i features as content features because they cannot be directly extracted from the content of the instances. For example, the papers citing a specific paper i are not included in the content of paper i. After extracting the link features, we combine the original bag-of-words with the link features to obtain the expanded content features. We can see that the way to get the expanded content features also assumes that the instances are i.i.d. We will show that this way of using link information is not good enough to capture the structure information in relational data. On the contrary, PRPCA and SPRP, which are not based on the i.i.d. assumption, can provide more effective ways to model relational data. In what follows, we will refer to the original bag-of-words representation as the original content features.

Laplace Prior vs. Jeffreys Prior

We define the degree of sparsity (DoS) of W as follows:

  DoS = (number of zero entries in W) / (dq) × 100%.
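For reference, the DoS can be computed directly from a learned projection matrix; a minimal sketch (the zero tolerance is our own choice, not specified in the paper):

```python
import numpy as np

def degree_of_sparsity(W, tol=0.0):
    """DoS = (number of zero entries in W) / (d*q) * 100%."""
    return 100.0 * np.mean(np.abs(W) <= tol)
```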
From (2), we can see that λ in the Laplace prior controls the DoS of the learned W: the larger λ is, the larger the DoS of W will be. Here, we vary λ to get different DoS values and then evaluate the corresponding accuracy. Due to space limitation, we only report results on the DS data set because the other data sets exhibit similar properties. The accuracy of PRPCA on the DS data set is 68.1%. The corresponding results of SPRP with the Laplace prior are shown in Table 3.

Table 3: Accuracy (Acc) against DoS for SPRP with the Laplace prior.
DoS (%)  30    50    60    70    76    80    90    96
Acc (%)  68.8  68.7  68.1  67.5  66.9  66.7  65.9  63.2

From Table 3, we can discover some interesting properties of SPRP:
• In general, the larger the DoS is, the lower the accuracy will be. This is reasonable because, with a larger DoS, less information about the features is used to construct the principal components.
• Compared with PRPCA, SPRP can achieve a DoS as large as 60% without compromising its accuracy. Even when DoS = 70%, the accuracy of SPRP is still comparable with that of PRPCA. This shows that the sparsity pursuit in SPRP is very meaningful because it can obtain interpretable results without compromising accuracy.

For the Jeffreys prior, there are no hyperparameters to tune. After learning, we get an accuracy of 68.1% with DoS = 76%. Hence, with a similar DoS, the Jeffreys prior can achieve slightly higher accuracy than the Laplace prior. From Table 3, we also find that a relatively good tradeoff between the DoS and accuracy can be obtained if 70% < DoS < 80%. Hence, we can say that the Jeffreys prior can adaptively learn a good DoS. Due to this nice property, we only report the results of SPRP with the Jeffreys prior in the rest of this paper. For fair comparison, we also use the Jeffreys prior for SPP (Archambeau and Bach 2008).

Interpretability

For all the projection methods, we set the dimensionality of the latent space to 50. For Cora-IR, we adopt the original content features because we need to use the selected words for illustration. For all the other data sets, we use the expanded content features.

The DoS comparison of PCA, SPP, PRPCA and SPRP is shown in Table 4. We can see that the DoS of both PCA and PRPCA on WebKB and Cora-IR is either 0 or close to 0, which means that all the original variables (i.e., words) will be used to compute the principal components for PCA and PRPCA. For Cora, there exist some features (words) that are not contained in any instance (paper); that is, all entries in the corresponding rows of the content matrix T are zero. We also find that the zeros in W are from those rows corresponding to the all-zero rows in T. Hence, we can say that on Cora, PCA and PRPCA cannot learn sparse projection matrices either. Due to this non-sparse property, the results of PCA and PRPCA lack interpretability. On the contrary, both SPP and SPRP can learn sparse projection matrices. Compared with SPP, SPRP achieves a lower DoS. However, the discrimination ability of SPRP is much higher than that of SPP, as will be shown later.

Table 4: DoS (in %) comparison of PCA, SPP, PRPCA and SPRP.
Data Set     PCA  SPP  PRPCA  SPRP
Cora-IR      0    90   0      72
DS           18   88   20     76
HA           16   86   18     72
ML           10   90   12     76
PL           11   90   13     76
Cornell      0    74   0      48
Texas        0    77   0      42
Washington   1    74   1      47
Wisconsin    0    75   0      47

To further compare the results of SPP and SPRP in terms of interpretability, we show some details of the first six columns of W in Table 5.
In the table, the 'Selected Words', arranged in descending order in terms of their W values, correspond to the top 10 nonzero entries in W. It is easy to see that the learned projection matrix of either SPRP or SPP does show some discrimination ability. More specifically, W_{*1} mainly corresponds to the class retrieval, W_{*2} to filtering, W_{*3} and W_{*4} to extraction, and W_{*5} and W_{*6} to digital library. This is very promising because we can use the magnitude of the corresponding principal components to measure the class proportions of each instance. A detailed comparison between the words selected by SPP and SPRP shows that the words selected by SPRP are more discriminative than those selected by SPP. For example, 'dictionary' is more related to retrieval than 'agents', and 'symbolic' and 'wrapper' are more related to extraction than 'multi' and 'empirical'.

For SPRP, 116 out of 787 words are not used to construct any principal component, which means that the entire rows of W corresponding to those words are zero. Hence, SPRP can also be used to perform feature elimination, which will speed up the collection process for new data. For example, some eliminated words include 'wwww', 'aboutness', 'uu', 'erol', 'stylistic', 'encounter', 'classificatin', 'hypercode', 'broswer', 'lacking', 'multispectral', and 'exchanges'. It is interesting to note that most of them are typos or not related to information retrieval at all.
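The kind of inspection reported in Table 5 below, and the feature-elimination behaviour just described, can be reproduced mechanically from a sparse W and the dictionary. A minimal sketch; ranking by the entries of each column and the helper names are our assumptions:

```python
import numpy as np

def top_words(W, vocab, j, k=10):
    """The k largest nonzero entries of column W_{*j}, returned as words."""
    order = np.argsort(W[:, j])[::-1][:k]
    return [vocab[i] for i in order if W[i, j] != 0]

def eliminated_words(W, vocab):
    """Words whose entire row of W is zero (candidates for elimination)."""
    return [vocab[i] for i in range(W.shape[0]) if not W[i].any()]
```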
Table 5: Some details of the projection matrices learned by SPRP and SPP. Selected words are arranged in descending order in terms of their W values.
SPRP
  W_{*1}: information; retrieval; cross; language; extraction; system; evaluation; techniques; dictionary; incremental
  W_{*2}: text; categorization; learning; classification; information; feature; retrieval; selection; classifiers; algorithm
  W_{*3}: web; wide; learning; world; information; extract; symbolic; aid; formatting; extraction
  W_{*4}: extraction; information; learning; structured; rules; wrapper; documents; induction; grammatical; machine
  W_{*5}: text; language; digital; wide; world; high; processing; structured; sources; information
  W_{*6}: digital; library; learning; libraries; image; market; services; decoding; stanford; metadata
SPP
  W_{*1}: information; retrieval; agents; evaluation; system; cross; language; intelligent; dissemination; distributed
  W_{*2}: text; categorization; learning; classification; information; feature; selection; extraction; study; case
  W_{*3}: web; wide; world; learning; information; search; multi; patterns; server; performance
  W_{*4}: extraction; information; learning; rules; disclosure; automatically; structured; basis; dictionary; empirical
  W_{*5}: text; digital; world; wide; information; system; library; categorization; processing; high
  W_{*6}: digital; library; learning; services; libraries; market; video; access; navigating; agents

Accuracy

As in (Li, Yeung, and Zhang 2009), we adopt 5-fold cross validation to evaluate the accuracy. The dimensionality of the latent space is set to 50 for all the dimensionality reduction methods. After dimensionality reduction, a linear support vector machine (SVM) is trained for classification based on the low-dimensional representation. The average classification accuracy for 5-fold cross validation, together with the standard deviation, is used as the performance metric.

The results on Cora and WebKB are shown in Figure 2 and Figure 3, respectively.

Figure 2: Results in average classification accuracy with standard deviation on the Cora data set.
Figure 3: Results in average classification accuracy with standard deviation on the WebKB data set.

PRPCA based on the original content features is denoted as PRPCA0, which achieves performance comparable with the state-of-the-art methods (Li, Yeung, and Zhang 2009). All the other methods are based on the expanded content features. Compared with PCA, the higher accuracy of PRPCA0 shows that it is not good enough to just extract the extra information from the links and still assume the instances to be i.i.d. Comparison between PRPCA and PRPCA0 shows that slightly better performance can be achieved with the expanded content features, particularly for the Cora data set. Comparison between PRPCA and PCA verifies the claim in (Li, Yeung, and Zhang 2009) that PRPCA dramatically outperforms PCA by eliminating the i.i.d. assumption. Comparison between SPP and PCA shows that the sparsity pursuit does not necessarily deteriorate the accuracy for the case with the i.i.d. assumption. Comparison between SPRP and SPP shows that, under the sparsity pursuit case, dramatic accuracy improvement can also be achieved by explicitly modeling the covariance between instances, which once again verifies that the i.i.d. assumption is unreasonable for relational data. Finally, comparison between SPRP and PRPCA shows that under the PRPCA framework, we can also achieve sparsity without compromising accuracy.

Projection Cost

When the learned projection matrix is used to perform projection, the sparsity of SPRP makes its projection cost much lower than that of PRPCA. Table 6 reports the projection time needed to perform projection on the Cora and WebKB data sets. The test is performed with a MATLAB implementation on a 2.33GHz personal computer. We can see that SPRP is much faster than PRPCA for the projection operation.

Table 6: Projection time (in seconds) comparison.
Data Set     PRPCA  SPRP
DS           2.431  0.749
HA           0.834  0.284
ML           8.225  2.272
PL           7.507  2.123
Cornell      2.362  1.241
Texas        2.273  1.323
Washington   3.588  1.946
Wisconsin    3.778  2.029
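The projection-time gap in Table 6 comes directly from the zeros in W: projecting a new instance, x_n = W^T(t_n − µ), costs time proportional to the number of nonzero entries of W. A hedged SciPy sketch of this idea (our own illustration, not the authors' MATLAB code):

```python
import numpy as np
from scipy import sparse

def project(W, mu, T_new):
    """x_n = W^T (t_n - mu) for every column t_n of T_new, exploiting sparsity."""
    W_sp = sparse.csr_matrix(W)             # stores only the nonzero entries of W
    return W_sp.T @ (T_new - mu[:, None])   # cost scales with nnz(W), not d*q
```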
Conclusion and Future Work

In this paper, we have proposed a novel model, SPRP, to learn a sparse projection matrix for relational dimensionality reduction. Compared with PRPCA, the sparsity in SPRP not only makes its results interpretable, but also makes the projection operation much more efficient without compromising its accuracy. Furthermore, SPRP can also be used to perform feature elimination, which will speed up the collection process for new data. Compared with traditional sparse projection methods based on the i.i.d. assumption, SPRP can learn a more discriminative projection by explicitly modeling the covariance between instances. SPRP also makes a theoretical contribution to the area of sparsity pursuit because it is the first model that pursues sparsity without requiring the i.i.d. assumption. Hence, it can inspire us to relax the i.i.d. assumption in other sparse models as well to further boost model performance.

Acknowledgments

Li is supported by the NSFC (No. 61100125) and the 863 Program of China (No. 2011AA01A202, No. 2012AA011003). Yeung is supported by General Research Fund 621310 from the Research Grants Council of Hong Kong.
References

Archambeau, C., and Bach, F. 2008. Sparse probabilistic projections. In NIPS 21.
Caron, F., and Doucet, A. 2008. Sparse Bayesian nonparametric regression. In ICML.
Craven, M.; DiPasquo, D.; Freitag, D.; McCallum, A.; Mitchell, T. M.; Nigam, K.; and Slattery, S. 1998. Learning to extract symbolic knowledge from the world wide web. In AAAI/IAAI.
Dempster, A.; Laird, N.; and Rubin, D. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39(1):1-38.
Figueiredo, M. A. T. 2003. Adaptive sparseness for supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 25(9):1150-1159.
Getoor, L., and Taskar, B. 2007. Introduction to Statistical Relational Learning. The MIT Press.
Guan, Y., and Dy, J. G. 2009. Sparse probabilistic principal component analysis. In AISTATS.
Gupta, A. K., and Nagar, D. K. 2000. Matrix Variate Distributions. Chapman & Hall/CRC.
Jolliffe, I. T. 2002. Principal Component Analysis. Springer, second edition.
Jordan, M. I.; Ghahramani, Z.; Jaakkola, T.; and Saul, L. K. 1999. An introduction to variational methods for graphical models. Machine Learning 37(2):183-233.
Li, W.-J., and Yeung, D.-Y. 2009. Relation regularized matrix factorization. In IJCAI.
Li, W.-J., and Yeung, D.-Y. 2011. Social relations model for collaborative filtering. In AAAI.
Li, W.-J.; Yeung, D.-Y.; and Zhang, Z. 2009. Probabilistic relational PCA. In NIPS 22.
Li, W.-J.; Yeung, D.-Y.; and Zhang, Z. 2011. Generalized latent factor models for social network analysis. In IJCAI.
Li, W.-J.; Zhang, Z.; and Yeung, D.-Y. 2009. Latent Wishart processes for relational kernel learning. In AISTATS.
Li, W.-J. 2010. Latent Factor Models for Statistical Relational Learning. Ph.D. Dissertation, Hong Kong University of Science and Technology.
McCallum, A.; Nigam, K.; Rennie, J.; and Seymore, K. 2000. Automating the construction of internet portals with machine learning. Information Retrieval 3(2):127-163.
Park, T., and Casella, G. 2008. The Bayesian lasso. Journal of the American Statistical Association 103(482):681-686.
Sigg, C. D., and Buhmann, J. M. 2008. Expectation-maximization for sparse and non-negative PCA. In ICML.
Tipping, M. E., and Bishop, C. M. 1999. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B 61(3):611-622.
Zhu, S.; Yu, K.; Chi, Y.; and Gong, Y. 2007. Combining content and link for classification using matrix factorization. In SIGIR.
Zou, H., and Hastie, T. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B 67(2):301-320.
Zou, H.; Hastie, T.; and Tibshirani, R. 2006. Sparse principal component analysis. Journal of Computational and Graphical Statistics 15(2):265-286.