Latent Wishart Processes for Relational Kernel Learning

Wu-Jun Li, Dept. of Comp. Sci. and Eng., Hong Kong Univ. of Sci. and Tech., Hong Kong, China. liwujun@cse.ust.hk
Zhihua Zhang, College of Comp. Sci. and Tech., Zhejiang University, Zhejiang 310027, China. zhzhang@cs.zju.edu.cn
Dit-Yan Yeung, Dept. of Comp. Sci. and Eng., Hong Kong Univ. of Sci. and Tech., Hong Kong, China. dyyeung@cse.ust.hk

Abstract

One main concern towards kernel classifiers is their sensitivity to the choice of the kernel function or kernel matrix, which characterizes the similarity between instances. Many real-world data, such as web pages and protein-protein interaction data, are relational in nature in the sense that different instances are correlated (linked) with each other. The relational information available in such data often provides strong hints on the correlation (or similarity) between instances. In this paper, we propose a novel relational kernel learning model based on latent Wishart processes (LWP) to learn the kernel function for relational data. This is done by seamlessly integrating the relational information and the input attributes into the kernel learning process. Through extensive experiments on real-world applications, we demonstrate that our LWP model can give very promising performance in practice.

Appearing in Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS) 2009, Clearwater Beach, Florida, USA. Volume 5 of JMLR: W&CP 5. Copyright 2009 by the authors.

1 Introduction

Kernel methods, such as support vector machines (SVM) and Gaussian processes (GP) (Rasmussen and Williams, 2006), have been widely used in many applications giving very promising performance. In kernel methods, the similarity between instances is represented by a kernel function defined over the input attributes. In general, the choice of an appropriate kernel function and its corresponding parameters is difficult in practice. Poorly chosen kernel functions can impair the performance significantly. Hence, kernel learning (Lanckriet et al., 2004; Zhang et al., 2006), which tries to find a good kernel matrix for the training data, is very important for kernel-based classifier design.

In many real-world applications, relationships or "links" between (some) instances may also be available in the data in addition to the input attributes. Data of this sort, referred to as relational data (Getoor and Taskar, 2007), can be found in such diverse application areas as web mining, social network analysis, bioinformatics, marketing, and so on. In relational data, the attributes of connected (linked) instances are often correlated, and the class label of one instance may have an influence on that of a linked instance. This means that the relationships (or links) between instances are very informative for instance classification (Getoor and Taskar, 2007), sometimes even much more informative than the input attributes. For example, two hyperlinked web pages are very likely to be related to the same topic, even when their attributes may look quite different when represented as bags of words. In biology, interacting proteins are more likely to have the same biological function than those without interaction. In marketing, knowing one's shopping habits provides useful information about his/her close friends' shopping inclinations. Hence, in such data, relational information often provides very strong hints to refine the correlation (or similarity) between instances.
This motivates our work on relational kernel learning (RKL), which refers to learning a kernel matrix (or a kernel function) for relational data by incorporating relational information into the learning process.

Thanks to its promising potential in many application areas, relational learning (Getoor and Taskar, 2007), which tries to model relational data, has become a hot topic in the machine learning community. Many methods have been proposed over the past few years. For example, probabilistic relational models
(PRM) (Getoor and Taskar, 2007), such as relational Bayesian networks (RBN) (Getoor et al., 2002), relational Markov networks (RMN) (Taskar et al., 2002) and relational dependency networks (RDN) (Neville and Jensen, 2007), try to model relational data by adapting conventional graphical models to the relational scenario. There are also some methods that handle relational data in heuristic ways (Macskassy and Provost, 2007). These methods are not based on the kernel approach.

More recently, the relational Gaussian process (RGP) (Chu et al., 2007) and the mixed graph Gaussian process (XGP) (Silva et al., 2008) were proposed to model relational data from a kernel point of view. Both of them utilize relational information to learn the covariance (or kernel) matrix for a Gaussian process. Hence, RGP and XGP are relational kernel learning methods, and we follow their path in this paper.

We propose a novel model, called LWP hereafter, based on latent Wishart processes to learn the kernel for relational data by seamlessly integrating relational information with input attributes. Several promising properties of LWP are highlighted here:

• LWP adopts the kernel on the input attributes to define the prior for the target kernel, and uses the link information to define the likelihood. Finally, MAP estimation is applied to learn the target kernel. Hence, LWP seamlessly integrates the two views into a principled framework.

• To the best of our knowledge, LWP is the first model that employs Wishart processes for relational learning.

• Unlike many other existing models, such as XGP (Silva et al., 2008), which are only suitable for the transductive setting, LWP is naturally applicable to inductive inference over unseen test data.

• During the kernel learning phase, no label information for the instances is needed, which makes it easy to extend LWP for visualization and clustering of relational data.

The rest of this paper is organized as follows. Section 2 introduces some basic background on Wishart processes. Section 3 presents our LWP model and Section 4 compares it with related work. Experimental results are presented in Section 5, and Section 6 concludes the paper.

2 Wishart Processes

Definition 1 (Gupta and Nagar, 2000) An n×n random symmetric positive definite matrix A is said to have a Wishart distribution with parameters n, q, and n×n scale matrix Σ ≻ 0, written as A ~ W_n(q, Σ), if its p.d.f. is given by

  \frac{|A|^{(q-n-1)/2}}{2^{qn/2} \Gamma_n(q/2) |\Sigma|^{q/2}} \exp\left(-\frac{1}{2}\,\mathrm{tr}(\Sigma^{-1}A)\right),  q ≥ n.   (1)

Here Σ ≻ 0 means that Σ is positive definite.

Assume that we are given an input space X = {x_1, x_2, ...}.

Definition 2 (Zhang et al., 2006) The kernel function {A(x_i, x_j); x_i, x_j ∈ X} is said to be a Wishart process (WP) if for any n ∈ N and {x_1, ..., x_n} ⊆ X, the n×n random matrix A = [A(x_i, x_j)]_{i,j=1}^n follows a Wishart distribution.

For A : X × X → R, there exists a corresponding mapping (say B) from the input space X to a latent (feature) space (say F ⊂ R^q), i.e., a vector-valued function B(x) = (B_1(x), ..., B_q(x))' such that A(x_i, x_j) = B(x_i)'B(x_j). Zhang et al. (2006) proved that A(x_i, x_j) is a Wishart process if and only if the B_j(x) (j = 1, ..., q) are q mutually independent Gaussian processes.

Although q is possibly infinite, we assume that it is finite in this paper for simplicity. Denote A = [A(x_i, x_j)]_{i,j=1}^n (n×n) and B = [B(x_1), ..., B(x_n)]' = [b_1, ..., b_n]' (n×q).
Then the b_i are the latent vectors, and A = BB' is the linear kernel in the latent space but a nonlinear kernel w.r.t. the input space.

Let Σ⊗I_q denote the Kronecker product of Σ and I_q (the q×q identity matrix). Using the notation N_{n,q}(0, Σ⊗I_q) in (Gupta and Nagar, 2000, page 55) for matrix-variate normal distributions, we have

Theorem 1 Let Σ be an n×n positive definite matrix. Then A is distributed according to the Wishart distribution W_n(q, Σ) if and only if B is distributed according to the matrix-variate normal distribution N_{n,q}(0, Σ⊗I_q).

Proof: (Sketch) If q ≥ n, the theorem can be proven by combining Theorem 3.2.2 and Theorem 3.3.3 in (Gupta and Nagar, 2000). If q < n, A follows a singular Wishart distribution (Srivastava, 2003), and the corresponding B is distributed according to the singular matrix-variate normal distribution (Definition 2.4.1 in (Gupta and Nagar, 2000)). The proof is similar to the case of q ≥ n; we omit the details here due to the page limit constraint. □
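To make Theorem 1 concrete, the following small numerical sketch (assuming Python with NumPy and SciPy; the names are illustrative and this is not the authors' implementation) draws B with q i.i.d. N_n(0, Σ) columns, forms A = BB', and evaluates the density of eq. (1) against SciPy's Wishart implementation:

```python
import numpy as np
from scipy.special import multigammaln
from scipy.stats import wishart

rng = np.random.default_rng(0)
n, q = 5, 8                                  # q >= n, so the density (1) is proper

# An arbitrary n x n positive definite scale matrix Sigma.
X = rng.standard_normal((n, 2 * n))
Sigma = X @ X.T / (2 * n) + 1e-3 * np.eye(n)

# Theorem 1: B ~ N_{n,q}(0, Sigma (x) I_q) means the q columns of B are
# i.i.d. N_n(0, Sigma); then A = B B' is a W_n(q, Sigma) draw.
L = np.linalg.cholesky(Sigma)
B = L @ rng.standard_normal((n, q))
A = B @ B.T

def wishart_logpdf(A, q, Sigma):
    """Log-density of eq. (1) for a W_n(q, Sigma) matrix, q >= n."""
    n = A.shape[0]
    _, logdet_A = np.linalg.slogdet(A)
    _, logdet_S = np.linalg.slogdet(Sigma)
    return ((q - n - 1) / 2.0 * logdet_A
            - q * n / 2.0 * np.log(2.0)
            - multigammaln(q / 2.0, n)
            - q / 2.0 * logdet_S
            - 0.5 * np.trace(np.linalg.solve(Sigma, A)))

# The hand-coded density of eq. (1) agrees with SciPy's Wishart log-pdf.
print(wishart_logpdf(A, q, Sigma))
print(wishart.logpdf(A, df=q, scale=Sigma))
```

Agreement of the two printed values only checks the density formula; the sampling construction itself, q independent Gaussian columns stacked into B, is the content of Theorem 1.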
Theorem 1 shows an interesting connection; namely, the degree of freedom q in the Wishart process is the dimensionality of the latent space F. Theorem 1 extends the results given in (Gupta and Nagar, 2000) and (Zhang et al., 2006), where the condition q ≥ n is required. Moreover, the asymptotic distribution of q^{1/2}(A − Σ) as q → ∞ is N_{n,n}(0, (I_{n^2} + C)(Σ⊗Σ)), where C is the commutation matrix (Muirhead, 1982).

3 Methodology

One common representation format for relational data is a graph, in which a node corresponds to an instance and an edge corresponds to a pairwise relationship between the two connected nodes. Although directed edges are common in some data sets, they can be converted into undirected edges in many cases. In this paper, we only focus on modelling undirected edges, which represent symmetric (or reciprocal) relationships. Furthermore, in the real world the relationship between two nodes can be either "positive" or "negative", which means that the attributes of the connected nodes have positive or negative correlation, respectively. Due to the page limit constraint, here we only consider positive relationships, although it is straightforward to extend our proposed model to negative relationships. The extension of our model to directed graphs or hypergraphs with negative relationships will be pursued in our future work.

Suppose we are given a set of data {(x_i, y_i, z_ik) : i, k = 1, ..., n}, where x_i = (x_{i1}, ..., x_{ip})' and y_i are respectively the input feature vector and the label for instance i, and z_ik is a binary variable indicating the existence of a relationship (link) between instances i and k, namely, z_ik = 1 if there exists a link between x_i and x_k, and z_ik = 0 otherwise.

Rather than considering the design of a kernel classifier, we focus our attention on the learning of the kernel function for any kernel classifier by utilizing the relational information, i.e., relational kernel learning. Unlike conventional kernel learning methods (Lanckriet et al., 2004; Zhang et al., 2006), which use the class labels to learn the kernel matrix, the setting for relational kernel learning in LWP does not use any instance label. LWP only uses the input feature vectors and the binary variables {z_ik} to learn the kernel. Thus, in essence, LWP is unsupervised in nature.

3.1 Model

The available information for RKL includes both the input attributes and the binary variables {z_ik}. The goal of RKL is to learn a target kernel function A(x_i, x_k) which takes both the input attributes and the relational information into consideration. Let a_ik = A(x_i, x_k). Then A = [a_ik]_{i,k=1}^n will be a positive semidefinite matrix. We now model A by a (singular) Wishart distribution W_n(q, Σ). This implies that A(x_i, x_k) follows a Wishart process.

Let K(x_i, x_k) be a kernel function defined just on the input attributes. For example, the linear kernel K(x_i, x_k) = x_i'x_k is such a kernel function. Similarly, K = [K(x_i, x_k)]_{i,k=1}^n is a positive semidefinite matrix. If we set Σ = β(K + λI_n) with β > 0 and λ being typically a very small number to make Σ ≻ 0, we have

  A ~ W_n(q, β(K + λI)).   (2)

Consequently, the input attributes are successfully integrated into the target kernel function.

To incorporate the relational information for the learning of A(·, ·), we regard each z_ik as a Bernoulli variable, which is determined by the latent variable a_ik via a logistic link function s_ik.
Given the a_ik, we further assume that the z_ik are independent. Moreover, we assume that the links are symmetric with no self loops, i.e., z_ik = z_ki and z_ii = 0. Letting Z = [z_ik]_{i,k=1}^n, we have

  p(Z|A) = \prod_{i=1}^{n} \prod_{k=i+1}^{n} s_{ik}^{z_{ik}} (1 − s_{ik})^{1−z_{ik}},  with  s_{ik} = \frac{\exp(a_{ik}/2)}{1 + \exp(a_{ik}/2)}.   (3)

In this way, both the input attributes and the relational information are seamlessly integrated into the same framework, in which the input attributes define the scale matrix of the prior for the distribution of the target kernel function and the relational information defines the likelihood computed based on the target kernel function. Then, learning algorithms such as maximum a posteriori (MAP) estimation can be used to learn the latent variables a_ik. After learning the model, the target kernel function A(x_i, x_k) takes both the input attributes and the relational information into consideration. We can directly use this new kernel function to implement kernel classification methods such as SVM and GP-based classifiers. In our experiments, we apply A(x_i, x_k) as a kernel function for a GP-based classifier.

In this model, the Wishart process A(x_i, x_k) is used to define the latent variables {a_ik}_{i,k=1}^n. We thus call this model the latent Wishart process (LWP) model.
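For concreteness, a small sketch of the two ingredients just described is given below (assuming Python with NumPy; the function and parameter names are illustrative, not from the authors' code): the attribute kernel fixes the scale matrix of the prior in eq. (2), and the links enter only through the Bernoulli likelihood of eq. (3).

```python
import numpy as np

def linear_kernel(X):
    """Centralized linear kernel K(x_i, x_k) = x_i' x_k on the input attributes."""
    Xc = X - X.mean(axis=0)
    return Xc @ Xc.T

def prior_scale(K, beta=1000.0, lam=1e-4):
    """Scale matrix Sigma = beta * (K + lam * I) of the Wishart prior in eq. (2)."""
    return beta * (K + lam * np.eye(K.shape[0]))

def link_log_likelihood(Z, B):
    """Log-likelihood of eq. (3): z_ik ~ Bernoulli(s_ik), s_ik = sigmoid(a_ik / 2),
    with a_ik = b_i' b_k, summed over unordered pairs i < k (symmetric links,
    no self loops)."""
    A = B @ B.T
    S = 1.0 / (1.0 + np.exp(-A / 2.0))
    iu = np.triu_indices_from(A, k=1)
    z, s = Z[iu], S[iu]
    return np.sum(z * np.log(s) + (1.0 - z) * np.log(1.0 - s))
```

The MAP learning of Section 3.2 then maximizes this log-likelihood plus the matrix-variate normal log-prior on B implied by Theorem 1.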
3.2 Learning

It is natural to consider the MAP estimate of A, namely,

  \arg\max_{A} \{\log p(Z|A)p(A)\}.

Theorem 1 shows that finding the MAP estimate of A is equivalent to finding the MAP estimate of B. In particular, we note that for the applications presented in Section 5, small values of q, typically no larger than 50, are sufficient for delivering good performance. This motivates us to instead find the MAP estimate of B, which dramatically reduces the computational cost. Consequently, we attempt to maximize the following log posterior probability:

  L(B) = \log\{p(Z|B)p(B)\}
       = \sum_{i \neq k} \log p(z_{ik}|b_i, b_k) + \log p(B)
       = \sum_{i \neq k} \left[ z_{ik} b_i'b_k/2 − \log(1 + \exp(b_i'b_k/2)) \right] − \frac{1}{2} \mathrm{tr}\!\left( \frac{(K + \lambda I)^{-1}}{\beta} BB' \right) + C
       = \sum_{i \neq k} \left[ z_{ik} b_i'b_k/2 − \log(1 + \exp(b_i'b_k/2)) \right] − \frac{1}{2} \sum_{i,k} \sigma_{ik} b_i'b_k + C,

where [\sigma_{ik}]_{i,k=1}^n = \frac{(K+\lambda I)^{-1}}{\beta} and C is a constant independent of B.

We employ a block quasi-Newton method to maximize L(B) w.r.t. B. The basic idea is to approximate the Hessian matrix \frac{\partial^2 L}{\partial B \partial B'} by the block diagonal matrix \mathrm{Block\,diag}\left( \frac{\partial^2 L}{\partial b_1 \partial b_1'}, \ldots, \frac{\partial^2 L}{\partial b_n \partial b_n'} \right). The Fisher score vector and Hessian matrix of L w.r.t. b_i are given by

  \frac{\partial L}{\partial b_i} = \sum_{j \neq i} (z_{ij} − s_{ij} − \sigma_{ij}) b_j − \sigma_{ii} b_i,
  \frac{\partial^2 L}{\partial b_i \partial b_i'} = − \frac{1}{2} \sum_{j \neq i} s_{ij}(1 − s_{ij}) b_j b_j' − \sigma_{ii} I_q \triangleq −H_i.

Given the initial values b_i(0), the update equations for the b_i are

  b_i(t+1) = b_i(t) + \gamma \cdot H_i(t)^{-1} \left. \frac{\partial L}{\partial b_i} \right|_{B=B(t)},  i = 1, ..., n,   (4)

where \gamma is the step size. A strategy that makes the objective function increase monotonically is to learn \gamma in each update step, but searching for a suitable \gamma in each step might incur a high computational cost. In our experiments, we find that if \gamma is simply set to a value less than 0.1, our algorithm works very well in all cases. Although the objective function then does not necessarily increase monotonically, its long-term trend is increasing. This makes our algorithm not only very fast but also very accurate.

Note that this iterative procedure works in a parallel manner. It may also work sequentially. However, our experiments show that the parallel scheme is more stable than the sequential scheme. As we have mentioned, the size of H_i is q×q with q always being a small number in our experiments, so the iterative procedure is very efficient. In this paper, we adopt KPCA (Schölkopf et al., 1998) to initialize the values b_i(0).

3.3 Out-of-Sample Extension

It is easy for LWP to perform out-of-sample extension (or induction), which means that we can use the learned LWP to predict the b_i for new test data.

Let Z = \begin{pmatrix} Z_{11} & Z_{12} \\ Z_{21} & Z_{22} \end{pmatrix} and \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, where Z_{11} (\Sigma_{11}) is an n_1×n_1 matrix and Z_{22} (\Sigma_{22}) is an n_2×n_2 matrix. Suppose the n_1 instances corresponding to Z_{11} and \Sigma_{11} are training data, and Z_{22} and \Sigma_{22} correspond to new test data. Similarly, we partition B = \begin{pmatrix} B_1 \\ B_2 \end{pmatrix} and A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} B_1B_1' & B_1B_2' \\ B_2B_1' & B_2B_2' \end{pmatrix}. Because B ~ N_{n,q}(0, \Sigma⊗I_q), we have B_1 ~ N_{n_1,q}(0, \Sigma_{11}⊗I_q) and

  B_2 | B_1 ~ N_{n_2,q}\left( \Sigma_{21}\Sigma_{11}^{-1} B_1,\; \Sigma_{22·1} ⊗ I_q \right),   (5)

where \Sigma_{22·1} = \Sigma_{22} − \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}. For inductive inference, we first find the MAP estimate of B_1 based on \arg\max_{B_1} \log p(Z_{11}|B_1)p(B_1) and then adopt the conditional expectation (or conditional mean) of B_2 given B_1 to estimate B_2, which is \Sigma_{21}\Sigma_{11}^{-1} B_1 based on (5).

Owing to the page limit, out-of-sample extension will not be evaluated via experiments in this paper. We will report these experiments in our future work.
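A minimal sketch of the update rule (4) and of the conditional-mean prediction (5) is given below, assuming Python with NumPy; the eigendecomposition-based initialization is only a stand-in for the KPCA initialization used in the paper, and the helper names are hypothetical.

```python
import numpy as np

def lwp_map(Z, K, q=20, beta=1000.0, lam=1e-4, gamma=0.01, T=10):
    """MAP estimation of the latent matrix B (n x q) by the parallel block
    quasi-Newton updates of eq. (4). Z: symmetric 0/1 link matrix with zero
    diagonal; K: attribute kernel matrix. The eigenvector-based initialization
    below stands in for the KPCA initialization used in the paper."""
    n = K.shape[0]
    Sigma_inv = np.linalg.inv(K + lam * np.eye(n)) / beta       # [sigma_ik]
    w, V = np.linalg.eigh(K + lam * np.eye(n))                  # ascending eigenvalues
    idx = np.argsort(w)[::-1][:q]
    B = V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))            # b_i(0)
    for _ in range(T):
        A = B @ B.T
        S = 1.0 / (1.0 + np.exp(-A / 2.0))                      # s_ij = sigmoid(a_ij / 2)
        B_new = np.empty_like(B)
        for i in range(n):
            mask = np.arange(n) != i
            coef = Z[i, mask] - S[i, mask] - Sigma_inv[i, mask]
            grad = coef @ B[mask] - Sigma_inv[i, i] * B[i]      # Fisher score dL/db_i
            w_ij = S[i, mask] * (1.0 - S[i, mask])
            H = 0.5 * (B[mask].T * w_ij) @ B[mask] + Sigma_inv[i, i] * np.eye(q)
            B_new[i] = B[i] + gamma * np.linalg.solve(H, grad)  # eq. (4)
        B = B_new                                               # parallel update of all b_i
    return B

def out_of_sample(B1, Sigma, n1):
    """Conditional-mean prediction of eq. (5): E[B2 | B1] = Sigma_21 Sigma_11^{-1} B1,
    where Sigma = beta * (K + lam * I) is built over training and test points."""
    S11, S21 = Sigma[:n1, :n1], Sigma[n1:, :n1]
    return S21 @ np.linalg.solve(S11, B1)
```

Each iteration solves n small q×q linear systems, so the per-iteration cost stays modest when q is small, which is the regime reported in Section 5.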
4 Related Work

The methods most closely related to our LWP model are RGP (Chu et al., 2007) and XGP (Silva et al., 2008). Both RGP and XGP also try to learn the covariance (kernel) matrix for Gaussian processes by exploiting the relationships between instances. Unlike LWP, which learns multiple (q) GPs, RGP and XGP only learn one GP. In fact, when q = 1, B
(n×1) in LWP degenerates to a single Gaussian process. Hence, our model can be regarded as a generalization of RGP and XGP. The key difference between them is that LWP treats A = BB' as the learned kernel matrix, which can be further used to train all kinds of kernel classifiers. In RGP and XGP, however, p(B|Z) is itself a prediction function, with B being a vector of function values for all input points. The learned kernel, which is the covariance matrix of the posterior distribution p(B|Z), is (K^{-1} + Π^{-1})^{-1} in RGP and (K + Π) in XGP, where Π is a kernel matrix capturing the link information. Since there is no closed-form solution for p(B|Z), RGP and XGP adopt different approximation strategies to compute the posterior covariance matrix.

Another related work is the stochastic relational model in (Yu and Chu, 2008; Yu et al., 2006), where A is modeled as a tensor GP rather than a WP. Since A in (Yu and Chu, 2008; Yu et al., 2006) is not guaranteed to be positive semi-definite, it cannot be regarded as a kernel matrix. Furthermore, the focus of (Yu and Chu, 2008; Yu et al., 2006) is on linkage prediction rather than instance classification as in our LWP model.

Our work has also been motivated by the latent space approaches in social network analysis (Hoff et al., 2002). Let d_{ij}^2 = ||b_i − b_j||^2 be the squared distance between b_i and b_j, and D = [d_{ij}^2] be the n×n distance matrix. Then −\frac{1}{2} EDE = EAE, where E is an n×n centering matrix. This reveals a connection between our model and the latent distance model in (Hoff, 2008). In addition, we also note that in the latent eigenmodel (Hoff, 2008) for symmetric relational data, Hoff (2008) defined A = UΛU', where U is an orthogonal matrix and Λ is a diagonal matrix whose diagonal elements can be negative. Thus, A does not play the role of a kernel matrix.

5 Experiments

We compare our LWP method with several related methods, such as the standard Gaussian process classifier (GPC) (Rasmussen and Williams, 2006), RGP and XGP, on three real-world data sets, WebKB (Craven et al., 1998), Cora (McCallum et al., 2000) and the political books data set, which are also used in (Silva et al., 2008). The centralized linear kernel K(x_i, x_j) = x_i'x_j (Chu et al., 2007) is used to define the covariance matrix K in the Wishart distribution defined in (2) for all these data sets. The λ in (2) is set to a small number, 10^{-4}. All the b_i are initialized to the feature vectors obtained by kernel principal component analysis (KPCA)¹ (Schölkopf et al., 1998) based on K + λI.

¹ The KPCA in our paper is actually PCA, because for text processing the linear kernel always outperforms other kernels. Calling it KPCA is just for the consistency of our algorithm, because during the derivation of our algorithm K can be any kind of kernel.

Although our method can be applied under the inductive setting, for fair comparison we run our method under the transductive setting because both RGP and XGP were only tested under this setting (Silva et al., 2008).² More specifically, for our LWP method, we first perform kernel learning based on all the points, including training and test points, and their links, but without using any label information. Then, based on the learned kernel, we learn a GP with the training points only and evaluate the learned model on the test points. Hence, the main difference between our method and the other methods is just in the kernel learning part.

² In fact, XGP can only work under the transductive setting.
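A hedged sketch of this transductive protocol is given below (assuming Python with NumPy and scikit-learn). The paper evaluates the learned kernel with a GP-based classifier; for a self-contained example, an SVM with a precomputed Gram matrix is substituted here, and lwp_map / linear_kernel refer to the hypothetical helpers sketched earlier.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

def transductive_auc(A, y, train_idx, test_idx):
    """Score a learned relational kernel A = B B' under the transductive protocol:
    A is learned from ALL points and their links (no labels); a classifier is then
    fit on the training block of A and evaluated on the test block. The paper plugs
    A into a GP-based classifier; an SVM with a precomputed kernel is used here only
    to keep the sketch self-contained."""
    clf = SVC(kernel="precomputed")
    clf.fit(A[np.ix_(train_idx, train_idx)], y[train_idx])
    scores = clf.decision_function(A[np.ix_(test_idx, train_idx)])
    return roc_auc_score(y[test_idx], scores)

# Hypothetical usage with the lwp_map / linear_kernel sketches given earlier:
#   B = lwp_map(Z, linear_kernel(X), q=20)   # kernel learning on all points, no labels
#   auc = transductive_auc(B @ B.T, y, train_idx, test_idx)
```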
5.1 Sensitivity to Parameters

There are four parameters in total which affect the training of LWP: the dimensionality of the latent space q, the β in (2), the γ in (4), and the number of iterations (T) used to optimize L(B).

We find that the performance is very stable when 1000 ≤ β ≤ 10000. Hence, in all the following experiments, β = 1000.

We first study the effect of γ and T. Here we use the Texas subset of the WebKB data set to illustrate this. The description of this subset is given in Section 5.3. We find that when γ ≤ 0.01, our algorithm is very stable. The performance, measured in area under the ROC curve (AUC), and the objective function against the change in T are illustrated in Figure 1, in which the X-axis denotes T, "AUC01" is the performance with γ = 0.01, "AUC001" is the performance with γ = 0.001, "obj01" is the objective function value with γ = 0.01, and "obj001" is the objective function value with γ = 0.001. Note that the objective function values in the figure are transformed to L(B)/10^5 + 4 for convenience of demonstration. We can see that the long-term trend of the objective function is increasing, and the performance of our algorithm is very stable. For γ = 0.01, 10 iterations are enough to give good performance.

We also test the sensitivity of our method to the change in q on the Texas subset and the political books data set. Figure 2 shows the average AUC with standard deviation over 100 trials of our method when q is set to different values. Note that we use KPCA to initialize the b_i. Hence, KPCA actually refers to a GPC with the kernel matrix computed based on the initial values of the b_i, i.e., b_i(0), in LWP. This means KPCA corresponds to the case where the iteration number in LWP is set to 0.
[Figure 1: The performance (AUC) and the objective function against the change in the number of iterations.]

[Figure 2: Performance of LWP against the change in degree of freedom q in the Wishart process (i.e., dimensionality of the latent space).]

From Figure 2, we can see that LWP is very robust to the change in q. Furthermore, by comparing LWP with KPCA³, it is not difficult to see that the learning method in LWP is very effective. Note that the performance of GPC on the Texas subset is 0.799 ± 0.021, and that on the political books data set is 0.92.

³ The comparison between LWP and KPCA is just used to demonstrate that the good performance of LWP comes not from a good initial value but from the learning process. We denote the initial values as KPCA just because we use KPCA for initialization.

We can see that LWP is robust to parameter changes. In all our following experiments, unless otherwise stated, we just set β = 1000, γ = 0.01, T = 10, q = 20.

5.2 Visualization

The learned b_i can be treated as a feature representation in a latent space, which can be utilized for data visualization. Here we use the Texas subset of the WebKB data set to illustrate this. We use the whole data set (827 examples with their links) without any label information to perform kernel learning with our LWP model. For the sake of visualization, q is set to 2. After learning, 100 positive and 100 negative examples are randomly chosen for visualization. The results are shown in Figure 3, where KPCA refers to the initial values of the b_i, i.e., b_i(0). We can see that different classes are well separated in the latent space learned by LWP, although the initialization by KPCA is very poor. Good classification performance can be expected when the examples are classified in this latent space, which will be verified by our subsequent experiments. Furthermore, because no label information is used to learn this feature representation, good clustering performance can also be expected in this learned space.

[Figure 3: Visualization of data points in the transformed space by KPCA and in the latent space learned by LWP. 100 positive (red cross) and 100 negative (blue circle) examples are shown.]

5.3 Performance on WebKB Data Set

A subset of the WebKB data set (Craven et al., 1998), which was collected from the web sites of the computer science departments of four universities, is used to test our method. Each webpage is labeled with one out of seven categories: student, professor, course, project, staff, department, and "other". The original data set contains 4,160 pages and 9,998 hyperlinks. We adopt the same strategy as that in (Chu et al., 2007; Silva et al., 2008) to translate the hyperlinks into 66,249 undirected linkages over the pages by assuming that two pages are likely to be positively correlated if they are hyperlinked by the same hub page. Each webpage is represented as a bag-of-words, a vector of "term frequency" components scaled by the "inverse document frequency", and then normalized to unit length.
This preprocessing step is the same as that in (Chu et al., 2007). The task is to classify each webpage into two classes: “other” and “non-other”. The same performance measure, area under the ROC curve (AUC), and the same evaluation strategy as those in (Silva et al., 2008) are adopted for all the methods, i.e., for a specific university, the same 100 subsamples as those in (Chu et al., 2007; Silva et al., 2008) are used, in each of which 10% of the data points are randomly chosen for training and the rest for testing. The average AUC with standard deviation over 100 trials is reported in Table 1, from which we can find that
LWP achieves performance at least comparable with the state of the art for all four universities. In particular, compared with GPC and RGP, LWP achieves far better results.

5.4 Performance on Cora Data Set

A subset of the Cora corpus (McCallum et al., 2000) including 4,285 machine learning papers with their bibliographic citations is used for this experiment. All the papers in this subset are partitioned into 7 classes, the information about which is shown in the second column of Table 2. We adopt the same feature representation scheme as that in (Silva et al., 2008), where each paper is represented by a feature vector of dimensionality 20,082. For this data set, very good performance can be achieved even if q is set to a value as small as 1. Hence, we just set q to 1 for this data set. For each class, 1% of the whole set is randomly selected for training and the rest for testing. One hundred rounds of such partitioning are repeated. The average AUC with standard deviation is reported in Table 2, in which "GPC with Citation" is the method proposed in (Silva et al., 2008) that adds the citation adjacency matrix as a binary input feature for each paper. From the table, we can see that LWP is far better than the related methods. In (Silva et al., 2008), the authors state that the AUC of RGP on this data set is close to 1, so on this data set LWP is also comparable with the state of the art.

5.5 Performance on Political Books Data Set

This data set contains 105 books, 43 of which are labeled as liberal ones. Pairs of books that are frequently bought together by the same customer are used to represent the relationships between them. The task is to decide whether a specific book is of liberal political inclination or not. The words on the Amazon.com front page for a book are used as features to represent the book. Each book is represented as a bag-of-words and is preprocessed by the same techniques as for the WebKB data set. After preprocessing, each book is represented by a feature vector of length 13,178. The original data is available at http://www-personal.umich.edu/mejn/netdata, and the preprocessed data can be downloaded from http://www.statslab.cam.ac.uk/~silva. We randomly choose half of the whole data for training and the rest for testing. This subsampling process is repeated for 100 rounds, and the average AUC with its standard deviation is reported in Table 3⁴, from which we can see that LWP is comparable with the state of the art. Here, KPCA also refers to the method using the initial values of the b_i in LWP to perform classification.

⁴ The results for GPC, RGP and XGP are taken from (Silva et al., 2008), in which the standard deviation is not reported.

Table 3: Experiment on the data set of political books. Results for GPC, RGP and XGP are taken from (Silva et al., 2008).

  GPC    RGP    XGP    KPCA           LWP
  0.92   0.98   0.98   0.93 ± 0.03    0.98 ± 0.02

5.6 Discussions

From the above experiments, we can see that LWP achieves very promising performance on all the data sets tested, while RGP and XGP only perform well on some of them. More specifically, RGP performs quite well on the Cora data set but relatively badly on the WebKB data set. On the other hand, XGP achieves unsatisfactory performance on the Cora data set. This implies that LWP is much more robust than RGP and XGP.
5.6 Discussions

From the above experiments, we can see that LWP achieves very promising performance on all the data sets tested, while RGP and XGP only perform well on some of them. More specifically, RGP performs quite well on the Cora data set but relatively poorly on the WebKB data set, while XGP achieves unsatisfactory performance on the Cora data set. This implies that LWP is much more robust than RGP and XGP. In particular, compared with GPC, which simply discards the relational information in the data, our method achieves far better results. This implies that the relational information is indeed very informative and that LWP can exploit it very effectively.

6 Concluding Remarks

Relational information is very useful for specifying the similarity between different instances. We have presented a very effective LWP model that performs kernel learning by seamlessly integrating relational information with the input attributes. Besides its promising performance for instance classification, LWP can also be used for data visualization and clustering.

In future work, we will focus on an extensive empirical comparison of LWP with other related methods with respect to inductive inference. It is also desirable to apply our model to social network analysis.

Acknowledgements

We thank Ricardo Silva for sharing the data and code of XGP and for his helpful assistance with our experiments. We also thank Wei Chu and Vikas Sindhwani for the preprocessed Cora data set. This research has been supported by General Research Fund 621407 from the Research Grants Council of the Hong Kong Special Administrative Region, China.
Table 1: Mean and standard deviation of AUC over 100 rounds of test on the WebKB data set. Results for other related methods are taken from (Silva et al., 2008). All the methods are based on the same data partitions for both training and testing. #Other and #All refer to the numbers of positive examples and all examples, respectively. #Links is the number of linkages in the corresponding data set.
University    #Other/#All/#Links    GPC              RGP              XGP              LWP
Cornell       617/865/13177         0.708 ± 0.021    0.884 ± 0.025    0.917 ± 0.022    0.932 ± 0.019
Texas         571/827/16090         0.799 ± 0.021    0.906 ± 0.026    0.949 ± 0.015    0.960 ± 0.009
Washington    939/1205/15388        0.782 ± 0.023    0.877 ± 0.024    0.923 ± 0.016    0.935 ± 0.010
Wisconsin     942/1263/21594        0.839 ± 0.014    0.899 ± 0.015    0.941 ± 0.018    0.940 ± 0.012

Table 2: Mean and standard deviation of AUC over 100 rounds of test on the Cora data set. Results for other related methods are taken from (Silva et al., 2008). All the methods are based on the same data partitions for both training and testing. #Pos and #Neg refer to the numbers of positive examples and negative examples, respectively. #Citations is the number of linkages in the corresponding data set.
Group    #Pos/#Neg/#Citations    GPC              GPC with Citation    XGP              LWP
5vs1     346/488/2466            0.905 ± 0.031    0.891 ± 0.022        0.945 ± 0.053    0.990 ± 0.000
5vs2     346/619/3417            0.900 ± 0.032    0.905 ± 0.044        0.933 ± 0.059    0.991 ± 0.001
5vs3     346/1376/3905           0.863 ± 0.040    0.893 ± 0.017        0.883 ± 0.013    0.986 ± 0.001
5vs4     346/646/2858            0.916 ± 0.030    0.887 ± 0.018        0.951 ± 0.042    0.997 ± 0.000
5vs6     346/281/1968            0.887 ± 0.054    0.843 ± 0.076        0.955 ± 0.041    0.998 ± 0.000
5vs7     346/529/2948            0.869 ± 0.045    0.867 ± 0.041        0.926 ± 0.076    0.992 ± 0.002

References

Chu, W., V. Sindhwani, Z. Ghahramani, and S. S. Keerthi (2007). Relational learning with Gaussian processes. In Advances in Neural Information Processing Systems 19, pp. 289–296.
Craven, M., D. DiPasquo, D. Freitag, A. McCallum, T. M. Mitchell, K. Nigam, and S. Slattery (1998). Learning to extract symbolic knowledge from the world wide web. In AAAI/IAAI, pp. 509–516.
Getoor, L., N. Friedman, D. Koller, and B. Taskar (2002). Learning probabilistic models of link structure. Journal of Machine Learning Research 3, 679–707.
Getoor, L. and B. Taskar (2007). Introduction to Statistical Relational Learning. The MIT Press.
Gupta, A. K. and D. K. Nagar (2000). Matrix Variate Distributions. Chapman & Hall/CRC.
Hoff, P. D. (2008). Modeling homophily and stochastic equivalence in symmetric relational data. In J. Platt, D. Koller, Y. Singer, and S. Roweis (Eds.), Advances in Neural Information Processing Systems 20. Cambridge, MA: MIT Press.
Hoff, P. D., A. E. Raftery, and M. S. Handcock (2002). Latent space approaches to social network analysis. Journal of the American Statistical Association 97(460), 1090–1098.
Lanckriet, G. R. G., N. Cristianini, P. L. Bartlett, L. E. Ghaoui, and M. I. Jordan (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research 5, 27–72.
Macskassy, S. A. and F. J. Provost (2007). Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8, 935–983.
McCallum, A., K. Nigam, J. Rennie, and K. Seymore (2000). Automating the construction of internet portals with machine learning. Inf. Retr. 3(2), 127–163.
Muirhead, R. J. (1982). Aspects of Multivariate Statistical Theory. New York: John Wiley and Sons.
Neville, J. and D. Jensen (2007). Relational dependency networks. Journal of Machine Learning Research 8, 653–692.
Rasmussen, C. E. and C. K. I. Williams (2006). Gaussian Processes for Machine Learning. The MIT Press.
Schölkopf, B., A. J. Smola, and K.-R. Müller (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10(5), 1299–1319.
Silva, R., W. Chu, and Z. Ghahramani (2008). Hidden common cause relations in relational learning. In J. Platt, D. Koller, Y. Singer, and S. Roweis (Eds.), Advances in Neural Information Processing Systems 20, pp. 1345–1352. Cambridge, MA: MIT Press.
Srivastava, M. S. (2003). Singular Wishart and multivariate Beta distributions. The Annals of Statistics 31(5), 1537–1560.
Taskar, B., P. Abbeel, and D. Koller (2002). Discriminative probabilistic models for relational data. In Proceedings of the 18th Conference in Uncertainty in Artificial Intelligence, pp. 485–492.
Yu, K. and W. Chu (2008). Gaussian process models for link analysis and transfer learning. In J. Platt, D. Koller, Y. Singer, and S. Roweis (Eds.), Advances in Neural Information Processing Systems 20, pp. 1657–1664. Cambridge, MA: MIT Press.
Yu, K., W. Chu, S. Yu, V. Tresp, and Z. Xu (2006). Stochastic relational models for discriminative link prediction. In Advances in Neural Information Processing Systems 19, pp. 1553–1560.
Zhang, Z., J. T. Kwok, and D.-Y. Yeung (2006). Model-based transductive learning of the kernel matrix. Machine Learning 63(1), 69–101.