Latent Wishart Processes for Relational Kernel Learning

Wu-Jun Li
Department of Computer Science and Engineering
Hong Kong University of Science and Technology
Hong Kong, China

Joint work with Zhihua Zhang and Dit-Yan Yeung

Li, Zhang and Yeung (CSE, HKUST), LWP, AISTATS 2009
Contents

1 Introduction
2 Preliminaries
   - Gaussian Processes
   - Wishart Processes
3 Latent Wishart Processes
   - Model Formulation
   - Learning
   - Out-of-Sample Extension
4 Relation to Existing Work
5 Experiments
6 Conclusion and Future Work
Introduction
Relational Learning

- Traditional machine learning models:
   - Assumption: i.i.d.
   - Advantage: simple
- Many real-world applications:
   - Relational: instances are related (linked) to each other
   - Autocorrelation: statistical dependency between the values of a random variable on related objects (non-i.i.d.)
   - E.g., web pages, protein-protein interaction data
- Relational learning: an emerging research area attempting to represent, reason, and learn in domains with complex relational structure [Getoor & Taskar, 2007].
- Application areas: web mining, social network analysis, bioinformatics, marketing, etc.
Introduction
Relational Kernel Learning

- Kernel function:
   - Characterizes the similarity between data instances: K(xi, xj)
   - e.g., K(cat, tiger) > K(cat, elephant)
   - Positive semidefiniteness (p.s.d.)
- Kernel learning: learning an appropriate kernel matrix or kernel function for a kernel-based learning method.
- Relational kernel learning (RKL): learning an appropriate kernel matrix or kernel function for relational data by incorporating the relational information between instances into the learning process.
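As an illustration of the p.s.d. requirement (a sketch of ours, not part of the slides), a linear kernel matrix K = XX' built from arbitrary feature vectors is always positive semidefinite, which can be confirmed from its eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))   # 5 instances, each with 3 features
K = X @ X.T                       # linear kernel: K[i, j] = xi' xj

# A valid kernel matrix is symmetric with non-negative eigenvalues
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)    # True (up to floating-point noise)
```

Any symmetric matrix with a negative eigenvalue would fail this check and could not serve as a kernel matrix.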
Preliminaries: Gaussian Processes
Stochastic Processes and Gaussian Processes

- Stochastic processes: a stochastic process (or random process) y(x) is specified by giving the joint distribution of its values at any finite set of instances {x1, . . . , xn} in a consistent manner.
- Gaussian processes: a Gaussian process is a distribution over functions y(x) such that the values of y(x) evaluated at an arbitrary set of points {x1, . . . , xn} jointly have a Gaussian distribution.
- Assuming y(x) has zero mean, the specification of a Gaussian process is completed by giving the covariance of y(x) evaluated at any two values of x, given by the kernel function K(·, ·):

      E[y(xi) y(xj)] = K(xi, xj).
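The finite-dimensional view above can be sketched numerically: sampling y at n points amounts to drawing one vector from N(0, K), where K is the kernel matrix at those points. The RBF kernel and all constants below are illustrative choices of ours, not taken from the slides:

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0):
    """Squared-exponential kernel: K[i, j] = exp(-||xi - xj||^2 / (2 l^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * lengthscale**2))

rng = np.random.default_rng(0)
X = np.linspace(0, 5, 50).reshape(-1, 1)          # 50 input points x1, ..., xn
K = rbf_kernel(X) + 1e-10 * np.eye(len(X))        # jitter for numerical p.s.d.
y = rng.multivariate_normal(np.zeros(len(X)), K)  # one GP sample evaluated at X
print(y.shape)                                    # (50,)
```

Each call to `multivariate_normal` yields one draw of the function evaluated at X; consistency across different finite point sets is what makes the collection a process.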
Preliminaries: Wishart Processes
Wishart Processes

- Wishart distribution: an n × n random symmetric positive definite matrix A is said to have a Wishart distribution with parameters n, q, and n × n scale matrix Σ ≻ 0, written as A ~ W_n(q, Σ), if its p.d.f. is given by

      |A|^{(q−n−1)/2} / ( 2^{qn/2} Γ_n(q/2) |Σ|^{q/2} ) · exp( −tr(Σ^{−1}A)/2 ),   q ≥ n.

  Here Σ ≻ 0 means that Σ is positive definite (p.d.).
- Wishart processes: given an input space X = {x1, x2, . . .}, the kernel function {A(xi, xj) | xi, xj ∈ X} is said to be a Wishart process (WP) if for any n ∈ N and {x1, . . . , xn} ⊆ X, the n × n random matrix A = [A(xi, xj)]_{i,j=1}^n follows a Wishart distribution.
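A draw from this density can be sketched with `scipy.stats.wishart`, whose parameterization (df = q, scale = Σ) matches the p.d.f. above; the dimensions here are our own illustrative choices:

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(0)
n, q = 4, 6                      # matrix size n, degrees of freedom q >= n
Sigma = np.eye(n)                # scale matrix (identity, for illustration)
A = wishart(df=q, scale=Sigma).rvs(random_state=rng)

# With q >= n, A is symmetric positive definite with probability one
eigvals = np.linalg.eigvalsh(A)
print(eigvals.min() > 0)         # True
```

Each sample is an n × n p.s.d. matrix, which is exactly what the Wishart process supplies at any finite set of input points.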
Preliminaries: Wishart Processes
Relationship between GP and WP

- For any kernel function A : X × X → R, there exists a function B : X → F such that A(xi, xj) = B(xi)′B(xj), where X is the input space and F ⊂ R^q is some latent (feature) space (in general the feature space may also be infinite-dimensional).
- Our previous result: A(xi, xj) is a Wishart process iff {B_k(x)}_{k=1}^q are q mutually independent Gaussian processes.
- Let A = [A(xi, xj)]_{i,j=1}^n and B = [B(x1), . . . , B(xn)]′ = [b1, . . . , bn]′. Then the bi are the latent vectors, and A = BB′ is a linear kernel in the latent space but a nonlinear kernel w.r.t. the input space.

Theorem. Let Σ be an n × n positive definite matrix. Then A is distributed according to the Wishart distribution W_n(q, Σ) if and only if B is distributed according to the (matrix-variate) Gaussian distribution N_{n,q}(0, Σ ⊗ I_q).
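The theorem can be checked by Monte Carlo: drawing B ~ N_{n,q}(0, Σ ⊗ I_q), i.e. q i.i.d. columns each distributed as N_n(0, Σ), and forming A = BB′ should reproduce the Wishart mean E[A] = qΣ. This sketch and its constants are ours, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 3, 5
L = np.array([[2.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.3, 0.2, 1.0]])
Sigma = L @ L.T                            # an arbitrary p.d. scale matrix

def sample_A():
    # B ~ N_{n,q}(0, Sigma ⊗ I_q): q i.i.d. columns, each N_n(0, Sigma)
    B = L @ rng.standard_normal((n, q))
    return B @ B.T                         # then A = BB' ~ W_n(q, Sigma)

# E[A] = q * Sigma for a Wishart distribution; check by averaging draws
mean_A = np.mean([sample_A() for _ in range(20000)], axis=0)
print(np.max(np.abs(mean_A - q * Sigma))) # small; shrinks with more samples
```

The same construction is what makes the latent vectors bi (the rows of B) the objects actually learned in the LWP model.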
Preliminaries: Wishart Processes
GP and WP in a Nutshell

- Gaussian distribution: each sampled instance is a finite-dimensional vector, v = (v1, . . . , vd)′.
- Wishart distribution: each sampled instance is a finite-dimensional p.s.d. matrix, M ⪰ 0.
- Gaussian process: each sampled instance is an infinite-dimensional function, f(·).
- Wishart process: each sampled instance is an infinite-dimensional p.s.d. function, g(·, ·).
Latent Wishart Processes: Model Formulation
Relational Data

{(xi, yi, zik) | i, k = 1, . . . , n}

(Figure: a small graph of four instances (xi, yi), (xj, yj), (xk, yk), (xl, yl) in which zij = 1, zik = 1, and zil = 0.)

- xi = (xi1, . . . , xip)′: input feature vector for instance i
- yi: label for instance i
- zik = 1 if there exists a link between xi and xk; 0 otherwise.
- zik = zki and zii = 0. Z = [zik]_{i,k=1}^n.
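The link matrix Z for a graph like the one on this slide can be assembled as follows (the 0-based indices and link list are our own relabeling of the instances i, j, k, l):

```python
import numpy as np

n = 4
links = [(0, 1), (0, 2)]          # observed links; the pair (0, 3) is absent

Z = np.zeros((n, n), dtype=int)
for i, k in links:
    Z[i, k] = Z[k, i] = 1         # links are undirected: zik = zki
np.fill_diagonal(Z, 0)            # no self-links: zii = 0

print(Z)
```

The resulting Z is symmetric with a zero diagonal, matching the constraints zik = zki and zii = 0 stated above.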
Latent Wishart Processes: Model Formulation
LWP Model

- Goal: to learn a target kernel function A(xi, xk) that takes both the input attributes and the relational information into consideration.
- LWP:
   - Let aik = A(xi, xk). Then A = [aik]_{i,k=1}^n is a latent p.s.d. matrix.
   - We model A by a Wishart distribution W_n(q, Σ), which implies that A(xi, xk) follows a Wishart process.