Probabilistic Relational PCA

Wu-Jun Li, Dit-Yan Yeung
Dept. of Comp. Sci. and Eng.
Hong Kong University of Science and Technology
Hong Kong, China
{liwujun,dyyeung}@cse.ust.hk

Zhihua Zhang
School of Comp. Sci. and Tech.
Zhejiang University
Zhejiang 310027, China
zhzhang@cs.zju.edu.cn

Abstract

One crucial assumption made by both principal component analysis (PCA) and probabilistic PCA (PPCA) is that the instances are independent and identically distributed (i.i.d.). However, this common i.i.d. assumption is unreasonable for relational data. In this paper, by explicitly modeling the covariance between instances derived from the relational information, we propose a novel probabilistic dimensionality reduction method, called probabilistic relational PCA (PRPCA), for relational data analysis. Although the i.i.d. assumption is no longer adopted in PRPCA, the learning algorithms for PRPCA can still be devised as easily as those for PPCA, which makes explicit use of the i.i.d. assumption. Experiments on real-world data sets show that PRPCA can effectively utilize the relational information to dramatically outperform PCA and achieve state-of-the-art performance.

1 Introduction

Using a low-dimensional embedding to summarize a high-dimensional data set is a widely used strategy for exploring the structure in the data. Methods for discovering such low-dimensional embeddings are often referred to as dimensionality reduction (DR) methods. Principal component analysis (PCA) [13] is one of the most popular DR methods, with great success in many applications.

As a more recent development, probabilistic PCA (PPCA) [21] provides a probabilistic formulation of PCA [13] based on a Gaussian latent variable model [1]. Compared with the original non-probabilistic derivation of PCA in [12], PPCA possesses a number of practical advantages. For example, PPCA can naturally deal with missing values in the data; the expectation-maximization (EM) algorithm [9] used to learn the parameters in PPCA may be more efficient for high-dimensional data; it is easy to generalize the single model in PPCA to the mixture model case; furthermore, PPCA as a probabilistic model can naturally exploit Bayesian methods [2].
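To make the PPCA formulation referenced above concrete, the following is a minimal sketch of the Gaussian latent variable model behind PPCA and its closed-form maximum-likelihood solution as given by Tipping and Bishop [21]. The function and variable names (fit_ppca, ppca_embed, q) are illustrative and not taken from this paper.

```python
import numpy as np

def fit_ppca(X, q):
    """Maximum-likelihood PPCA: x = W z + mu + eps, with z ~ N(0, I_q) and eps ~ N(0, sigma2 * I_d).

    X: (n, d) data matrix whose rows are treated as i.i.d. instances
       (this is exactly the assumption that PRPCA later relaxes).
    q: latent dimensionality, q < d.
    Returns the ML estimates of mu, W, and sigma2.
    """
    n, d = X.shape
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)            # ML sample covariance of the instances

    # Eigendecomposition of S, eigenvalues sorted in decreasing order.
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # sigma2_ML: average variance in the d - q discarded directions.
    sigma2 = eigvals[q:].mean()

    # W_ML = U_q (Lambda_q - sigma2 I)^{1/2}, up to an arbitrary rotation.
    W = eigvecs[:, :q] * np.sqrt(np.maximum(eigvals[:q] - sigma2, 0.0))
    return mu, W, sigma2

def ppca_embed(X, mu, W, sigma2):
    """Low-dimensional embedding via the posterior mean E[z | x] = M^{-1} W^T (x - mu),
    where M = W^T W + sigma2 * I_q."""
    M = W.T @ W + sigma2 * np.eye(W.shape[1])
    return np.linalg.solve(M, W.T @ (X - mu).T).T     # shape (n, q)
```

In this generative view the rows of X are i.i.d. samples from a single Gaussian; PRPCA, as described in the abstract, departs from it by letting the relational (link) structure induce covariance between instances rather than assuming they are independent.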
Like many existing DR methods, both PCA and PPCA are based on some assumptions about the data. One assumption is that the data should be represented as feature vectors, all of the same dimensionality. Data represented in this form are sometimes referred to as flat data [10]. Another one is the so-called i.i.d. assumption, which means that the instances are assumed to be independent and identically distributed (i.i.d.).

However, the data in many real-world applications, such as web pages and research papers, contain relations or links between (some) instances in addition to the textual content information, which is represented in the form of feature vectors. Data of this sort, referred to as relational data¹ [10, 20], can be found in such diverse application areas as web mining [3, 17, 23, 24], bioinformatics [22], social network analysis [4], and so on. On one hand, the link structure among instances

¹ In this paper, we use document classification as a running example for relational data analysis. Hence, for convenience of illustration, the specific term 'textual content information' is used in the paper to refer to the feature vectors describing the instances. However, the algorithms derived in this paper can be applied to any relational data in which the instance feature vectors can represent any attribute information.