We use three data sets to evaluate PRPCA. The first two data sets are Cora [16] and WebKB [8]. We adopt the same strategy as that in [26] to preprocess these two data sets.
The third data set is the PoliticalBook data set used in [19]. For WebKB, according to the semantics of authoritative pages and hub pages [25], we first preprocess the link structure of this data set as follows: if two web pages are co-linked by or link to another common web page, we add a link between these two pages. Then all the original links are removed. After preprocessing, all the directed links have thus been converted into undirected links.

The Cora data set contains four subsets: DS, HA, ML and PL. The WebKB data set also contains four subsets: Cornell, Texas, Washington and Wisconsin. We adopt the same strategy as that in [26] to evaluate PRPCA on the Cora and WebKB data sets. For the PoliticalBook data set, we use the testing procedure of the latent Wishart process (LWP) model [15] for evaluation.

5.2 Convergence Speed of EM

We use the DS and Cornell data sets to illustrate the convergence speed of the EM learning procedure of PRPCA. The performance on the other data sets has similar characteristics, so it is omitted here. With q = 50, the average classification accuracy based on 5-fold cross validation against the number of EM iterations T is shown in Figure 2. We can see that PRPCA achieves very promising and stable performance after a very small number of iterations. We set T = 5 in all the following experiments.

5.3 Visualization

We use the PoliticalBook data set to visualize the DR results of PCA and PRPCA. For the sake of visualization, q is set to 2. The results are depicted in Figure 3. We can see that it is not easy to separate the two classes in the latent space of PCA, whereas the two classes are much better separated from each other in the latent space of PRPCA. Hence, better clustering or classification performance can be expected when the examples are clustered or classified in the latent space of PRPCA.

Figure 2: Convergence speed of the EM learning procedure of PRPCA (classification accuracy against T on the DS and Cornell data sets).
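The WebKB link preprocessing described in the setup above (adding a link between two pages that are co-linked by, or link to, a common page, then discarding the original directed links) amounts to a simple transform on the adjacency matrix. A minimal sketch, assuming a dense 0/1 adjacency matrix with `A[i, j] = 1` for a directed link from page i to page j; the function name and matrix representation are our own illustrative choices, not from the paper:

```python
import numpy as np

def colink_transform(A):
    """Turn a directed link matrix into the undirected links used for WebKB.

    A[i, j] = 1 means page i links to page j.  Two pages get an undirected
    link iff some common page links to both of them (co-linked), or they
    both link to some common page.  All original directed links are dropped.
    """
    A = (np.asarray(A) != 0).astype(int)
    co_linked_by = A.T @ A   # [i, j] > 0 iff some page k links to both i and j
    co_link_to = A @ A.T     # [i, j] > 0 iff i and j both link to a common page
    B = ((co_linked_by + co_link_to) > 0).astype(int)
    np.fill_diagonal(B, 0)   # no self-links
    return B
```

The result is symmetric by construction, which matches the statement that all directed links are converted into undirected ones.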
Figure 3: Visualization of data points in the latent spaces of PCA and PRPCA for the PoliticalBook data set. The positive and negative examples are shown as red crosses and blue circles, respectively.

5.4 Performance

The dimensionality of Cora and WebKB is moderately high, but the dimensionality of PoliticalBook is very high. We evaluate PRPCA on these two different kinds of data to verify its effectiveness in general settings.

Performance on Cora and WebKB. The average classification accuracy with its standard deviation based on 5-fold cross validation against the dimensionality of the latent space q is shown in Figure 4. We can see that PRPCA dramatically outperforms PCA on all the data sets under any dimensionality, which confirms that the relational information is very informative and PRPCA can utilize it very effectively.

We also compare PRPCA with the methods evaluated in [26]: SVM on content, which ignores the link structure in the data and applies SVM only to the content information in the original bag-of-words representation; SVM on links, which ignores the content information and treats the links as features, i.e., the i-th feature is link-to-page i; SVM on link-content, in which the content features and link features of the two methods above are combined to give the feature representation; directed graph regularization (DGR), which is introduced in [25]; PLSI+PHITS, which is described in [7]; and link-content MF, which is the joint link-content matrix factorization (MF) method in [26]. Note that link-content sup. MF in [26] is not adopted here for comparison, because during the DR procedure link-content sup. MF employs additional label information.
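The "SVM on links" baseline above uses each page's outgoing-link vector as its feature representation: the i-th feature of a page indicates whether it links to page i, i.e. the rows of the adjacency matrix are fed directly to the classifier. A hedged sketch with scikit-learn; the synthetic two-class link graph here is an illustrative assumption, not the paper's Cora or WebKB data:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic stand-in for a link graph: 40 pages in two classes that mostly
# link within their own class (illustrative only; not the paper's data).
n = 40
labels = np.array([0] * (n // 2) + [1] * (n // 2))
same_class = labels[:, None] == labels[None, :]
A = (rng.random((n, n)) < np.where(same_class, 0.5, 0.05)).astype(int)
np.fill_diagonal(A, 0)

# "SVM on links": the i-th feature of a page is link-to-page_i, i.e. each
# row of the adjacency matrix is used directly as the feature vector.
clf = LinearSVC(C=1.0, random_state=0, max_iter=10000).fit(A, labels)
train_acc = clf.score(A, labels)
```

Because this baseline discards the content information entirely, it serves only as a reference point for how much signal the raw link structure carries on its own.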