Time Complexity

To train the model, we need O(dN) time to compute H and O(Tqd² + Tdq³) for the T EM iterations. Hence, the total time complexity is O(dN + Tqd² + Tdq³). If d > q², Tqd² will be larger than Tdq³ and so the time complexity will become O(dN + Tqd²), which is equal to that of PRPCA. If we want to use the learned W to perform projection, the time complexity will depend on the number of nonzero entries in W. Generally speaking, SPRP has a lower projection cost than PRPCA because the W in SPRP is sparser than that in PRPCA.

Experimental Evaluation

As in (Li, Yeung, and Zhang 2009), we adopt PCA to initialize W, initialize σ² to 10⁻⁶, and set γ to 10⁻⁶. We set the number of EM iterations T to 30 because 30 iterations are sufficient for both PRPCA and SPRP to achieve good performance. The baseline methods for comparison include PCA, sparse probabilistic projection (SPP) (Archambeau and Bach 2008), and PRPCA. Through the experiments we want to verify the following claims: (1) PCA cannot effectively exploit the relational information in relational data; furthermore, it cannot learn interpretable results. (2) Due to its i.i.d. assumption, SPP cannot achieve satisfactory performance even though it can learn interpretable results. (3) PRPCA can effectively exploit the relational information, but it cannot learn interpretable results.
(4) SPRP not only can effectively exploit the relational information, but it can also learn interpretable results.

Data Sets

Three data sets are used for our experimental evaluation. The first two are the preprocessed WebKB (Craven et al. 1998) and Cora (McCallum et al. 2000) data sets used in (Zhu et al. 2007; Li, Yeung, and Zhang 2009). The third data set is called Cora-IR, which contains the information retrieval papers from the original Cora data set (McCallum et al. 2000). All these data sets use the bag-of-words representation for the content information.

The WebKB data set contains 4,017 web pages from the computer science departments of four universities (Cornell, Texas, Washington, and Wisconsin). Each web page is labeled with one of seven categories: student, professor, course, project, staff, department, and "other". The original links are directed. We adopt the same strategy as that in (Li, Yeung, and Zhang 2009) to convert the directed links into undirected ones. Some characteristics of the WebKB data set are summarized in Table 1.

Table 1: Characteristics of the WebKB data set.

Data Set    #classes  #instances  #words
Cornell     7         827         4,134
Texas       7         814         4,029
Washington  7         1,166       4,165
Wisconsin   6         1,210       4,189

The Cora data set used in (Li, Yeung, and Zhang 2009) contains 4,343 research papers from the computer science community. The content information refers to the paper abstracts and the links refer to the citations. The task is to classify each paper into one of the subfields of data structure (DS), hardware and architecture (HA), machine learning (ML), and programming language (PL). Some characteristics of the Cora data set are summarized in Table 2.

Table 2: Characteristics of the Cora data set.
Data Set  #classes  #instances  #words
DS        9         751         6,234
HA        7         400         3,989
ML        7         1,617       8,329
PL        9         1,575       7,949

Because we do not have the dictionary for generating the bag-of-words representation in the preprocessed WebKB and Cora data sets, we collect another data set, called Cora-IR. The Cora-IR data set contains 350 information retrieval papers from the original Cora data set (McCallum et al. 2000). There are four subfields (classes) in Cora-IR: retrieval, filtering, extraction, and digital library. We use the title of each paper for the content information. After preprocessing, we get a dictionary of 787 words. For each word, there is at least one instance (paper) containing it. We will use this dictionary to demonstrate the interpretability of SPRP.

In (Li, Yeung, and Zhang 2009), only information about the words (bag-of-words) is used to represent the content information. We expand the original content features by adding some extra features extracted from the original directed links. The ith link feature is referred to as link-to-instance-i. For example, if instance k links to instance i, the ith link feature of instance k will be 1; otherwise, it will be 0. In fact, this kind of link feature can also be treated as content information. For example, given a paper, the link-to-instance-i feature actually reflects whether the reference part of that paper contains paper i. For a web page, the link-to-instance-i feature can also be directly extracted from the HTML file (content) of that page. Note that it is somewhat impractical to treat the linked-by-instance-i features as content features because they cannot be directly extracted from the content of the instances. For example, the papers citing a specific paper i are not included in the content of paper i. After extracting the link features, we combine the original bag-of-words with the link features to obtain the expanded content features.
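The link-feature construction described above can be sketched as follows. This is a minimal illustration under our own assumptions about data layout; the function and variable names are ours, not from the paper's implementation.

```python
import numpy as np

def expand_content_features(X, links):
    """Append link-to-instance features to a bag-of-words matrix.

    X: (N, d) bag-of-words matrix, one row per instance.
    links: iterable of (k, i) pairs meaning instance k links to instance i.
    Returns an (N, d + N) matrix of expanded content features.
    """
    N = X.shape[0]
    # link feature i of instance k is 1 iff k links to i, else 0
    L = np.zeros((N, N), dtype=X.dtype)
    for k, i in links:
        L[k, i] = 1
    # concatenate the original bag-of-words with the link features
    return np.hstack([X, L])
```

Note that the resulting N extra columns are treated exactly like word features, which is why this expansion still carries the i.i.d. assumption discussed next.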
We can see that this way of getting the expanded content features also assumes that the instances are i.i.d. We will show that this way of using link information is not good enough to capture the structure information in relational data. On the contrary, PRPCA and SPRP, which are not based on the i.i.d. assumption, can provide more effective ways to model relational data. In what follows, we will refer to the original bag-of-words representation as the original content features.

Laplace Prior vs. Jeffreys Prior

We define the degree of sparsity (DoS) of W as follows:

DoS = (number of zero entries in W) / (dq) × 100%.
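As a quick illustration of this definition, the DoS of a d × q projection matrix W can be computed as follows (a hypothetical helper of our own, not code from the paper):

```python
import numpy as np

def degree_of_sparsity(W):
    """DoS = (number of zero entries in W) / (d * q) * 100, for W of shape (d, q)."""
    d, q = W.shape
    return np.count_nonzero(W == 0) / (d * q) * 100.0
```

A sparser W (higher DoS) directly lowers the projection cost discussed earlier, since only the nonzero entries of W contribute to the matrix product.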