3.2. Paper collection agent

There are many papers which can be accessed by the public via the Internet, and a paper collection agent is able to collect these papers as PDF files. A conventional paper collection agent can collect files from researchers' web sites, the Research Index3, the ACM Digital Library4, and Science Direct5.

3.3. Paper categorization agent

The paper categorization agent categorizes papers included within the database. Initially, papers in the database are not categorized, but eventually the paper categorization agent categorizes papers into pre-defined categories based on existing classifiers [13]. Automatic classification helps users locate papers by following their category of interest.
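The paper does not give code for the collection agent; as an illustration only, a minimal sketch of an agent that finds and downloads publicly accessible PDF files might look like the following (the URLs and function names here are hypothetical, not part of the Papits system):

```python
import re
from urllib.request import urlopen

def extract_pdf_links(html, base_url=""):
    """Return the PDF URLs referenced by anchor tags in an HTML page.

    A collection agent can crawl a researcher's web site or a
    digital-library listing page and gather the PDF links it finds.
    """
    links = re.findall(r'href="([^"]+\.pdf)"', html, flags=re.IGNORECASE)
    # Prefix relative links with the page's base URL.
    return [l if l.startswith("http") else base_url + l for l in links]

def download_pdf(url, path):
    """Fetch one PDF and store it locally (requires network access)."""
    with urlopen(url) as resp, open(path, "wb") as out:
        out.write(resp.read())
```

A real agent would add politeness delays, deduplication against the database, and error handling, but the crawl-extract-download loop is the core of the idea.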
The main problem in automatic text classification is to identify what words are the most suitable to classify documents in predefined classes. This section discusses the text classification method for Papits and our feature selection method.

In Papits, automatic classification needs to classify documents into the multivalued category, because research is organized in various fields. However, feature selection becomes sensitive to noise and irrelevant data compared to cases with few categories. There may also not be enough registered papers as training data to identify the most suitable words to classify into the multivalued category in Papits. We propose feature selection to classify documents, which are represented as bags of words, into the multivalued category. Several existing feature selection techniques use some metric to determine the relevance of a term with regard to the classification criterion. Information gain (IG) is often used in text classification in the bag-of-words approach.

4. Evaluation Experiments

We measured our method's effectiveness in terms of recall and analysis misrecognition. We evaluated the effectiveness by comparing our method to the Vector Space Model (VSM) [21], a co-occurrence-based thesaurus [11], and IRM [16]. IRM is a mechanism that supports users in Web browsing, similar to our method. However, IRM does not consider long- and short-range interests, so we evaluated whether our method can measure long-range and short-range interests. We used Equation 7 to measure whether our method was more reliable than the other existing methods. Additionally, by comparing Equation 7 to Equation 6, we measured the topic model mentioned in Section 2.3, from the point of view of characteristics (2) and (3), at points in time six, four, and two months ago. The experiment is as follows.

3 http://citeseer.ist.psu.edu/
4 http://portal.acm.org/dl.cfm
5 http://www.sciencedirect.com/
We collected users' models, which were measured, and papers which were read over an eight-month period. Based on our method, we eliminated stopwords and stemmed words as a preprocessing step. Additionally, we added words and word co-occurrences to the network based on the order of the read papers, and represented the network as the user's model.

We used papers that were stored in the Papits database. Over 10,000 papers stored in the Papits database that describe information technology in English were included.

4.1. Vector Space Model

The vector space model [20] is widely used in information retrieval systems. In this model, documents and queries are represented as bags of terms, and statistics concerning these terms and the documents they appear in are gathered together into an index. In the index, each distinct term t has an associated document frequency, denoted ft, which indicates the number of documents it appears in. In addition, each term is associated with an inverted list of pointers <d, fd,t> recording that term t appears in document d a total of fd,t times. Moreover, each document d has a corresponding value Wd associated with it, its document length, which is calculated as a function of ft and fd,t for the terms in that document. Generally speaking, Wd is greater when a document becomes physically longer, but Wd usually depends also upon the relative scarcity of the terms in the document.

In ranking a query q against the database, the vector space model employs a similarity heuristic to calculate a score Sq,d between q and each document d of the database. Sq,d can be described as

Sq,d = Σ_{t ∈ q∩d} wd,t × wq,t

where the values of wd,t and wq,t, called term impacts or simply impacts [2], represent the degree of "importance" of term t in document d and query q respectively, and are calculated from fd,t, fq,t, ft, Wd, and Wq.
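The inverted index and the score Sq,d described above can be sketched in a few lines. Note that the impacts wd,t and wq,t are instantiated here with a common TF-IDF weighting; this is only one possible formulation, chosen as an assumption for illustration, and not necessarily the one used in the paper's experiments:

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Build an inverted index: term t -> list of <d, fd_t> pointers,
    plus an IDF table and document lengths W_d (norm of TF-IDF weights)."""
    n = len(docs)
    index = defaultdict(list)
    for d, text in enumerate(docs):
        for t, fdt in Counter(text.split()).items():
            index[t].append((d, fdt))
    # The document frequency f_t is the length of term t's inverted list.
    idf = {t: math.log(1 + n / len(plist)) for t, plist in index.items()}
    W = [0.0] * n
    for t, plist in index.items():
        for d, fdt in plist:
            W[d] += (fdt * idf[t]) ** 2
    W = [math.sqrt(w) or 1.0 for w in W]
    return index, idf, W

def rank(query, index, idf, W):
    """Score S_{q,d} = sum over t in q∩d of w_{d,t} * w_{q,t}."""
    scores = defaultdict(float)
    for t, fqt in Counter(query.split()).items():
        for d, fdt in index.get(t, []):
            scores[d] += (fdt * idf[t] / W[d]) * (fqt * idf[t])
    return sorted(scores.items(), key=lambda s: -s[1])
```

Only documents sharing at least one term with the query receive a score, which is why real systems evaluate Sq,d by walking the inverted lists of the query terms rather than iterating over all documents.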
It should be noted that we employ a common notation for document impacts and query impacts (that is, the impact values of document terms and query terms respectively) for simplicity, and that they can in fact have different formulations in terms of the underlying values fd,t, fq,t, ft, Wd, and Wq.

4.2. Co-occurrence based thesaurus

Terms used in documents in a sentence differ from one another and meanings of a term differ, depending on the

Proceedings of the 2005 International Workshop on Data Engineering Issues in E-Commerce (DEEC'05)
0-7695-2401-X/05 $20.00 © 2005 IEEE