IJCSNS International Journal of Computer Science and Network Security, VOL.6 No.5A, May 2006
Fig. 1 Work Flow of Classification in Papits.

Fig. 2 Browsing classified papers.

When users want to look for papers they are interested in, they can easily find them by tracing the category, by retrieval, or by using the recommender.

Figure 1 illustrates the Papits automatic classification process. Papits first collects papers from users, web sites, and other sources. At this step, the papers have not yet been classified. The unclassified papers are classified by a classifier that uses the manually classified papers in the document DB as training data. Here, we assume that classification aided by the user is correct, whereas papers classified by the classifier cannot be guaranteed to be perfectly correct.
Papers classified by the classifier are stored in the database as automatically classified papers and are not used as training data. While browsing, if a user corrects or certifies a category for a paper, it is stored as a manually classified paper. The training data thus grows through this step, and classification accuracy improves. Figure 2 shows the results of classification in Papits. When users want to look for papers of interest, they can find them based on the category of interest. Additionally, users can narrow the range of the field of survey based on subcategories. In this way, users can scrutinize their field of interest through the automatic paper classifier.

3 Automatic Classification

Automatic classification helps users locate papers by following their category of interest. The main problem in automatic text classification is to identify which words are the most suitable for classifying documents into predefined classes. This section discusses the text classification method for Papits and our feature selection method.

3.1 Text Classification Algorithm

k-Nearest Neighbor (kNN) and Support Vector Machine (SVM) have frequently been applied to text categorization [12]. Yang reports that kNN and SVM achieve almost equivalent performance [12]. Section 4 discusses the experimental results using these text classification algorithms.

3.1.1 kNN

The kNN algorithm is quite simple: kNN finds the k nearest neighbors of a test document among the training documents. The categories of these nearest neighbors are used to weight the category candidates. The similarity score of each neighbor document to the test document is used as the weight for the categories of that neighbor. If several of the k nearest neighbors share a category, then the per-neighbor weights for that category are added, and the weighted sum is used as the likelihood score for that category with respect to the test document.
By sorting the scores of the candidate categories, a ranked list is obtained for the test document. Similarity is typically measured with the cosine function:

\[
\cos(x_1, x_2) = \frac{\sum_{j=1}^{n} a_j(x_1) \cdot a_j(x_2)}{\sqrt{\sum_{j=1}^{n} a_j(x_1)^2} \cdot \sqrt{\sum_{j=1}^{n} a_j(x_2)^2}}
\]

where \(x_1\) and \(x_2\) are documents, \(x\) is the document vector \((a_1(x), a_2(x), \ldots, a_n(x))\), and \(a_j(x)\) is the weight of the \(j\)-th feature (word) in \(x\).
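As a concrete illustration, the weighted kNN scoring described above can be sketched as follows. This is a minimal sketch and not the actual Papits implementation: documents are assumed to be represented as sparse term-weight dictionaries, and the function names are our own.

```python
import math
from collections import defaultdict

def cosine(x1, x2):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * x2.get(term, 0.0) for term, w in x1.items())
    n1 = math.sqrt(sum(w * w for w in x1.values()))
    n2 = math.sqrt(sum(w * w for w in x2.values()))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    return dot / (n1 * n2)

def knn_classify(test_doc, training_docs, k=3):
    """Rank candidate categories for test_doc by similarity-weighted kNN voting.

    training_docs: list of (vector, category) pairs.
    Returns (category, score) pairs sorted by likelihood score, descending.
    """
    # Similarity of the test document to every training document.
    sims = [(cosine(test_doc, vec), cat) for vec, cat in training_docs]
    # Keep only the k nearest neighbors.
    sims.sort(key=lambda sc: sc[0], reverse=True)
    # Each neighbor votes for its category, weighted by its similarity;
    # neighbors sharing a category have their weights summed.
    scores = defaultdict(float)
    for sim, cat in sims[:k]:
        scores[cat] += sim
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy usage: two machine-learning papers and one networking paper.
training = [({"svm": 1.0, "kernel": 1.0}, "ML"),
            ({"knn": 1.0, "neighbor": 1.0}, "ML"),
            ({"tcp": 1.0, "router": 1.0}, "Networks")]
ranked = knn_classify({"kernel": 1.0, "svm": 0.5}, training, k=2)
print(ranked[0][0])  # the top-ranked category for the test document
```

In practice the weights \(a_j(x)\) would come from a term-weighting scheme such as tf-idf; here they are set by hand only to keep the example self-contained.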