[Fig. 8: SVM accuracy with N=3 (x-axis: number of features)]

5. Discussion

The Papits classifier uses kNN instead of SVM. From the standpoint of classification accuracy, there is not much difference between kNN and SVM, based on the results of our experiments. Furthermore, [12] reports that kNN and SVM have almost equivalent performance. If SVM were applied to the Papits classifier, Papits would have to generate a new classifier whenever users input new paper information or correct that information. Hence, Papits uses the kNN algorithm.

In the early stages of running Papits, there may not be enough registered papers as training data to identify the words most suitable for classifying papers into the multivalued category. The method proposed in this paper solves this problem: it transforms the multivalued category into a binary category. By increasing the amount of data in each category, the method makes it relatively easy to identify the characteristic words for classification. However, the system manager still has to input a certain amount of paper information manually.
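To make the retraining argument concrete, the following minimal sketch (our illustration, not the actual Papits implementation; the class name PaperKNN, the cosine metric, and k=3 are assumed choices) shows that a kNN classifier "trains" by merely storing manually classified papers, so user corrections can be absorbed without regenerating any model:

# Minimal sketch (not the Papits code): kNN over manually classified
# papers. "Training" is just storing vectors, so new or corrected
# papers are added in O(1); an SVM would need a full retrain instead.
import math
from collections import Counter

def cosine(u, v):
    # Cosine similarity between two sparse term-weight dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

class PaperKNN:
    def __init__(self, k=3):
        self.k = k
        self.papers = []          # list of (term_weights, category)

    def add(self, term_weights, category):
        # Registering a manually classified paper: no classifier
        # has to be regenerated, the example is simply stored.
        self.papers.append((term_weights, category))

    def classify(self, term_weights):
        # Majority vote among the k most similar stored papers.
        ranked = sorted(self.papers,
                        key=lambda p: cosine(term_weights, p[0]),
                        reverse=True)[:self.k]
        votes = Counter(cat for _, cat in ranked)
        return votes.most_common(1)[0][0] if votes else None

When a user corrects a misclassified paper, the corrected example is simply registered with add(); an SVM-based classifier would instead have to be regenerated from the whole document DB after every such update, which is why Papits favors kNN.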
6. Related Works

Feature selection is helpful in reducing noise in document representation, improving both classification and computational efficiency. Therefore, several methods of feature selection have been reported [4][11][13].

Yang [13] reported a comparative study of feature selection methods in statistical learning of text categorization. That work proposed methods that select every feature whose IG exceeds some threshold. Soucy [11] presented methods that combine IG and the cooccurrence of words: a set of features is first selected according to an IG criterion, then refined based on cooccurrence with a predetermined subset of highly ranked features. That method was evaluated on a binary classification task. Text classification in Papits, in contrast, must classify documents into a multivalued category, so few training examples are available per category and it is hard to collect enough training data. Our method considers the case in which Papits stores only a few training data, and transforms the multivalued category into a binary category so that the characteristic words can be identified more easily.

John [4] proposed feature selection in the wrapper model. This method finds all strongly relevant features and a useful subset of the weakly relevant features that yields good performance. The processing cost of identifying the weakly relevant features was very high, because the wrapper model repeats the evaluation for every subset of features. Our method instead considers subsets of categories, and the number of category subsets is much smaller than the number of feature subsets.
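As a concrete illustration, the sketch below gives one possible reading of this transformation (the grouping of fields into a positive set, the word-set document representation, and the names select_features and top_n are our assumptions, not the authors' exact procedure): the multivalued field labels are collapsed into a binary split, and words are ranked by information gain (IG) under that split.

# Sketch of the multivalued-to-binary transformation with IG ranking.
# Fields in positive_group are relabeled 1, all other fields 0, and
# each word is scored by how much the binary label's entropy drops
# when the word's presence is known.
import math

def entropy(pos, neg):
    # Binary entropy of a (pos, neg) count pair, in bits.
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

def info_gain(docs, labels, word):
    # docs: list of token sets; labels: parallel list of 0/1 labels.
    with_w = [l for d, l in zip(docs, labels) if word in d]
    without = [l for d, l in zip(docs, labels) if word not in d]
    n = len(labels)
    h_c = entropy(sum(labels), n - sum(labels))
    h_with = entropy(sum(with_w), len(with_w) - sum(with_w))
    h_wo = entropy(sum(without), len(without) - sum(without))
    return h_c - len(with_w) / n * h_with - len(without) / n * h_wo

def select_features(docs, categories, positive_group, top_n=100):
    # Binary transformation: a subset of categories -> 1, the rest -> 0.
    labels = [1 if c in positive_group else 0 for c in categories]
    vocab = set().union(*docs)
    return sorted(vocab, key=lambda w: info_gain(docs, labels, w),
                  reverse=True)[:top_n]

Because each side of the split pools the examples of several fields, the word statistics are estimated from more documents, which is what makes the characteristic words easier to identify when training data are scarce.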
7. Conclusions and Future Work

In this paper, we introduced an approach and a structure for implementing automatic classification in Papits. This structure gradually increases accuracy by using feedback from users. In this system, papers classified by the classifier are not used as training data, since their predicted labels cannot be guaranteed to be correct. An unclassified paper is classified by a classifier that uses only the manually classified papers in the document DB as training data.

The main problem in automatic text classification is to identify which words are most suitable for classifying documents into predefined classes. Automatic classification in Papits needs to classify documents into a multivalued category, since research is organized by field. To solve this problem, we proposed a feature selection method for text classification in Papits. It transforms the multivalued category into a binary category, which increases the amount of data in one category, and selects features using IG; this helps reduce noise in document representation and improves both classification and computational efficiency. We experimentally confirmed its efficacy.

One direction for future study is to develop a means of determining parameters suited to the task at hand, such as the number of features and the number of combinations of categories.

References

[1] C. J. C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, 2(2), pp. 121-167, 1998.

[2] N. Fujimaki, T. Ozono, and T. Shintani, Flexible Query Modifier for Research Support System Papits, Proceedings of the IASTED International Conference on Artificial and Computational Intelligence (ACI 2002), pp. 142-147, 2002.