because in practice one does not know how many positive documents are sufficient. Using a smaller set of positive documents also reduces the manual labeling effort.

From Figure 5, we also observe that Roc-Clu-SVM is slightly better than Roc-SVM in both F value and accuracy, especially when a is small, because a pure RN set is more important in such cases. When a is large, their results are similar. In practice, Roc-SVM may be sufficient due to its simplicity, efficiency and good results.

We also ran one-class SVM using the LIBSVM package (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). It gives very poor results (mostly with F values less than 0.4). Due to space limitations, we do not list these results here.

Comparison of different numbers of clusters: As discussed above, the number of clusters used in Roc-Clu-SVM makes little difference to the final classifier. We now provide experimental results in Figure 6 to support this claim, using different numbers of clusters (here, a = 25%; for other a values the results are similar). The cluster number k varies from 2 to 30. We observe that the choice of k has little influence on the results as long as it is not too small.

[Figure 6: Averaged F values for different numbers of clusters (a = 25%). x-axis: number of clusters (2 to 30); y-axis: averaged F value.]
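To make the clustering step concrete, the following is a minimal sketch, not the authors' implementation, of how the reliable negative set RN can be partitioned with k-means for different k, as in the Figure 6 experiment. It assumes scikit-learn, and `rn_docs` is a hypothetical list of reliable-negative documents as raw strings.

```python
# Minimal sketch of the clustering step in Roc-Clu-SVM (assumes scikit-learn).
# `rn_docs` is a hypothetical list of reliable-negative documents (strings).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_reliable_negatives(rn_docs, k):
    """Partition the reliable negative set RN into k clusters."""
    X = TfidfVectorizer().fit_transform(rn_docs)  # TF-IDF document vectors
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Group document indices by cluster; in Roc-Clu-SVM each cluster is then
    # used to build its own, purer, negative prototype (our reading of the method).
    clusters = [[] for _ in range(k)]
    for i, label in enumerate(labels):
        clusters[label].append(i)
    return clusters

# Figure 6 varies k from 2 to 30; per the paper, the results change little
# as long as k is not too small:
# for k in range(2, 31, 4):
#     clusters = cluster_reliable_negatives(rn_docs, k)
```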
Execution times: Our technique consists of two steps: extracting negative documents from U, and iteratively running SVM. As SVM is a standard technique, we will not discuss it here. If we only use Rocchio for the first step, it is very efficient because Rocchio is a linear algorithm, O(|U∪P|). If we use Rocchio with clustering, more time is needed because of the k-means clustering. The time complexity of k-means is O(k*|RN|*I), where k is the number of clusters, |RN| is the size of the reliable negative set, and I is the number of iterations. Since k and I are normally small, k-means is generally regarded as a linear algorithm, O(|RN|). Thus, the time complexity of the extraction step of Roc-Clu-SVM is O(|U∪P|), since |RN| < |U∪P|. In our experiments, every dataset takes less than 7 seconds for step one of Roc-Clu-SVM (on a Pentium 800MHz PC with 256MB of memory).
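To illustrate why step one is linear in |U∪P|, here is a hedged sketch of a Rocchio-style extraction of reliable negatives: two prototype vectors are built in a single pass, treating all of U as negative, and each document in U that is closer to the negative prototype is kept as a reliable negative. This is a simplified reading of the Roc-SVM extraction step, not the authors' exact code; the Rocchio weights alpha = 16 and beta = 4 are commonly used defaults, assumed here.

```python
import numpy as np

def extract_reliable_negatives(P, U, alpha=16.0, beta=4.0):
    """Rocchio-style step one of Roc-SVM (simplified sketch).

    P, U: (n_docs, n_features) TF-IDF matrices (dense for simplicity).
    Prototypes are built in one pass over U and P, and each document in U
    is scored once, so the whole step is linear, O(|U ∪ P|).
    """
    def unit(M):
        n = np.linalg.norm(M, axis=-1, keepdims=True)
        return M / np.maximum(n, 1e-12)

    Pn, Un = unit(P), unit(U)
    # Positive prototype: pull toward P, push away from U (treated as negative).
    c_pos = unit(alpha * Pn.mean(axis=0) - beta * Un.mean(axis=0))
    # Negative prototype: the mirror image.
    c_neg = unit(alpha * Un.mean(axis=0) - beta * Pn.mean(axis=0))
    # A document in U is a reliable negative if its cosine similarity to the
    # negative prototype exceeds its similarity to the positive prototype.
    rn_mask = Un @ c_neg > Un @ c_pos
    return U[rn_mask]
```

On top of the RN set produced by this step, the technique then trains the SVM classifier iteratively, starting from P versus RN.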
5 Conclusions

This paper studied the problem of text classification with only partial information, i.e., with only one class of labeled documents and a set of unlabeled documents. An effective technique is proposed to solve the problem. Our algorithm first utilizes the Rocchio classifier and/or clustering to extract a set of reliable negative documents from the unlabeled set, and then builds an SVM classifier iteratively. Experimental results show that the proposed technique is more robust than both S-EM and PEBL.

Acknowledgement: This research is supported by a research grant from A-STAR and NUS to the second author.

References

[Basu et al., 2002] S. Basu, A. Banerjee, & R. Mooney. Semi-supervised clustering by seeding. ICML-2002, 2002.
[Blum & Mitchell, 1998] A. Blum & T. Mitchell. Combining labeled & unlabeled data with co-training. COLT-98, 1998.
[Bockhorst & Craven, 2002] J. Bockhorst & M. Craven. Exploiting relations among concepts to acquire weakly labeled training data. ICML-2002, 2002.
[Buckley et al., 1994] C. Buckley, G. Salton, & J. Allan. The effect of adding relevance information in a relevance feedback environment. SIGIR-94, 1994.
[Dempster et al., 1977] A. Dempster, N. Laird, & D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 1977.
[Denis, 1998] F. Denis. PAC learning from positive statistical queries. ALT-98, pages 112-126, 1998.
[Goldman & Zhou, 2000] S. Goldman & Y. Zhou. Enhancing supervised learning with unlabeled data. ICML-2000, 2000.
[Joachims, 1999] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, & A. Smola (eds.), Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999.
[Kearns, 1998] M. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45:983-1006, 1998.
[Letouzey et al., 2000] F. Letouzey, F. Denis, R. Gilleron, & E. Grappa. Learning from positive and unlabeled examples. ALT-00, 2000.
[Lewis & Gale, 1994] D. Lewis & W. Gale. A sequential algorithm for training text classifiers. SIGIR-94, 1994.
[Liu et al., 2002] B. Liu, W. Lee, P. Yu, & X. Li. Partially supervised classification of text documents. ICML-2002, 2002.
[Muslea et al., 2002] I. Muslea, S. Minton, & C. Knoblock. Active + semi-supervised learning = robust multi-view learning. ICML-2002, 2002.
[Muggleton, 2001] S. Muggleton. Learning from positive data. Machine Learning, 2001, to appear.
[Nigam et al., 1998] K. Nigam, A. McCallum, S. Thrun, & T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 2000.
[Rocchio, 1971] J. Rocchio. Relevance feedback in information retrieval. In G. Salton (ed.), The SMART Retrieval System: Experiments in Automatic Document Processing. Englewood Cliffs, NJ: Prentice-Hall, 1971.
[Salton & McGill, 1983] G. Salton & M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[Scholkopf et al., 1999] B. Schölkopf, J. Platt, J. Shawe-Taylor, A. Smola, & R. Williamson. Estimating the support of a high-dimensional distribution. Technical Report MSR-TR-99-87, Microsoft Research, 1999.
[Vapnik, 1995] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
[Yang & Liu, 1999] Y. Yang & X. Liu. A re-examination of text categorization methods. SIGIR-99, 1999.
[Yu et al., 2002] H. Yu, J. Han, & K. Chang. PEBL: Positive example based learning for Web page classification using SVM. KDD-02, 2002.