Learning to Classify Texts Using Positive and Unlabeled Data

Xiaoli Li
School of Computing
National University of Singapore / Singapore-MIT Alliance
Singapore 117543
lixl@comp.nus.edu.sg

Bing Liu
Department of Computer Science
University of Illinois at Chicago
851 South Morgan Street
Chicago, IL 60607-7053
liub@cs.uic.edu

Abstract

In traditional text classification, a classifier is built using labeled training documents of every class. This paper studies a different problem. Given a set P of documents of a particular class (called the positive class) and a set U of unlabeled documents that contains documents from class P and also other types of documents (called negative class documents), we want to build a classifier to classify the documents in U into documents from P and documents not from P. The key feature of this problem is that there is no labeled negative document, which makes traditional text classification techniques inapplicable. In this paper, we propose an effective technique to solve the problem. It combines the Rocchio method and the SVM technique for classifier building.
Experimental results show that the new method outperforms existing methods significantly.

1 Introduction

Text classification is an important problem and has been studied extensively in information retrieval and machine learning. To build a text classifier, the user first collects a set of training examples, which are labeled with pre-defined classes (labeling is often done manually). A classification algorithm is then applied to the training data to build a classifier. This approach to building classifiers is called supervised learning/classification because the training examples/documents all have pre-labeled classes.

This paper studies a special form of semi-supervised text classification. This problem can be regarded as a two-class (positive and negative) classification problem, where there are only labeled positive training data, but no labeled negative training data. Due to the lack of negative training data, the classifier building is thus semi-supervised. Since traditional classification techniques require both labeled positive and negative examples to build a classifier, they are not suitable for this problem. Although it is possible to manually label some negative examples, doing so is labor-intensive and very time-consuming. In this research, we want to build a classifier using only a set of positive examples and a set of unlabeled examples. Collecting unlabeled examples or documents is normally easy and inexpensive in many text or Web page domains, especially those involving online sources [Nigam et al., 1998; Liu et al., 2002].

In [Liu et al., 2002; Yu et al., 2002], two techniques are proposed to solve the problem. One is based on the EM algorithm [Dempster et al., 1977] (called S-EM) and the other is based on the Support Vector Machine (SVM) [Vapnik, 1995] (called PEBL). However, both techniques have some major shortcomings. S-EM is not accurate because of its weak classifier.
PEBL is not robust because it performs well in certain situations and fails badly in others. We will discuss these two techniques in detail in Section 2 and compare their results with the proposed technique in Section 4.

As discussed in our earlier work [Liu et al., 2002], positive class based learning occurs in many applications. With the growing volume of text documents on the Web, Internet news feeds, and digital libraries, one often wants to find those documents that are related to one's interest. For instance, one may want to build a repository of machine learning (ML) papers. One can start with an initial set of ML papers (e.g., an ICML Proceedings). One can then find those ML papers from related online journals or conference series, e.g., AI Journal, AAAI, IJCAI, etc.

The ability to build classifiers without negative training data is particularly useful if one needs to find positive documents from many text collections or sources. Given a new collection, the algorithm can be run to find those positive documents. Following the above example, given a collection of AAAI papers (unlabeled set), one can run the algorithm to identify those ML papers. Given a set of SIGIR papers, one can run the algorithm again to find those ML papers. In general, one cannot use the classifier built using the AAAI collection to classify the SIGIR collection because the two collections are from different domains. In traditional classification, labeling of negative documents is needed for each collection. A user would obviously prefer techniques that can provide accurate classification without manually labeling any negative documents.

This paper proposes a more effective and robust technique to solve the problem. It is based on the Rocchio method [Rocchio, 1971] and SVM. The idea is to first use Rocchio to extract some reliable negative documents from the unlabeled set and then apply SVM iteratively to build and to select a classifier.
Experimental results show that the new method outperforms existing methods significantly.
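The first step of the proposed approach (using Rocchio to extract reliable negative documents from the unlabeled set) can be sketched as follows. This is a minimal illustration in pure Python on toy term-weight vectors, not the paper's implementation: the prototype weights `alpha` and `beta` and the toy documents `P` and `U` are assumptions for demonstration only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rocchio_prototype(members, others, alpha=16.0, beta=4.0):
    """Rocchio prototype: alpha * centroid(members) - beta * centroid(others)."""
    dim = len(members[0])
    return [
        alpha * sum(d[i] for d in members) / len(members)
        - beta * sum(d[i] for d in others) / len(others)
        for i in range(dim)
    ]

def reliable_negatives(P, U):
    """Return the documents in U that lie closer to the 'negative' prototype.

    For this first step, all of U is treated as (noisy) negative data when
    building the prototypes; U documents more similar to the negative
    prototype than to the positive one are kept as reliable negatives.
    """
    c_pos = rocchio_prototype(P, U)
    c_neg = rocchio_prototype(U, P)
    return [d for d in U if cosine(d, c_neg) > cosine(d, c_pos)]

# Toy example: 2-term vectors; positive documents load on term 0.
P = [(1.0, 0.0), (0.9, 0.1)]
U = [(0.95, 0.05), (0.1, 0.9), (0.05, 0.95)]
RN = reliable_negatives(P, U)
```

In the full method, the extracted set `RN` would then seed iterative SVM training against `P`, with the classifiers built at each round used to reclassify the remaining unlabeled documents and to select a final classifier; that loop is omitted from this sketch.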