Learning to Classify Texts Using Positive and Unlabeled Data

Xiaoli Li
School of Computing
National University of Singapore / Singapore-MIT Alliance
Singapore 117543
lixl@comp.nus.edu.sg

Bing Liu
Department of Computer Science
University of Illinois at Chicago
851 South Morgan Street
Chicago, IL 60607-7053
liub@cs.uic.edu
Abstract

In traditional text classification, a classifier is built using labeled training documents of every class. This paper studies a different problem. Given a set P of documents of a particular class (called the positive class) and a set U of unlabeled documents that contains documents from class P and also other types of documents (called negative class documents), we want to build a classifier to classify the documents in U into documents from P and documents not from P. The key feature of this problem is that there is no labeled negative document, which makes traditional text classification techniques inapplicable. In this paper, we propose an effective technique to solve the problem. It combines the Rocchio method and the SVM technique for classifier building. Experimental results show that the new method outperforms existing methods significantly.

1 Introduction

Text classification is an important problem and has been studied extensively in information retrieval and machine learning. To build a text classifier, the user first collects a set of training examples, which are labeled with pre-defined classes (labeling is often done manually). A classification algorithm is then applied to the training data to build a classifier. This approach to building classifiers is called supervised learning/classification because the training examples/documents all have pre-labeled classes.

This paper studies a special form of semi-supervised text classification. This problem can be regarded as a two-class (positive and negative) classification problem, where there are only labeled positive training data, but no labeled negative training data. Due to the lack of negative training data, the classifier building is thus semi-supervised. Since traditional classification techniques require both labeled positive and negative examples to build a classifier, they are not suitable for this problem. Although it is possible to manually label some negative examples, it is labor-intensive and very time consuming. In this research, we want to build a classifier using only a set of positive examples and a set of unlabeled examples. Collecting unlabeled examples or documents is normally easy and inexpensive in many text or Web page domains, especially those involving online sources [Nigam et al., 1998; Liu et al., 2002].

In [Liu et al., 2002; Yu et al., 2002], two techniques are proposed to solve the problem. One is based on the EM algorithm [Dempster et al., 1977] (called S-EM) and the other is based on Support Vector Machines (SVM) [Vapnik, 1995] (called PEBL). However, both techniques have some major shortcomings. S-EM is not accurate because of its weak classifier. PEBL is not robust because it performs well in certain situations and fails badly in others. We discuss these two techniques in detail in Section 2 and compare their results with the proposed technique in Section 4.

As discussed in our earlier work [Liu et al., 2002], positive class based learning occurs in many applications. With the growing volume of text documents on the Web, Internet news feeds, and digital libraries, one often wants to find those documents that are related to one's interest. For instance, one may want to build a repository of machine learning (ML) papers. One can start with an initial set of ML papers (e.g., an ICML Proceedings). One can then find those ML papers from related online journals or conference series, e.g., the AI journal, AAAI, IJCAI, etc.
The ability to build classifiers without negative training data is particularly useful if one needs to find positive documents from many text collections or sources. Given a new collection, the algorithm can be run to find those positive documents. Following the above example, given a collection of AAAI papers (unlabeled set), one can run the algorithm to identify those ML papers. Given a set of SIGIR papers, one can run the algorithm again to find those ML papers. In general, one cannot use the classifier built using the AAAI collection to classify the SIGIR collection because they are from different domains. In traditional classification, labeling of negative documents is needed for each collection. A user would obviously prefer techniques that can provide accurate classification without manually labeling any negative documents.

This paper proposes a more effective and robust technique to solve the problem. It is based on the Rocchio method [Rocchio, 1971] and SVM. The idea is to first use Rocchio to extract some reliable negative documents from the unlabeled set and then apply SVM iteratively to build and to select a classifier. Experimental results show that the new method outperforms existing methods significantly.
2 Related Work

Many studies on text classification have been conducted in the past. Existing techniques include Rocchio, naive Bayes (NB), SVM, and many others [e.g., Rocchio, 1971; Lewis & Gale, 1994; Vapnik, 1995; Joachims, 1999; Nigam et al., 1998; Yang & Liu, 1999]. These existing techniques, however, all require labeled training data of all classes. They are not designed for positive class based classification.

In [Denis, 1998], a theoretical study of PAC learning from positive and unlabeled examples under the statistical query model [Kearns, 1998] is reported. [Letouzey et al., 2000] presents an algorithm for learning using a modified decision tree algorithm based on the statistical query model. It is not for text classification, but for normal data. In [Muggleton, 2001], Muggleton studied the problem in a Bayesian framework where the distribution of functions and examples is assumed known. The result is similar to that in [Liu et al., 2002] in the noiseless case. [Liu et al., 2002] extends the result to the noisy case. The purpose of this paper is to propose a robust and practical method to improve existing techniques.

In [Liu et al., 2002], Liu et al. proposed a method (called S-EM) to solve the problem in the text domain. It is based on naïve Bayesian classification (NB) and the EM algorithm [Dempster et al., 1977]. The main idea of the method is to first use a spy technique to identify some reliable negative documents from the unlabeled set. It then runs EM to build the final classifier. Since NB is not a strong classifier for texts, we will show that the proposed technique is much more accurate than S-EM.

In [Yu et al., 2002], Yu et al. proposed an SVM-based technique (called PEBL) to classify Web pages given positive and unlabeled pages. The core idea is the same as that in [Liu et al., 2002], i.e., (1) identifying a set of reliable negative documents from the unlabeled set (called strong negative documents in PEBL), and (2) building a classifier using SVM. The difference between the two techniques is that for both steps the two techniques use different methods. In [Yu et al., 2002], strong negative documents are those documents that do not contain any features of the positive data. After a set of strong negative documents is identified, SVM is applied iteratively to build a classifier. PEBL is sensitive to the number of positive examples. When the positive data is small, the results are often very poor. The proposed method differs from PEBL in that we perform negative data extraction from the unlabeled set using the Rocchio method and clustering. Although our second step also runs SVM iteratively to build a classifier, there is a key difference. Our technique selects a good classifier from a set of classifiers built by SVM, while PEBL does not. The proposed method performs well consistently under a variety of conditions.

In [Scholkopf et al., 1999], one-class SVM is proposed. This technique uses only positive data to build an SVM classifier. However, our experimental results show that its classification performance is poor.

Another line of related work is the use of a small labeled set of every class and a large unlabeled set for learning [Nigam et al., 1998; Blum & Mitchell, 1998; Goldman & Zhou, 2000; Basu et al., 2002; Muslea et al., 2002; Bockhorst & Craven, 2002]. It has been shown that the unlabeled data helps classification. This is different from our work, as we do not have any labeled negative training data.
3 The Proposed Technique

This section presents the proposed approach, which also consists of two steps: (1) extracting some reliable negative documents from the unlabeled set, and (2) applying SVM iteratively to build a classifier. Below, we present these two steps in turn. In Section 4, we show that our technique is much more robust than existing techniques.

3.1 Finding reliable negative documents

This sub-section gives two methods for finding reliable negative documents: (1) Rocchio and (2) Rocchio with clustering. The second method often results in better classifiers, especially when the positive set is small. The key requirement for this step is that the identified negative documents from the unlabeled set must be reliable or pure, i.e., with no or very few positive documents, because SVM (used in our second step) is very sensitive to noise.

3.1.1 Method 1: Rocchio

This method treats the entire unlabeled set U as negative documents and then uses the positive set P and U as the training data to build a Rocchio classifier. The classifier is then used to classify U. Those documents that are classified as negative are considered (reliable) negative data, denoted by RN. The algorithm is shown in Figure 1.

1. Assign the unlabeled set U the negative class, and the positive set P the positive class;
2. Let $\vec{c}^+ = \alpha \frac{1}{|P|} \sum_{\vec{d} \in P} \frac{\vec{d}}{\|\vec{d}\|} - \beta \frac{1}{|U|} \sum_{\vec{d} \in U} \frac{\vec{d}}{\|\vec{d}\|}$;
3. Let $\vec{c}^- = \alpha \frac{1}{|U|} \sum_{\vec{d} \in U} \frac{\vec{d}}{\|\vec{d}\|} - \beta \frac{1}{|P|} \sum_{\vec{d} \in P} \frac{\vec{d}}{\|\vec{d}\|}$;
4. for each document $\vec{d}'$ in U do
5.   if $sim(\vec{c}^+, \vec{d}') \le sim(\vec{c}^-, \vec{d}')$ then
6.     RN = RN ∪ {$\vec{d}'$};

Figure 1: Rocchio extraction using U as negative

In Rocchio classification, each document d is represented as a vector [Salton & McGill, 1983], $\vec{d} = (q_1, q_2, \ldots, q_n)$. Each element $q_i$ in $\vec{d}$ represents a word $w_i$ and is calculated as the combination of term frequency (tf) and inverse document frequency (idf), i.e., $q_i = tf_i * idf_i$. $tf_i$ is the number of times that word $w_i$ occurs in d, while $idf_i = \log(|D| / df(w_i))$. Here |D| is the total number of documents and $df(w_i)$ is the number of documents in which word $w_i$ occurs at least once.

Building a classifier is achieved by constructing the positive and negative prototype vectors $\vec{c}^+$ and $\vec{c}^-$ (lines 2 and 3 in Figure 1). The α and β parameters adjust the relative impact of positive and negative training examples. α = 16 and β = 4 are recommended in [Buckley et al., 1994].
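To make the extraction step concrete, here is a minimal sketch of the Figure 1 procedure in Python. It assumes the tf-idf document vectors for P and U are already available as rows of NumPy arrays; the function names and the dense-array representation are our own illustrative choices, not part of the original system.

```python
# A minimal sketch of Method 1 (Figure 1), assuming P and U are dense tf-idf matrices
# whose rows are document vectors. alpha = 16 and beta = 4 follow [Buckley et al., 1994].
import numpy as np

def l2_normalize(X, eps=1e-12):
    """Normalize each row d to d / ||d||."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)

def rocchio_extract_rn(P, U, alpha=16.0, beta=4.0):
    """Return indices of documents in U classified as (reliable) negative, i.e., RN."""
    Pn, Un = l2_normalize(P), l2_normalize(U)
    c_pos = alpha * Pn.mean(axis=0) - beta * Un.mean(axis=0)   # line 2 of Figure 1
    c_neg = alpha * Un.mean(axis=0) - beta * Pn.mean(axis=0)   # line 3 of Figure 1
    # Cosine similarity: rows of Un are unit vectors, so a dot product with a
    # unit-normalized prototype is the cosine measure.
    c_pos_u = c_pos / max(np.linalg.norm(c_pos), 1e-12)
    c_neg_u = c_neg / max(np.linalg.norm(c_neg), 1e-12)
    sim_pos = Un @ c_pos_u
    sim_neg = Un @ c_neg_u
    return np.where(sim_pos <= sim_neg)[0]                     # lines 4-6 of Figure 1
```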
In classification, for each test document $\vec{d}'$, the classifier simply uses the cosine measure [Salton & McGill, 1983] to compute the similarity (sim) of $\vec{d}'$ with each prototype vector. The class whose prototype vector is more similar to $\vec{d}'$ is assigned to the test document (lines 4-6 in Figure 1). Those documents classified as negative form the negative set RN.

In general, it is also possible to use other classification methods (e.g., naïve Bayesian, SVM) to extract negative data from the unlabeled set. However, not all classification techniques perform well in the absence of a "pure" negative training set. We experimented with SVM, naïve Bayesian and Rocchio, and observed that Rocchio produces the best results. It is able to extract a large number of true negative documents from U with very high precision.

The reason that Rocchio works well is as follows. In our positive class based learning, the unlabeled set U typically has the following characteristics:
1. The proportion of positive documents in U is very small. Thus, they do not affect $\vec{c}^-$ very much.
2. The negative documents in U are of diverse topics. Thus, in the vector space, they cover a very large region.

Since the positive set P is usually homogeneous (focusing on one topic), its documents cover a much smaller region in the vector space. Assume that there is a true decision surface S that separates positive and negative data. Then, the positive prototype vector $\vec{c}^+$ will be closer to S than the negative prototype vector $\vec{c}^-$ due to the vector summation in Rocchio (lines 2 and 3 of Figure 1). For this reason, when we apply the similarity computation (in line 5 of Figure 1) to classify documents, many negative documents will be classified as positive because they are closer to $\vec{c}^+$. This explains why Rocchio can extract negative documents with high precision, and positive documents with low precision (but very high recall). Our experiments show that this is indeed the case.

After some reliable negative documents are found from U by Rocchio, we can run SVM using P and RN. Experimental results show that this simple approach works very well. Note that we could actually use $\vec{c}^+$ and $\vec{c}^-$ in Figure 1 as the final classifier without using SVM. However, it does not perform well because it tends to classify too many negative documents as positive documents (low precision for positive). SVM using RN and P as training data can correct the situation and produce a more accurate final classifier.

3.1.2 Method 2: Rocchio with Clustering

Rocchio is a linear classifier based on cosine similarity. When the decision boundary is non-linear or does not conform to the separating plane resulting from cosine similarity, Rocchio may still extract some positive documents and put them in RN. This will damage the SVM classifier later. Below, we propose an enhancement to the Rocchio approach in order to purify RN further, i.e., to discard some likely positive documents from RN.

Figure 2 shows a possible scenario where some positive documents in U may still be in RN. $\vec{c}^+$ and $\vec{c}^-$ represent the positive and negative prototype vectors respectively. H is the decision hyperplane produced by Rocchio. It has the same distance (similarity) to $\vec{c}^+$ and $\vec{c}^-$. If a document is located on the right-hand side of H, it is classified as negative; otherwise it is classified as positive. However, the positive and negative classes in the data cannot be separated by H well.
In Figure 2, the positive documents within regions 1 and 2 will be misclassified as negative documents.

Figure 2: Rocchio classifier is unable to remove some positive examples in U

In order to further "purify" RN, we need to deal with the above problem. We propose the following approach. This approach uses clustering to partition RN into many similarity groups (or clusters). It then uses Rocchio again to build a classifier using each cluster and the positive set P. The classifier is then applied to identify likely positive documents in the cluster and delete them. The idea is to identify and remove some positive documents in RN in a localized manner, which gives better accuracy. Since both the cluster and the positive set are homogeneous, they allow Rocchio to build better prototype vectors.

The clustering technique that we use is k-means, which is an efficient technique. The k-means method needs the input cluster number k. From our experiments in Section 4, we observe that the choice of k does not affect classification results much as long as it is not too small. Too few clusters may not be effective because the documents in a cluster can still be diverse; then Rocchio still cannot build an accurate classifier. Too many clusters will not cause much problem. This new method can further purify RN without removing too many negative documents from RN. The detailed algorithm is given in Figure 3.

Note that in line 1 we still perform the initial Rocchio extraction as discussed in Section 3.1.1. Thus, the input data for clustering is RN, not U. The reason is that the purity of RN is higher, which allows us to produce better clusters without the influence of too many noisy (positive) documents in U. Lines 2-3 perform the standard k-means clustering of RN, which produces k clusters, N1, N2, …, Nk. Based on these clusters, we construct a positive prototype vector $\vec{p}_j$ and a negative prototype vector $\vec{n}_j$ for the positive set P and each cluster $N_j$ ($j \in \{1, 2, \ldots, k\}$) (lines 5 and 6). Line 8 starts the extraction. For each document $\vec{d}_i \in RN$, we try to find the nearest positive prototype vector $\vec{p}_v$ to $\vec{d}_i$ (line 9). If the similarity $sim(\vec{p}_v, \vec{d}_i)$ is less than the similarity between $\vec{d}_i$ and any negative prototype vector $\vec{n}_j$, we put it in our final negative set RN′ (line 10). The reason for using the procedures in lines 9 and 10 is that we want to be conservative in removing likely positive documents from RN, i.e., not to remove too many negative documents.
1. Perform the algorithm in Figure 1 and generate the initial negative set RN;
2. Choose k initial cluster centers {O1, O2, …, Ok} randomly from RN;
3. Perform k-means clustering to produce k clusters, i.e., N1, N2, …, Nk;
4. for j = 1 to k
5.   $\vec{p}_j = \alpha \frac{1}{|P|} \sum_{\vec{d} \in P} \frac{\vec{d}}{\|\vec{d}\|} - \beta \frac{1}{|N_j|} \sum_{\vec{d} \in N_j} \frac{\vec{d}}{\|\vec{d}\|}$;
6.   $\vec{n}_j = \alpha \frac{1}{|N_j|} \sum_{\vec{d} \in N_j} \frac{\vec{d}}{\|\vec{d}\|} - \beta \frac{1}{|P|} \sum_{\vec{d} \in P} \frac{\vec{d}}{\|\vec{d}\|}$;
7. RN′ = {};
8. for each document $\vec{d}_i \in RN$ do
9.   Find the nearest positive prototype vector $\vec{p}_v$ to $\vec{d}_i$, where $v = \arg\max_j sim(\vec{p}_j, \vec{d}_i)$;
10.  if there exists an $\vec{n}_j$ (j = 1, 2, …, k) s.t. $sim(\vec{p}_v, \vec{d}_i) \le sim(\vec{n}_j, \vec{d}_i)$ then
       RN′ = RN′ ∪ {$\vec{d}_i$};

Figure 3: Rocchio extraction with clustering

After RN′ is determined, we can build the final classifier using SVM, which takes P and RN′ as the positive and negative training data respectively.
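As an illustration only, the following sketch implements the Figure 3 purification step with scikit-learn's KMeans, reusing the tf-idf conventions and the l2_normalize helper from the earlier Method 1 sketch; the function name, the dense-array representation, and the default k = 10 are our own assumptions, not part of the published system.

```python
# A minimal sketch of Method 2 (Figure 3): purify RN with k-means plus per-cluster Rocchio.
# P and RN are dense tf-idf matrices (rows = documents); l2_normalize() is assumed available.
import numpy as np
from sklearn.cluster import KMeans

def rocchio_clu_extract(P, RN, k=10, alpha=16.0, beta=4.0):
    """Return row indices into RN that are kept as reliable negatives (RN')."""
    Pn, RNn = l2_normalize(P), l2_normalize(RN)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(RNn)  # lines 2-3

    p_protos, n_protos = [], []
    for j in range(k):                                   # lines 4-6 of Figure 3
        Nj = RNn[labels == j]
        p_protos.append(alpha * Pn.mean(axis=0) - beta * Nj.mean(axis=0))
        n_protos.append(alpha * Nj.mean(axis=0) - beta * Pn.mean(axis=0))
    p_protos = l2_normalize(np.vstack(p_protos))         # unit prototypes -> dot = cosine
    n_protos = l2_normalize(np.vstack(n_protos))

    sim_p = RNn @ p_protos.T                             # similarities to all p_j
    sim_n = RNn @ n_protos.T                             # similarities to all n_j
    best_p = sim_p.max(axis=1)                           # line 9: sim(p_v, d_i)
    keep = best_p <= sim_n.max(axis=1)                   # line 10: some n_j is at least as close
    return np.where(keep)[0]
```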
3.2 Step 2: Classifier building

This step builds the final classifier by running SVM iteratively with the document sets P and RN or RN′, depending on the method used in step 1. Let Q be the set of the remaining unlabeled documents, Q = U − RN (or U − RN′). The algorithm is given in Figure 4.

1. Every document in P is assigned the class label +1;
2. Every document in RN (or RN′) is assigned the class label −1;
3. Use P and RN (or RN′) to train an SVM classifier $S_i$, with i = 1 initially and i = i + 1 with each iteration (lines 3-5);
4. Classify Q using $S_i$. Let the set of documents in Q that are classified as negative be W;
5. if W = {} then stop; else Q = Q − W, RN = RN ∪ W (or RN′ = RN′ ∪ W), goto (3);
6. Use the last SVM classifier $S_{last}$ to classify P;
7. if more than 5% of the positive documents in P are classified as negative, use $S_1$ as the final classifier; else use $S_{last}$ as the final classifier.

Figure 4: Constructing the final classifier

The reason that we run SVM iteratively (lines 3-5) is that the reliable negative set RN from step 1 may not be sufficiently large to build the best classifier. SVM classifiers ($S_i$ in line 3) can be used to iteratively extract more negative documents from Q. The iteration stops when there is no negative document that can be extracted from Q (line 5).

There is, however, a danger in running SVM iteratively. Since SVM is very sensitive to noise, if some iteration of SVM goes wrong and extracts many positive documents from Q and puts them in the negative set RN, then the last SVM classifier will be extremely poor. This is the problem with PEBL, which also runs SVM iteratively. In our algorithm, we decide whether to use the first SVM classifier or the last one (line 7). Basically, we use the SVM classifier at convergence (called $S_{last}$, line 6) to classify the positive set P. If too many (> 5%) positive documents in P are classified as negative, it indicates that SVM has gone wrong. We then use the first SVM classifier ($S_1$). Otherwise, we use $S_{last}$ as the final classifier. We choose 5% as the threshold because we want to be very conservative. Since SVM is very sensitive to noise, when noise is large, the resulting classifier is often very poor. Since our first classifier is always quite strong, even without catching the last classifier, which may be better, this is acceptable.

Note that this strategy is not applicable to PEBL because PEBL's step 1 extracts too few negative documents from U. Its first SVM classifier is thus very inaccurate. Step 1 of our proposed technique is able to extract a large number of negative documents from U. Hence, our first SVM classifier is always quite good, although it may not be the best.

Note also that neither the first nor the last SVM classifier may be the best. This is the case for both PEBL and our algorithm. In many cases, a classifier somewhere in the middle is the best. However, it is hard to catch it. We leave this issue to our future research.
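A minimal sketch of the Figure 4 loop is given below, using scikit-learn's linear SVM as a stand-in for the SVMlight system used in the paper and the tf-idf matrices from the earlier sketches; the 5% threshold follows the text, while the function name and data layout are illustrative assumptions.

```python
# A minimal sketch of Step 2 (Figure 4): iterative SVM plus final-classifier selection.
# P, RN, Q are dense tf-idf matrices; returns the selected classifier (S_1 or S_last).
import numpy as np
from sklearn.svm import LinearSVC

def iterative_svm(P, RN, Q, threshold=0.05):
    first_clf = None
    while True:
        X = np.vstack([P, RN])
        y = np.hstack([np.ones(len(P)), -np.ones(len(RN))])   # lines 1-2 of Figure 4
        clf = LinearSVC().fit(X, y)                            # line 3: train S_i
        if first_clf is None:
            first_clf = clf                                    # remember S_1
        if len(Q) == 0:
            break
        pred = clf.predict(Q)                                  # line 4: classify Q
        W = Q[pred == -1]
        if len(W) == 0:                                        # line 5: no more negatives found
            break
        RN = np.vstack([RN, W])                                # RN = RN ∪ W
        Q = Q[pred != -1]                                      # Q = Q − W
    # Lines 6-7: fall back to S_1 if S_last misclassifies too many known positives.
    if np.mean(clf.predict(P) == -1) > threshold:
        return first_clf
    return clf
```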
4 Empirical Evaluation

This section evaluates the proposed techniques using the Reuters-21578 text collection¹, which is commonly used in evaluating text classification methods. We will also compare the proposed techniques with S-EM and PEBL.

¹ http://www.research.att.com/~lewis/reuters21578.html

The Reuters-21578 collection contains 21578 text documents. We used only the most populous 10 out of the 135 topic categories. Table 1 gives the number of documents in each of the ten topic categories. In our experiments, each category is used as the positive class, and the rest of the categories as the negative class. This gives us 10 datasets.

Table 1: Categories and numbers of documents in Reuters

  acq   corn  crude  earn  grain  interest  money  ship  trade  wheat
  2369  238   578    3964  582    478       717    286   486    283

Our task is to identify positive documents from the unlabeled set. The construction of each dataset for our experiments is done as follows. Firstly, we randomly select a% of the documents from the positive class and the negative class, and put them into P and N respectively. The remaining (1 − a%) of the documents from both classes form the unlabeled set U. Our training set for each dataset consists of P and U. N is not used, as we do not want to change the class distribution of the unlabeled set U. We also used U as the test set in our experiments because our objective is to recover those positive documents in U. We do not need separate test sets. In our experiments, we use 7 values of a to create different settings, i.e., 5%, 15%, 25%, 35%, 45%, 55% and 65%.
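Purely as an illustration of the dataset construction just described (not code from the paper), the snippet below splits one labeled category into P, N and U for a given a; the function name and random seed are our own choices.

```python
# Illustrative sketch: build P, N (unused) and U from one labeled category for a given a.
import random

def build_pu_split(pos_docs, neg_docs, a=0.15, seed=0):
    rnd = random.Random(seed)
    pos, neg = list(pos_docs), list(neg_docs)
    rnd.shuffle(pos)
    rnd.shuffle(neg)
    n_pos, n_neg = int(a * len(pos)), int(a * len(neg))
    P = pos[:n_pos]                      # a% of the positive class
    N = neg[:n_neg]                      # a% of the negative class (not used for training)
    U = pos[n_pos:] + neg[n_neg:]        # remaining (1 - a%) documents form U (also the test set)
    return P, N, U
```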
4.1 Evaluation measures

Two popular measures used for evaluating text classification are the F value and the breakeven point. The F value is defined as F = 2pr/(p + r), where p is the precision and r is the recall. The F value measures the performance of a system on a particular class (in our case, the positive class). The breakeven point is the value at which recall and precision are equal. However, the breakeven point is not suitable for our task as it only evaluates the sorting order of class probabilities of documents. It does not give an indication of classification performance. The F value, on the other hand, reflects an average effect of both precision and recall. This is suitable for our purpose, as we want to identify positive documents. It is undesirable to have either too small a precision or too small a recall.

In our experimental results, we also report accuracies. It should be noted, however, that accuracy does not fully reflect the performance of our system, as our datasets have a large proportion of negative documents. We believe this reflects realistic situations. In such cases, the accuracy can be high, but few positive documents may be identified.
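As a small, purely illustrative example of the two measures (not code from the paper), the following computes precision, recall, the F value on the positive class, and accuracy from +1/−1 label vectors.

```python
# Illustrative sketch: F value (positive class) and accuracy from true and predicted labels.
import numpy as np

def f_value_and_accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == -1))
    fn = np.sum((y_pred == -1) & (y_true == 1))
    p = tp / (tp + fp) if tp + fp > 0 else 0.0       # precision
    r = tp / (tp + fn) if tp + fn > 0 else 0.0       # recall
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0    # F = 2pr/(p + r)
    acc = float(np.mean(y_true == y_pred))
    return f, acc
```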
4.2 Experimental results

We now present the experimental results. We use Roc-SVM and Roc-Clu-SVM to denote the classification techniques that employ Rocchio and Rocchio with clustering, respectively, to extract the reliable negative set (both methods use SVM for classifier building). We observe from our experiments that using Rocchio or Rocchio with clustering alone for classification does not provide good classification results. SVM improves the results significantly.

For comparison, we include the classification results of NB, S-EM and PEBL. Here, NB treats all the documents in the unlabeled set as negative. SVM for the noisy situation (U as negative) performs poorly because SVM does not tolerate noise well. Thus, its results are not listed (in many cases, its F values are close to 0). In both Roc-SVM and Roc-Clu-SVM, we used the linear SVM, as [Yang & Liu, 1999] reports that linear SVM gives slightly better results than non-linear models on the Reuters dataset. For our experiments, we implemented PEBL as it is not publicly available. For SVM, we used the SVMlight system [Joachims, 1999]; PEBL also used SVMlight. S-EM is our earlier system. It is publicly available at http://www.cs.uic.edu/~liub/S-EM/S-EM-download.html.

Table 2 shows the classification results of various techniques in terms of F value and accuracy (Acc) for a = 15% (the positive set is small). The final row of the table gives the average result of each column. We used 10 clusters (i.e., k = 10) for k-means clustering in Roc-Clu-SVM (later we will see that the number of clusters does not matter much). We observe that Roc-Clu-SVM produces better results than Roc-SVM. Both Roc-SVM and Roc-Clu-SVM outperform NB, S-EM and PEBL. PEBL is extremely poor in this case. In fact, PEBL performs poorly when the number of positive documents is small. When the number of positive documents is large, it usually performs well (see Table 3 with a = 45%). Both Roc-SVM and Roc-Clu-SVM perform well consistently. We summarize the average F value results of all a settings in Figure 5. Due to space limitations, we are unable to show the accuracy results (as noted earlier, accuracy does not fully reflect the performance of our system).

Table 2: Experiment results for a = 15%

            NB            S-EM          PEBL          Roc-SVM       Roc-Clu-SVM
            F      Acc    F      Acc    F      Acc    F      Acc    F      Acc
  acq       0.744  0.920  0.876  0.954  0.001  0.817  0.846  0.948  0.867  0.952
  corn      0.452  0.983  0.452  0.983  0.000  0.982  0.804  0.993  0.822  0.995
  crude     0.731  0.979  0.820  0.984  0.000  0.955  0.782  0.983  0.801  0.985
  earn      0.910  0.949  0.947  0.968  0.000  0.693  0.858  0.924  0.891  0.959
  grain     0.728  0.977  0.807  0.982  0.020  0.955  0.845  0.987  0.869  0.986
  interest  0.609  0.972  0.648  0.970  0.000  0.963  0.704  0.976  0.724  0.983
  money     0.754  0.974  0.793  0.975  0.000  0.945  0.768  0.973  0.785  0.973
  ship      0.701  0.989  0.742  0.990  0.008  0.978  0.578  0.986  0.596  0.989
  trade     0.627  0.977  0.698  0.979  0.000  0.962  0.759  0.983  0.778  0.985
  wheat     0.579  0.982  0.611  0.979  0.000  0.978  0.834  0.994  0.854  0.997
  Avg       0.684  0.970  0.739  0.976  0.003  0.923  0.778  0.975  0.799  0.980

Table 3: Experiment results for a = 45%

            NB            S-EM          PEBL          Roc-SVM       Roc-Clu-SVM
            F      Acc    F      Acc    F      Acc    F      Acc    F      Acc
  acq       0.802  0.934  0.891  0.959  0.891  0.963  0.905  0.964  0.909  0.965
  corn      0.502  0.972  0.517  0.970  0.663  0.991  0.635  0.990  0.645  0.992
  crude     0.833  0.985  0.850  0.986  0.798  0.984  0.811  0.985  0.811  0.985
  earn      0.924  0.956  0.950  0.970  0.956  0.974  0.886  0.937  0.923  0.956
  grain     0.768  0.975  0.772  0.975  0.900  0.992  0.903  0.992  0.903  0.992
  interest  0.617  0.959  0.614  0.956  0.770  0.983  0.614  0.957  0.616  0.957
  money     0.751  0.969  0.760  0.968  0.714  0.973  0.764  0.969  0.764  0.970
  ship      0.791  0.991  0.806  0.991  0.672  0.989  0.829  0.993  0.843  0.994
  trade     0.678  0.973  0.693  0.972  0.728  0.982  0.728  0.982  0.728  0.982
  wheat     0.595  0.973  0.595  0.972  0.783  0.992  0.779  0.992  0.792  0.994
  Avg       0.726  0.969  0.745  0.972  0.788  0.982  0.785  0.976  0.793  0.979

Figure 5: Average F value results for all a settings (x-axis: a from 5% to 65%; curves: NB, S-EM, PEBL, Roc-SVM, Roc-Clu-SVM)

From Figure 5, we can draw the following conclusions:
1. S-EM's results are quite consistent under different settings. However, its results are worse than those of Roc-SVM and Roc-Clu-SVM. The reason is that the negative documents extracted from U by its spy technique are not that reliable, and S-EM uses a weaker classifier, NB.
2. PEBL's results are extremely poor when the number of positive documents is small. We believe that this is because its strategy of extracting the initial set of strong negative documents could easily go wrong without sufficient positive data. Even when the number of positive documents is large, it may also go wrong. For example, for a = 55%, one F value (for the dataset trade) is only 0.073. This shows that PEBL is not robust.
3. Both Roc-SVM and Roc-Clu-SVM are robust with different numbers of positive documents. This is important
because in practice one does not know how many positive documents are sufficient. Using a smaller set of positive documents also reduces the manual labeling effort.

From Figure 5, we also observe that Roc-Clu-SVM is slightly better than Roc-SVM in both F value and accuracy, especially when a is small, because a pure RN is more important in such cases. When a is large, their results are similar. In practice, Roc-SVM may be sufficient due to its simplicity, efficiency and good results.

We also ran one-class SVM using the LIBSVM package (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). It gives very poor results (mostly with F values less than 0.4). Due to space limitations, we do not list these results here.

Comparison of different numbers of clusters: As discussed above, the number of clusters used in Roc-Clu-SVM makes little difference to the final classifier. We now provide experimental results in Figure 6 to support this claim by using different numbers of clusters (here, a = 25%; for other a values, the results are similar). The cluster number k varies from 2 to 30. We observe that the choice of k has little influence on the results as long as it is not too small.

Figure 6: Averaged F values for different numbers of clusters (a = 25%)

Execution times: Our technique consists of two steps: extracting negative documents from U and iteratively running SVM. As SVM is a standard technique, we will not discuss it here. If we only use Rocchio for the first step, it is very efficient because Rocchio is a linear algorithm, O(|U∪P|). If we use Rocchio with clustering, more time is needed because of k-means clustering. The time complexity of k-means is O(k*|RN|*I), where k is the number of clusters, |RN| is the size of the reliable negative set, and I is the number of iterations. Since k and I are normally small, k-means is generally regarded as a linear algorithm, O(|RN|). Thus, the time complexity of the extraction step of Roc-Clu-SVM is O(|U∪P|) since |RN| < |U∪P|. In our experiments, every dataset takes less than 7 seconds for step one of Roc-Clu-SVM (on a Pentium 800 MHz PC with 256 MB memory).

5 Conclusions

This paper studied the problem of text classification with only partial information, i.e., with only one class of labeled documents and a set of unlabeled documents. An effective technique is proposed to solve the problem. Our algorithm first utilizes the Rocchio classifier and/or clustering to extract a set of reliable negative documents from the unlabeled set, and then builds an SVM classifier iteratively. Experimental results show that the proposed technique is more robust than both S-EM and PEBL.

Acknowledgement: This research is supported by a research grant from A-STAR and NUS to the second author.

References

[Basu et al., 2002] S. Basu, A. Banerjee, & R. Mooney. Semi-supervised clustering by seeding. ICML-2002, 2002.
[Blum & Mitchell, 1998] A. Blum & T. Mitchell. Combining labeled & unlabeled data with co-training. COLT-98, 1998.
[Bockhorst & Craven, 2002] J. Bockhorst & M. Craven. Exploiting relations among concepts to acquire weakly labeled training data. ICML-2002, 2002.
[Buckley et al., 1994] C. Buckley, G. Salton, and J. Allan. The effect of adding relevance information in a relevance feedback environment. SIGIR-94, 1994.
[Dempster et al., 1977] A. Dempster, N. Laird & D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 1977.
[Denis, 1998] F. Denis. PAC learning from positive statistical queries. ALT-98, pages 112-126, 1998.
[Goldman & Zhou, 2000] S. Goldman & Y. Zhou. Enhancing supervised learning with unlabeled data. ICML-2000, 2000.
[Joachims, 1999] T. Joachims. Making large-scale SVM learning practical. Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges and A. Smola (eds.), MIT Press, 1999.
[Kearns, 1998] M. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45:983-1006, 1998.
[Letouzey et al., 2000] F. Letouzey, F. Denis, R. Gilleron, & E. Grappa. Learning from positive and unlabeled examples. ALT-00, 2000.
[Lewis & Gale, 1994] D. Lewis & W. Gale. A sequential algorithm for training text classifiers. SIGIR-94, 1994.
[Liu et al., 2002] B. Liu, W. Lee, P. Yu, & X. Li. Partially supervised classification of text documents. ICML-2002, 2002.
[Muslea et al., 2002] I. Muslea, S. Minton & C. Knoblock. Active + semi-supervised learning = robust multi-view learning. ICML-2002, 2002.
[Muggleton, 2001] S. Muggleton. Learning from the positive data. Machine Learning, 2001, to appear.
[Nigam et al., 1998] K. Nigam, A. McCallum, S. Thrun, & T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 2000.
[Rocchio, 1971] J. Rocchio. Relevant feedback in information retrieval. In G. Salton (ed.), The SMART retrieval system: experiments in automatic document processing, Englewood Cliffs, NJ, 1971.
[Salton & McGill, 1983] G. Salton & M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[Scholkopf et al., 1999] B. Scholkopf, J. Platt, J. Shawe, A. Smola & R. Williamson. Estimating the support of a high-dimensional distribution. Technical Report MSR-TR-99-87, Microsoft Research, 1999.
[Vapnik, 1995] V. Vapnik. The nature of statistical learning theory, 1995.
[Yang & Liu, 1999] Y. Yang & X. Liu. A re-examination of text categorization methods. SIGIR-99, 1999.
[Yu et al., 2002] H. Yu, J. Han, & K. Chang. PEBL: Positive example based learning for Web page classification using SVM. KDD-02, 2002.