1. Perform the algorithm in Figure 1 and generate the initial negative set RN;
2. Choose k initial cluster centers {O1, O2, ..., Ok} randomly from RN;
3. Perform k-means clustering to produce k clusters, i.e., N1, N2, ..., Nk;
4. for j = 1 to k
5.     $\vec{p}_j = \alpha \frac{1}{|P|}\sum_{\vec{d} \in P}\frac{\vec{d}}{\|\vec{d}\|} - \beta \frac{1}{|N_j|}\sum_{\vec{d} \in N_j}\frac{\vec{d}}{\|\vec{d}\|}$;
6.     $\vec{n}_j = \alpha \frac{1}{|N_j|}\sum_{\vec{d} \in N_j}\frac{\vec{d}}{\|\vec{d}\|} - \beta \frac{1}{|P|}\sum_{\vec{d} \in P}\frac{\vec{d}}{\|\vec{d}\|}$;
7. RN′ = {};
8. for each document di ∈ RN do
9.     Find the nearest positive prototype vector pv to di, where v = argmax_j sim(pj, di);
10.    if there exists an nj (j = 1, 2, ..., k) s.t. sim(pv, di) <= sim(nj, di) then
11.        RN′ = RN′ ∪ {di};

Figure 3: Rocchio extraction with clustering.

After RN′ is determined, we can build the final classifier using SVM, which takes P and RN′ as the positive and negative training data respectively.
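To make the extraction step in Figure 3 concrete, the following is a minimal sketch in Python, assuming TF-IDF document matrices, scikit-learn's KMeans, and cosine similarity (dot products of length-normalized vectors) as sim. The function name, the default k, and the Rocchio weights alpha = 16 and beta = 4 are illustrative assumptions, not values prescribed by the text above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize


def rocchio_extraction_with_clustering(P, RN, k=10, alpha=16.0, beta=4.0):
    """Sketch of Figure 3: refine the reliable negative set RN into RN'.

    P and RN are dense TF-IDF matrices (one row per document).
    Returns the row indices of RN that are kept in RN'.
    """
    P_unit = normalize(P)    # d / ||d|| for the positive documents
    RN_unit = normalize(RN)  # d / ||d|| for the reliable negatives

    # Lines 2-3: cluster RN into N_1, ..., N_k with k-means.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(RN_unit)

    # Lines 4-6: one positive and one negative Rocchio prototype per cluster.
    p_mean = P_unit.mean(axis=0)
    pos_protos, neg_protos = [], []
    for j in range(k):
        n_mean = RN_unit[labels == j].mean(axis=0)
        pos_protos.append(alpha * p_mean - beta * n_mean)  # p_j
        neg_protos.append(alpha * n_mean - beta * p_mean)  # n_j
    pos_protos = normalize(np.vstack(pos_protos))
    neg_protos = normalize(np.vstack(neg_protos))

    # Lines 7-11: keep d_i only if some negative prototype is at least as
    # similar to it as its nearest positive prototype (cosine similarity).
    sim_pos = RN_unit @ pos_protos.T
    sim_neg = RN_unit @ neg_protos.T
    keep = sim_neg.max(axis=1) >= sim_pos.max(axis=1)
    return np.flatnonzero(keep)
```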
3.2 Step 2: Classifier building

This step builds the final classifier by running SVM iteratively with the document sets P and RN or RN′, depending on the method used in step 1. Let Q be the set of the remaining unlabeled documents, Q = U − RN (or U − RN′). The algorithm for this second step is given in Figure 4.

1. Every document in P is assigned the class label +1;
2. Every document in RN (RN′) is assigned the label −1;
3. Use P and RN (or RN′) to train an SVM classifier Si, with i = 1 initially and i = i+1 with each iteration (lines 3-5);
4. Classify Q using Si. Let the set of documents in Q that are classified as negative be W;
5. if W = {} then stop;
   else Q = Q − W; RN = RN ∪ W (or RN′ = RN′ ∪ W); goto (3);
6. Use the last SVM classifier Slast to classify P;
7. if more than 5% of the positive documents are classified as negative then use S1 as the final classifier;
   else use Slast as the final classifier;

Figure 4: Constructing the final classifier.

The reason that we run SVM iteratively (lines 3-5) is that the reliable negative set RN from step 1 may not be sufficiently large to build the best classifier. SVM classifiers (Si in line 3) can be used to iteratively extract more negative documents from Q. The iteration stops when there is no negative document that can be extracted from Q (line 5).
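A compact sketch of this loop, including the final-classifier selection of lines 6-7, is given below. It assumes dense feature matrices and scikit-learn's LinearSVC; the function name, the LinearSVC choice, and the 0.05 threshold parameter are assumptions used for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC


def build_final_classifier(P, RN, Q, threshold=0.05):
    """Sketch of Figure 4: iterative SVM training with P (positive) and RN (negative).

    P, RN, Q: feature matrices for the positive set, the reliable negative
    set, and the remaining unlabeled documents Q = U - RN.
    """
    first_clf = None
    clf = None
    while True:
        # Lines 1-3: label P as +1 and RN as -1, then train the next SVM S_i.
        X = np.vstack([P, RN])
        y = np.hstack([np.ones(len(P), dtype=int), -np.ones(len(RN), dtype=int)])
        clf = LinearSVC().fit(X, y)
        if first_clf is None:
            first_clf = clf                  # remember S_1 for line 7

        # Lines 4-5: move the documents of Q that are classified negative into RN.
        if len(Q) == 0:
            break
        pred = clf.predict(Q)
        W = Q[pred == -1]
        if len(W) == 0:                      # W = {} -> stop
            break
        RN = np.vstack([RN, W])
        Q = Q[pred == 1]

    # Lines 6-7: if S_last misclassifies too many positives, fall back to S_1.
    last_clf = clf
    if np.mean(last_clf.predict(P) == -1) > threshold:
        return first_clf
    return last_clf
```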
There is, however, a danger in running SVM iteratively. Since SVM is very sensitive to noise, if some iteration of SVM goes wrong and extracts many positive documents from Q and puts them in the negative set RN, then the last SVM classifier will be extremely poor. This is the problem with PEBL, which also runs SVM iteratively. In our algorithm, we decide whether to use the first SVM classifier or the last one (line 7). Basically, we use the SVM classifier at convergence (called Slast, line 6) to classify the positive set P. If too many (> 5%) positive documents in P are classified as negative, it indicates that SVM has gone wrong, and we then use the first SVM classifier (S1). Otherwise, we use Slast as the final classifier. We choose 5% as the threshold because we want to be very conservative: since SVM is very sensitive to noise, when the noise is large the resulting classifier is often very poor. Since our first classifier is always quite strong, using it is acceptable even if we miss a later classifier that may be better.

Note that this strategy is not applicable to PEBL because PEBL's step 1 extracts too few negative documents from U, so its first SVM classifier is very inaccurate. Step 1 of our proposed technique is able to extract a large number of negative documents from U. Hence, our first SVM classifier is always quite good, although it may not be the best.

Note also that neither the first nor the last SVM classifier may be the best. This is the case for both PEBL and our algorithm. In many cases, a classifier somewhere in the middle is the best. However, it is hard to catch it. We leave this issue to our future research.

4 Empirical Evaluation

This section evaluates the proposed techniques using the Reuters-21578 text collection (http://www.research.att.com/~lewis/reuters21578.html), which is commonly used in evaluating text classification methods. We will also compare the proposed techniques with S-EM and PEBL.

The Reuters-21578 collection contains 21578 text documents. We used only the most populous 10 out of the 135 topic categories. Table 1 gives the number of documents in each of the ten topic categories. In our experiments, each category is used as the positive class, and the rest of the categories as the negative class. This gives us 10 datasets.

Table 1: Categories and numbers of documents in Reuters

acq    corn   crude  earn   grain  interest  money  ship   trade  wheat
2369   238    578    3964   582    478       717    286    486    283

Our task is to identify positive documents from the unlabeled set. The construction of each dataset for our experiments is done as follows. Firstly, we randomly select a% of the documents from the positive class and the negative class, and put them into P and N respectively. The remaining (1 - a%) documents from both classes form the unlabeled set U. Our training set for each dataset consists of P and U. N is not used because we do not want to change the class distribution of the unlabeled set U. We also used U as the test set in our experiments because our objective is to recover those positive documents in U; we do not need separate test sets. In our experiments, we use 7 values of a to create different settings, i.e., 5%, 15%, 25%, 35%, 45%, 55% and 65%.
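The dataset construction described above is simple to reproduce. The following is a small sketch, assuming the documents of one Reuters category have already been separated into positive and negative lists; the function name and the use of Python's random module are illustrative assumptions.

```python
import random


def make_pu_dataset(pos_docs, neg_docs, a, seed=0):
    """Sketch of the experimental setup: a% of each class goes into P and N;
    the remaining documents of both classes form the unlabeled set U.
    a is a fraction, e.g. 0.05 for the 5% setting."""
    rng = random.Random(seed)
    pos, neg = pos_docs[:], neg_docs[:]
    rng.shuffle(pos)
    rng.shuffle(neg)

    n_pos, n_neg = int(a * len(pos)), int(a * len(neg))
    P = pos[:n_pos]                  # labeled positive training set
    N = neg[:n_neg]                  # held out, not used for training
    U = pos[n_pos:] + neg[n_neg:]    # unlabeled set, also used as the test set
    return P, N, U


# e.g. the 5% setting for one category (document lists are assumed):
# P, N, U = make_pu_dataset(acq_docs, non_acq_docs, a=0.05)
```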