recommended in [Buckley et al., 1994]. In classification, for each test document d′, it simply uses the cosine measure [Salton & McGill, 1983] to compute the similarity (sim) of d′ with each prototype vector. The class whose prototype vector is more similar to d′ is assigned to the test document (lines 4-6 in Figure 1). Those documents classified as negative form the negative set RN. In general, it is also possible to use other classification methods (e.g., naïve Bayesian, SVM) to extract negative data from the unlabeled set. However, not all classification techniques perform well in the absence of a "pure" negative training set. We experimented with SVM, naïve Bayesian and Rocchio, and observed that Rocchio produces the best results. It is able to extract a large number of true negative documents from U with very high precision.

The reason that Rocchio works well is as follows: In our positive class based learning, the unlabeled set U typically has the following characteristics:
1. The proportion of positive documents in U is very small. Thus, they do not affect c− very much.
2. The negative documents in U are of diverse topics.
Thus, in the vector space, they cover a very large region. Since the positive set P is usually homogenous (focusing on one topic), it covers a much smaller region in the vector space.

Assume that there is a true decision surface S that separates positive and negative data. Then, the positive prototype vector c+ will be closer to S than the negative prototype vector c− due to the vector summation in Rocchio (lines 2 and 3 of Figure 1). For this reason, when we apply the similarity computation (in line 5 of Figure 1) to classify documents, many negative documents will be classified as positive because they are closer to c+. This explains why Rocchio can extract negative documents with high precision, and positive documents with low precision (but very high recall). Our experiments show that this is indeed the case.

After some reliable negative documents are found from U by Rocchio, we can run SVM using P and RN. Experimental results show that this simple approach works very well. Note that we could actually use c+ and c− in Figure 1 as the final classifier without using SVM. However, it does not perform well because it tends to classify too many negative documents as positive (low precision for positive). SVM using RN and P as training data can correct the situation and produce a more accurate final classifier.

3.1.2 Method 2: Rocchio with Clustering
Rocchio is a linear classifier based on cosine similarity. When the decision boundary is non-linear or does not conform to the separating plane resulting from cosine similarity, Rocchio may still extract some positive documents and put them in RN. This will damage the SVM classifier later. Below, we propose an enhancement to the Rocchio approach in order to purify RN further, i.e., to discard some likely positive documents from RN.

Figure 2 shows a possible scenario where some positive documents in U may still be in RN.
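The Rocchio extraction of Method 1 (prototype construction and cosine classification, lines 2-6 of Figure 1) can be sketched as follows. This is only an illustrative sketch, not the paper's implementation: documents are assumed to already be tf-idf vectors, and the weights alpha = 16 and beta = 4 are the conventional Rocchio settings, assumed here rather than taken from this text.

```python
import math

def unit(v):
    # Scale a document vector to unit length.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def mean(vectors):
    # Component-wise mean of a list of vectors.
    m = len(vectors)
    return [sum(col) / m for col in zip(*vectors)]

def cosine(a, b):
    # Cosine similarity between two vectors.
    d = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / d if d > 0 else 0.0

def rocchio_prototypes(P, U, alpha=16.0, beta=4.0):
    # Build c+ and c- (lines 2-3 of Figure 1), treating P as positive
    # and the whole unlabeled set U as if it were negative.
    p_mean = mean([unit(d) for d in P])
    u_mean = mean([unit(d) for d in U])
    c_pos = [alpha * p - beta * u for p, u in zip(p_mean, u_mean)]
    c_neg = [alpha * u - beta * p for p, u in zip(p_mean, u_mean)]
    return c_pos, c_neg

def extract_reliable_negatives(P, U):
    # RN: the documents of U that are closer (by cosine) to c- than to c+.
    c_pos, c_neg = rocchio_prototypes(P, U)
    return [d for d in U if cosine(d, c_neg) > cosine(d, c_pos)]
```

With P clustered around one topic direction and U containing a few positives among diverse negatives, the returned RN keeps the negative-like documents and drops the positive-like ones; SVM is then trained on P and RN.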
In Figure 2, c+ and c− represent the positive and negative prototype vectors respectively. H is the decision hyperplane produced by Rocchio. It has the same distance (similarity) to c+ and c−. If a document is located on the right-hand side of H, it is classified as negative; otherwise it is classified as positive. However, the positive and negative classes in the data cannot be separated well by H. The positive documents within regions 1 and 2 will be misclassified as negative documents.

Figure 2: Rocchio classifier is unable to remove some positive examples in U

In order to further "purify" RN, we need to deal with the above problem. We propose the following approach. This approach uses clustering to partition RN into many similarity groups (or clusters). It then uses Rocchio again to build a classifier using each cluster and the positive set P. The classifier is then applied to identify likely positive documents in the cluster and delete them. The idea is to identify and remove some positive documents in RN in a localized manner, which gives better accuracy. Since both the cluster and the positive set are homogeneous, they allow Rocchio to build better prototype vectors.

The clustering technique that we use is k-means, which is an efficient technique. The k-means method needs the input cluster number k. From our experiments in Section 4, we observe that the choice of k does not affect classification results much as long as it is not too small. Too few clusters may not be effective because the documents in a cluster can still be diverse. Then, Rocchio still cannot build an accurate classifier. Too many clusters will not cause much problem. This new method can further purify RN without removing too many negative documents from RN. The detailed algorithm is given in Figure 3.

Note that in line 1 we still perform the initial Rocchio extraction as discussed in Section 3.1.1. Thus, the input data for clustering is RN, not U.
The reason is that the purity of RN is higher, which allows us to produce better clusters without the influence of too many noisy (positive) documents in U. Lines 2-3 perform the standard k-means clustering of RN, which produces k clusters, N1, N2, …, Nk. Based on these clusters, we construct a positive prototype vector pj and a negative prototype vector nj for the positive set P and each cluster Nj (j ∈ {1, 2, …, k}) (lines 5 and 6). Line 8 starts the extraction. For each document di ∈ RN, we try to find the nearest positive prototype vector pv to di (line 9). If the similarity sim(pv, di) is less than the similarity between di and any negative prototype vector nj, we put it in our final negative set RN′ (line 10). The reason for using the procedures in lines 9 and 10 is that we want to be conservative in removing likely positive documents from RN, i.e., not to remove too many negative documents.
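The clustering-based purification just described (lines 2-10 of Figure 3) can be sketched as follows. This is a minimal sketch under stated assumptions: the Rocchio extraction of line 1 is assumed to have already produced RN; the k-means here uses a simplified deterministic seeding (the first k documents) and cosine-based assignment, neither of which is specified by the text; and alpha = 16, beta = 4 are the conventional Rocchio weights.

```python
import math

def unit(v):
    # Scale a document vector to unit length.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def mean(vectors):
    # Component-wise mean of a list of vectors.
    m = len(vectors)
    return [sum(col) / m for col in zip(*vectors)]

def cosine(a, b):
    # Cosine similarity between two vectors.
    d = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / d if d > 0 else 0.0

def kmeans(docs, k, iters=20):
    # Simplified k-means: deterministic seeding with the first k
    # documents and cosine-based assignment (assumptions here).
    centers = [list(d) for d in docs[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for d in docs:
            j = max(range(k), key=lambda j: cosine(d, centers[j]))
            clusters[j].append(d)
        centers = [mean(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return [c for c in clusters if c]  # drop any empty clusters

def rocchio(pos, neg, alpha=16.0, beta=4.0):
    # One Rocchio prototype vector: pos pulled toward, neg pushed away.
    pm = mean([unit(d) for d in pos])
    nm = mean([unit(d) for d in neg])
    return [alpha * p - beta * n for p, n in zip(pm, nm)]

def purify(P, RN, k=2):
    # Cluster RN, build a prototype pair (p_j, n_j) for P and each
    # cluster N_j (lines 5-6), then keep d_i in RN' only when some
    # negative prototype beats its nearest positive prototype (lines 8-10).
    clusters = kmeans(RN, k)
    p_protos = [rocchio(P, Nj) for Nj in clusters]  # p_j: P vs cluster N_j
    n_protos = [rocchio(Nj, P) for Nj in clusters]  # n_j: cluster N_j vs P
    RN_final = []
    for d in RN:
        nearest_pos = max(cosine(d, p) for p in p_protos)          # line 9
        if any(cosine(d, n) > nearest_pos for n in n_protos):      # line 10
            RN_final.append(d)
    return RN_final
```

On a toy RN containing two negative subtopics plus one positive-like document that slipped through the first extraction, purify keeps the negatives and discards the likely positive, matching the conservative intent of lines 9-10.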