2 Related Work

Many studies on text classification have been conducted in the past. Existing techniques include Rocchio, naive Bayes (NB), SVM, and many others [e.g., Rocchio, 1971; Lewis & Gale, 1994; Vapnik, 1995; Joachims, 1999; Nigam et al., 1998; Yang & Liu, 1999]. These existing techniques, however, all require labeled training data of all classes. They are not designed for positive-class-based classification.

In [Denis, 1998], a theoretical study of PAC learning from positive and unlabeled examples under the statistical query model [Kearns, 1998] is reported. [Letouzey et al., 2000] presents an algorithm for learning using a modified decision tree algorithm based on the statistical query model. It is not designed for text classification, but for normal data. In [Muggleton, 2001], Muggleton studied the problem in a Bayesian framework where the distributions of functions and examples are assumed known. The result is similar to that in [Liu et al., 2002] in the noiseless case. [Liu et al., 2002] extends the result to the noisy case. The purpose of this paper is to propose a robust and practical method to improve existing techniques.

In [Liu et al., 2002], Liu et al. proposed a method (called S-EM) to solve the problem in the text domain. It is based on naive Bayesian classification (NB) and the EM algorithm [Dempster et al., 1977]. The main idea of the method is to first use a spy technique to identify some reliable negative documents from the unlabeled set. It then runs EM to build the final classifier. Since NB is not a strong classifier for texts, we will show that the proposed technique is much more accurate than S-EM.

In [Yu et al., 2002], Yu et al. proposed an SVM-based technique (called PEBL) to classify Web pages given positive and unlabeled pages.
The core idea is the same as that in [Liu et al., 2002], i.e., (1) identifying a set of reliable negative documents from the unlabeled set (called strong negative documents in PEBL), and (2) building a classifier using SVM. The difference is that the two techniques use different methods for both steps. In [Yu et al., 2002], strong negative documents are those documents that do not contain any features of the positive data. After a set of strong negative documents is identified, SVM is applied iteratively to build a classifier. PEBL is sensitive to the number of positive examples. When the positive set is small, the results are often very poor. The proposed method differs from PEBL in that we perform negative data extraction from the unlabeled set using the Rocchio method and clustering. Although our second step also runs SVM iteratively to build a classifier, there is a key difference: our technique selects a good classifier from a set of classifiers built by SVM, while PEBL does not. The proposed method performs well consistently under a variety of conditions.

In [Scholkopf et al., 1999], one-class SVM is proposed. This technique uses only positive data to build an SVM classifier. However, our experimental results show that its classification performance is poor.

Another line of related work is the use of a small labeled set of every class and a large unlabeled set for learning [Nigam et al., 1998; Blum & Mitchell, 1998; Goldman & Zhou, 2000; Basu et al., 2002; Muslea et al., 2002; Bockhorst & Craven, 2002]. It has been shown that the unlabeled data helps classification. This is different from our work, as we do not have any labeled negative training data.

3 The Proposed Technique

This section presents the proposed approach, which also consists of two steps: (1) extracting some reliable negative documents from the unlabeled set, and (2) applying SVM iteratively to build a classifier. Below, we present these two steps in turn. In Section 4, we show that our technique is much more robust than existing techniques.

3.1 Finding reliable negative documents

This sub-section gives two methods for finding reliable negative documents: (1) Rocchio and (2) Rocchio with clustering. The second method often results in better classifiers, especially when the positive set is small. The key requirement for this step is that the identified negative documents from the unlabeled set must be reliable or pure, i.e., contain no or very few positive documents, because SVM (used in our second step) is very sensitive to noise.

3.1.1 Method 1: Rocchio

This method treats the entire unlabeled set U as negative documents and then uses the positive set P and U as the training data to build a Rocchio classifier. The classifier is then used to classify U. Those documents that are classified as negative are considered (reliable) negative data, denoted by RN. The algorithm is shown in Figure 1.

1. Assign the unlabeled set U the negative class, and the positive set P the positive class;
2. Let c⁺ = α · (1/|P|) ∑_{d∈P} d/||d|| − β · (1/|U|) ∑_{d∈U} d/||d||;
3. Let c⁻ = α · (1/|U|) ∑_{d∈U} d/||d|| − β · (1/|P|) ∑_{d∈P} d/||d||;
4. for each document d′ in U do
5.     if sim(c⁺, d′) <= sim(c⁻, d′) then
6.         RN = RN ∪ {d′};

Figure 1: Rocchio extraction using U as negative

In Rocchio classification, each document d is represented as a vector [Salton & McGill, 1983], d = (q1, q2, ..., qn).
Each element qi in d represents a word wi and is calculated as the combination of term frequency (tf) and inverse document frequency (idf), i.e., qi = tfi * idfi. tfi is the number of times that word wi occurs in d, while idfi = log(|D| / df(wi)). Here |D| is the total number of documents and df(wi) is the number of documents in which word wi occurs at least once. Building a classifier is achieved by constructing the positive and negative prototype vectors c⁺ and c⁻ (lines 2 and 3 in Figure 1). The α and β parameters adjust the relative impact of the positive and negative training examples; α = 16 and β = 4 are commonly used values.
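To make the extraction step concrete, the following is a minimal Python sketch of the Figure 1 algorithm. It is an illustration, not the authors' implementation: the function name rocchio_extract_rn and its defaults are our own, and scikit-learn's TfidfVectorizer uses a smoothed idf variant (and L2-normalizes each row, which supplies the d/||d|| terms) rather than the exact log(|D|/df(wi)) formula above.

# Minimal sketch of the Rocchio extraction step (Figure 1).
# Names and parameter defaults are illustrative assumptions, not from the paper.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def rocchio_extract_rn(p_docs, u_docs, alpha=16.0, beta=4.0):
    """Return RN: the documents in u_docs that Rocchio classifies as negative."""
    # Build tf-idf vectors (qi = tfi * idfi). TfidfVectorizer L2-normalizes
    # each row by default, so every row is already d / ||d||.
    X = TfidfVectorizer().fit_transform(list(p_docs) + list(u_docs))
    P, U = X[:len(p_docs)], X[len(p_docs):]

    # Lines 2-3 of Figure 1: positive and negative prototype vectors.
    p_mean = np.asarray(P.mean(axis=0)).ravel()  # (1/|P|) * sum over P of d/||d||
    u_mean = np.asarray(U.mean(axis=0)).ravel()  # (1/|U|) * sum over U of d/||d||
    c_pos = alpha * p_mean - beta * u_mean
    c_neg = alpha * u_mean - beta * p_mean

    # Lines 4-6: cosine similarity. Rows of U are unit-length, so
    # sim(c, d) = (c . d) / ||c||.
    sim_pos = U.dot(c_pos) / (np.linalg.norm(c_pos) + 1e-12)
    sim_neg = U.dot(c_neg) / (np.linalg.norm(c_neg) + 1e-12)

    # A document joins RN when sim(c+, d) <= sim(c-, d).
    return [d for d, sp, sn in zip(u_docs, sim_pos, sim_neg) if sp <= sn]

Calling rocchio_extract_rn(P, U) on raw document strings yields RN, which then serves as the negative training data for the iterative SVM step described above.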