IJCSNS International Journal of Computer Science and Network Security, VOL.6 No.5A, May 2006

We assumed that the weight of each feature would be the same:

    a_j(x) = 1 if the j-th word is in document x
    a_j(x) = 0 otherwise

3.1.2 SVM

The formulation of SVM is constructed starting from a simple linear maximum margin classifier [1]. A general linear SVM can be expressed as Eq. 1.

    f(x) = w \cdot x - b    (1)

where f(x) is the output of the SVM, x is the input document, b is a threshold, w = \sum_i \alpha_i y_i x_i, x_i is a stored training document, y_i \in \{-1, +1\} is the desired output of the classifier, and \alpha_i are weights. The margin of this linear classifier is 1 / \|w\|. Hence the training problem is to minimize \|w\| subject to the constraint of Eq. 1.

The linear formulation cannot classify nonlinearly separable documents. SVMs get around this problem by mapping the sample points into a higher-dimensional space using a kernel function. A general non-linear SVM can be expressed as Eq. 2.

    f(x) = \sum_i \alpha_i y_i K(x_i, x) - b    (2)

where K is a kernel function that measures the similarity of a stored training document x_i to the input x, y_i \in \{-1, +1\} is the desired output of the classifier, b is a threshold, and \alpha_i are weights that blend the different kernels.

The formulation of SVM was based on a two-class problem; hence SVM is basically a binary classifier. Several different schemes can be applied to the basic SVM algorithm to handle the n-category classification problem. One such scheme is the one-versus-rest approach, which works by constructing a set of n binary classifiers for an n-category problem. The k-th classifier is trained with all of the documents in the k-th category with positive labels, and all other documents with negative labels. The final output is the category that corresponds to the classifier with the highest output value:

    f(x) = \arg\max_k \left( \sum_i \alpha_i^k y_i K_k(x_i, x) - b \right)
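To make the one-versus-rest decision rule concrete, the following sketch combines the n binary decision functions of Eq. 2 and returns the category whose classifier produces the highest output value. It is only an illustrative sketch: the Gaussian (RBF) kernel, the gamma value, and all function names are assumptions made for this example, not prescribed by the paper.

    import numpy as np

    def rbf_kernel(xi, x, gamma=0.5):
        # Example similarity K(xi, x) between a stored training document xi and the
        # input document x (both bag-of-words vectors); the kernel choice is assumed.
        diff = xi - x
        return np.exp(-gamma * np.dot(diff, diff))

    def decision_value(x, support_docs, labels, alphas, b, kernel=rbf_kernel):
        # Eq. 2 for one binary classifier: f(x) = sum_i alpha_i * y_i * K(x_i, x) - b.
        return sum(a * y * kernel(xi, x)
                   for a, y, xi in zip(alphas, labels, support_docs)) - b

    def one_versus_rest_predict(x, classifiers):
        # 'classifiers' holds one tuple (support_docs, labels, alphas, b) per category;
        # the k-th classifier was trained with category k as positive and the rest negative.
        scores = [decision_value(x, docs, ys, alphas, b)
                  for docs, ys, alphas, b in classifiers]
        # The predicted category is the index of the highest-scoring classifier.
        return int(np.argmax(scores))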
4. An Algorithm for Feature Selection

Feature selection techniques are needed to identify the most suitable words for classifying documents, and to reduce the computational cost of classifying new documents. In Papits, automatic classification needs to classify documents into the multivalued category, because research is organized into various fields. However, feature selection becomes sensitive to noise and irrelevant data compared to cases with few categories. There may also not be enough registered papers as training data to identify the most suitable words for classifying into the multivalued category in Papits. We propose a feature selection method to classify documents, each represented as a bag of words, into the multivalued category.

Several existing feature selection techniques use some metric to determine the relevance of a term with regard to the classification criterion. IG is often used in text classification with the bag-of-words approach [3][7][10]:

    IG(A, X) = -\sum_{c \in C} \frac{|X_c|}{|X|} \log_2 \frac{|X_c|}{|X|} - \left( -\sum_{v \in Values(A)} \sum_{c \in C} \frac{|X_{c,v}|}{|X|} \log_2 \frac{|X_{c,v}|}{|X_v|} \right)

where C is the set of all categories and each category is denoted as c. A is an arbitrary feature (word), v is a value of A, and the set of values of feature A is denoted as Values(A) = {0, 1}: if feature A occurs in a document, then v is 1; otherwise v is 0. X denotes the set of all documents, and X_c, X_v, and X_{c,v} denote the sets of documents that belong to category c, that take feature value v, and that belong to category c while also taking feature value v, respectively. |X| denotes the number of elements of set X.
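Read directly, the formula is the usual decomposition IG(A, X) = H(C) - H(C|A): the entropy of the category distribution minus the conditional entropy of the categories given the binary word feature. The sketch below computes it for one word over a labeled collection; the function name and the representation of X as (word set, category) pairs are assumptions made for the example.

    import math
    from collections import Counter

    def information_gain(word, documents):
        # documents: list of (set_of_words, category) pairs.
        # The feature A takes value v = 1 if 'word' occurs in a document, else v = 0.
        n = len(documents)
        cat_counts = Counter(c for _, c in documents)

        # H(C): -sum_c (|Xc|/|X|) * log2(|Xc|/|X|)
        entropy = -sum((nc / n) * math.log2(nc / n) for nc in cat_counts.values())

        # H(C|A): -sum_v sum_c (|Xc,v|/|X|) * log2(|Xc,v|/|Xv|)
        conditional = 0.0
        for v in (0, 1):
            cats_v = [c for words, c in documents if (word in words) == bool(v)]
            nv = len(cats_v)
            if nv == 0:
                continue
            for ncv in Counter(cats_v).values():
                conditional -= (ncv / n) * math.log2(ncv / nv)

        return entropy - conditional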
[Fig. 3. Transformation of the multivalued category into a binary category: the categories c1, c2, ..., ci, ..., cj, ..., cn are grouped into two sets, C_A and C_B.]

In this formula, as the number of elements of C increases, documents are divided into more categories. Hence, the IG value becomes sensitive to noise and irrelevant data. Our method transforms the multivalued category into a binary category, increases the number of data in one category, and performs feature selection using IG. Figure 3 presents the idea behind this method with the set of categories {c1, c2, ..., ci, ..., cj, ..., cn}. If suitable words to classify documents into various combinations of categories are found, since a document category is