predicted by a combination of words, we thought it would be possible to classify each category by combining these words. For example, let us suppose the following case:

• set CA consists of categories ci and cj
• set CB consists of categories other than ci and cj
• set CS consists of categories ci and ck
• set CT consists of categories other than ci and ck
• word wa is suitable to classify CA and CB
• word wb is suitable to classify CS and CT

If a combination of wa and wb can be found, a classifier can classify the original categories ci, cj, and ck. Our feature selection method can be used to locate wa and wb.

V = set of words, sorted by information gain (initially empty)
D = set of documents
C = set of categories
k = arbitrary number of features
l = arbitrary number of categories
IG_{CA,CB}(w, D) : IG of documents D on word w, relative to categories CA and CB
add(V, w, IG) : word w is added to V, sorted by IG value

1: Feature_Selection_Algorithm()
2:   for each word w in D
3:     max = 0
4:     for each combination CA of C, choosing 1 or 2 categories
5:       CB = C - CA
6:       IGvalue = IG_{CA,CB}(w, D)
7:       if (max < IGvalue) then
8:         max = IGvalue
9:     add(V, w, max)
10:  return the k highest-ranked words of V

Fig. 4. Proposed Feature Selection Algorithm

Figure 4 shows the proposed feature selection algorithm. First, the new category CA is a set of at most two categories selected from the set of categories C, and CB is the set of elements of C other than those constituting CA. For every word w, IG is assessed over all such combinations of categories, and the highest IG value is taken as the importance of w. The IG for the new categories {CA, CB} is determined by the following:

\[
IG_{C_A,C_B}(A,X) = -\left(\frac{|X_{C_A}|}{|X|}\log_2\frac{|X_{C_A}|}{|X|} + \frac{|X_{C_B}|}{|X|}\log_2\frac{|X_{C_B}|}{|X|}\right) + \sum_{v \in Values(A)}\left(\frac{|X_{C_A,v}|}{|X|}\log_2\frac{|X_{C_A,v}|}{|X_v|} + \frac{|X_{C_B,v}|}{|X|}\log_2\frac{|X_{C_B,v}|}{|X_v|}\right)
\]

X_{C_A} and X_{C_B} denote the sets of documents belonging to categories CA and CB, and X_{C_A,v} and X_{C_B,v} denote the documents that take feature value v and belong to categories CA and CB, respectively. Finally, the best k words according to this metric are chosen as features.
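To make this procedure concrete, here is a minimal Python sketch of the metric and the selection loop of Fig. 4 over a toy bag-of-words corpus. It is an illustration only, not the authors' code: the names (entropy, binary_ig, select_features) and the document representation as word sets are our own assumptions. Each subset of 1 or 2 categories plays the role of CA, its complement is CB, and a word's importance is its maximum IG over all such splits.

    import math
    from itertools import combinations

    def entropy(pos, neg):
        # Base-2 entropy of a two-way split with pos/neg documents.
        total = pos + neg
        h = 0.0
        for c in (pos, neg):
            if 0 < c < total:
                h -= (c / total) * math.log2(c / total)
        return h

    def binary_ig(word, docs):
        # IG of `word` over docs labelled 1 (in C_A) or 0 (in C_B).
        # `docs` is a list of (word_set, label) pairs; the feature value v
        # is the word's presence or absence in a document.
        n = len(docs)
        pos = sum(label for _, label in docs)
        ig = entropy(pos, n - pos)                    # H(X)
        for present in (True, False):                 # v in Values(A)
            sub = [label for words, label in docs if (word in words) == present]
            if sub:
                p = sum(sub)
                ig -= (len(sub) / n) * entropy(p, len(sub) - p)
        return ig

    def select_features(corpus, k):
        # Fig. 4: a word's importance is its maximum IG over all splits
        # {C_A, C_B}, where C_A is any 1 or 2 categories and C_B the rest.
        categories = sorted({c for _, c in corpus})
        splits = [set(s) for r in (1, 2) for s in combinations(categories, r)]
        vocab = set().union(*(words for words, _ in corpus))
        importance = {}
        for w in vocab:
            importance[w] = max(
                binary_ig(w, [(ws, 1 if c in ca else 0) for ws, c in corpus])
                for ca in splits)
        return sorted(importance, key=importance.get, reverse=True)[:k]

    # Toy usage: (word set, category) pairs standing in for abstracts.
    corpus = [({"robot", "vision"}, "Robotics"),
              ({"proof", "logic"}, "Theorem Proving"),
              ({"robot", "planning"}, "Planning")]
    print(select_features(corpus, k=2))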
4. Evaluation

4.1 Experimental setting

This section evaluates the performance of our algorithm by measuring its ability to reproduce manual category assignments on a data set. We will now describe the data set and the method of evaluation.

The data set is a set of papers from the IJCAI'01 proceedings. We used 188 papers whose titles, authors, and abstracts had been extracted from PDF files. These papers had been manually indexed by category (14 categories). Each category corresponds to a section of the IJCAI'01 proceedings: Knowledge Representation and Reasoning; Search, Satisfiability, and Constraint Satisfaction Problems; Cognitive Modeling; Planning; Diagnosis; Logic Programming and Theorem Proving; Uncertainty and Probabilistic Reasoning; Neural Networks and Genetic Algorithms; Machine Learning and Data Mining; Case-based Reasoning; Multi-Agent System; Natural Language Processing and Information Retrieval; Robotics and Perception; and Web Applications.

Our method of feature selection, called ``Binary Category'', and another method using IG were run over this data set. The comparison method assessed the IG metric over the set of all words encountered in all texts and then chose the best k words according to that metric; we call this ``Multivalued Category.'' After the best features were chosen with Multivalued Category and Binary Category, we estimated classification accuracy with kNN and SVM classifiers for each value of k. SVM training was carried out with TinySVM [5]. To handle the n-category classification problem, we applied the one-versus-rest approach to the TinySVM classifier.

To estimate accuracy for the selected features, we used n-fold cross-validation. The data set is randomly divided into n sets of approximately equal size. For each ``fold'', the classifier is trained on all but one of the n groups and then tested on the unseen group. This procedure is repeated for each of the n groups. The cross-validation score is the average performance across the n runs. We used 10-fold cross-validation in our experiments.
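As a rough sketch of this evaluation protocol, the following Python snippet runs 10-fold cross-validation with a one-versus-rest SVM and a kNN baseline. It is not the paper's actual pipeline: we substitute scikit-learn's LinearSVC and KNeighborsClassifier for TinySVM and use synthetic feature vectors, since the real data is not reproduced here.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X = rng.random((188, 50))        # 188 papers, k = 50 selected features
    y = np.arange(188) % 14          # 14 section labels, roughly balanced

    # One-versus-rest linear SVM standing in for TinySVM, plus a kNN baseline.
    classifiers = {
        "SVM": OneVsRestClassifier(LinearSVC()),
        "kNN": KNeighborsClassifier(n_neighbors=5),
    }
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross-validation
        print(f"{name}: mean accuracy = {scores.mean():.3f}")

On the real data, the feature vectors would come from the Binary Category or Multivalued Category selection described above rather than from random numbers.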