

CAAI Transactions on Intelligent Systems, Vol. 13, No. 5, Oct. 2018
DOI: 10.11992/tis.201711007
Online publication address: http://kns.cnki.net/kcms/detail/23.1538.tp.20180423.1540.016.html

The nearest neighbor text classification method based on support vector

GULNAZ Alimjan¹,²,³, HURXIDA Jumahun¹, SUN Tieli², LIANG Yi¹

(1. Department of Electronics and Information Engineering, Yili Normal University, Yining 835000, China; 2. School of Information Science and Technology, Northeast Normal University, Changchun 130117, China; 3. Department of Geographical Science, Northeast Normal University, Changchun 130024, China)

Abstract: Text classification automatically assigns a set of predefined categories or topics to a document. In text classification, the representation of a document strongly influences the performance of the learning machine. To classify Kazakh text, a stemming procedure is designed and implemented according to Kazakh grammar rules, which completes the preprocessing of Kazakh text. A sample distance formula based on the nearest support vector is proposed, which avoids having to select the parameter k, and Kazakh texts are classified with SV-NN, a special combination of the support vector machine (SVM) and k-nearest neighbor (KNN) classification algorithms. Text classification simulation experiments were conducted on a self-built Kazakh text corpus; the numerical experiments demonstrate the effectiveness of the proposed algorithm and confirm the theoretical results.

Keywords: stemming; preprocessing; support vector machine; text classification; classification accuracy

CLC number: TP309   Document code: A   Article ID: 1673-4785(2018)05-0799-09

Citation: GULNAZ Alimjan, HURXIDA Jumahun, SUN Tieli, et al. The nearest neighbor text classification method based on support vector[J]. CAAI transactions on intelligent systems, 2018, 13(5): 799-807.

Text classification (TC) is the process of automatically assigning a set of predefined categories or application topics to a document [1]. With the rapid growth of digital libraries, TC has become a key technology for organizing and processing text data. Digitized data takes different forms, such as text, images, and spatial data, of which text is the most common and the most widely used: the news we read and the posts and messages on social media appear mainly as text. Automatic text classification in website classification [2-3], automatic indexing [4-5], e-mail filtering [6], spam filtering [7-9], ontology matching [10], hypertext classification [11-12], sentiment analysis [13-14], and many other information retrieval

Received: 2017-11-02. Published online: 2018-04-24.
Funding: General Project of Yili Normal University (2016WXYB0004); National Natural Science Foundation of China (61663045); Key Research Project of the Xinjiang Higher Education Scientific Research Program (XJEDU2014I043); Key Project of Yili Normal University (2016YSZD04).
Corresponding author: GULNAZ Alimjan. E-mail: alay328@163.com.
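
The SV-NN idea summarized in the abstract, replacing a vote among k neighbors with a distance to the nearest support vector, can be illustrated with a minimal sketch. This is not the authors' implementation: the paper's distance formula does not appear in this excerpt, so plain Euclidean distance is assumed here, the support vectors are assumed to have been extracted already from a trained SVM, and the function names, category labels, and feature values are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sv_nn_predict(sample, support_vectors):
    """Return the label whose nearest support vector is closest to `sample`.

    `support_vectors` maps each class label to the list of support vectors
    that a previously trained SVM kept for that class (assumed input).
    No parameter k is needed: only the single nearest support vector counts.
    """
    best_label, best_dist = None, math.inf
    for label, vectors in support_vectors.items():
        for sv in vectors:
            d = euclidean(sample, sv)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label


# Hypothetical two-dimensional document features and category names.
support_vectors = {
    "sports": [[1.0, 0.0], [0.8, 0.3]],
    "economy": [[0.0, 1.0], [0.2, 0.9]],
}

print(sv_nn_predict([0.9, 0.1], support_vectors))
print(sv_nn_predict([0.1, 0.8], support_vectors))
```

Because only support vectors are compared, the search set is typically much smaller than the full training set used by ordinary KNN; whether the authors exploit this in the same way is not stated in the excerpt.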
