DOI: 10.11992/tis.201804052


CAAI Transactions on Intelligent Systems (智能系统学报), Vol. 14 No. 3, May 2019
Online publication address: http://kns.cnki.net/kcms/detail/23.1538.TP.20180627.1529.002.html

Active learning through PageRank (基于 PageRank 的主动学习算法)

DENG Siyu1, LIU Fulun1, HUANG Yuting1, WANG Min2
(1. School of Computer Science, Southwest Petroleum University, Chengdu 610500, China; 2. School of Electrical Engineering and Information, Southwest Petroleum University, Chengdu 610500, China)

Received: 2018-04-26. Published online: 2018-06-28.
Funding: National Natural Science Foundation of China (61379089).
Corresponding author: WANG Min. E-mail: wangmin80616@163.com.

Abstract: In many classification tasks, there are a large number of unlabeled samples, and obtaining a label for each sample is expensive and time-consuming. Active learning identifies the key samples that most deserve to be labeled, so that an accurate classifier can be trained at minimum labeling cost. In this paper, we propose a PageRank-based active learning algorithm (PAL), which makes full use of sample distribution information for effective sample selection. First, following PageRank, we sequentially calculate the neighborhoods, the score matrix, and the ranking vector from the similarity relationships among samples. Next, we select representative samples and build a binary tree from their similarity relationships. Then, we use this binary tree to cluster, label, and predict the representative samples. Lastly, we use the representative samples as the training set for classifying the remaining samples. We conducted experiments on eight public datasets, comparing PAL with five traditional classification algorithms and three popular active learning algorithms. The results show that PAL achieved higher classification accuracy.

Keywords: classification; active learning; PageRank; neighborhood; clustering; binary tree

CLC number: TP181. Document code: A. Article ID: 1673-4785(2019)03-0551-09

Citation: DENG Siyu, LIU Fulun, HUANG Yuting, et al. Active learning through PageRank[J]. CAAI transactions on intelligent systems, 2019, 14(3): 551-559.

The classification performance of traditional supervised learning algorithms, such as Naïve Bayes[1], OneR[2], and J48[3], depends on the quality of the training data. Typically, labeled samples serve as the training set, from which the learning algorithm trains a classification model. In real data-analysis scenarios, however, unlabeled samples are abundant and easy to obtain, whereas labeled samples are scarce and hard to obtain. Labeling massive amounts of data is time-consuming, expensive, and difficult. Against this background, semi-supervised learning[4] and active learning[5] were proposed and have developed rapidly, and they have been widely applied in areas such as text classification[6], speech recognition[7], and image classification[8].
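The ranking step summarized in the abstract (neighborhoods, then scores, then a ranking vector) can be illustrated with a minimal sketch. This excerpt does not give the paper's exact definitions, so everything below is an assumption for illustration only: neighborhoods are taken as all samples within a Euclidean `radius`, the scores come from standard PageRank power iteration with a damping factor of 0.85, and the function names `neighborhood_graph`, `pagerank`, and `rank_samples` are hypothetical. It is not the authors' PAL implementation.

```python
import math


def neighborhood_graph(points, radius):
    """Link each sample to every other sample within `radius` (assumed Euclidean)."""
    n = len(points)
    neighbors = [[] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(points[i], points[j]) <= radius:
                neighbors[i].append(j)
    return neighbors


def pagerank(neighbors, damping=0.85, tol=1e-10, max_iter=200):
    """Standard PageRank by power iteration; returns one score per sample."""
    n = len(neighbors)
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        # Samples without neighbors spread their score uniformly over all samples.
        dangling = sum(rank[i] for i in range(n) if not neighbors[i])
        new_rank = [(1.0 - damping) / n + damping * dangling / n] * n
        for i, out in enumerate(neighbors):
            if out:
                share = damping * rank[i] / len(out)
                for j in out:
                    new_rank[j] += share
        # Stop once the scores change by less than `tol` in total.
        if sum(abs(a - b) for a, b in zip(new_rank, rank)) < tol:
            return new_rank
        rank = new_rank
    return rank


def rank_samples(points, radius):
    """Sample indices ordered from highest to lowest PageRank score."""
    scores = pagerank(neighborhood_graph(points, radius))
    return sorted(range(len(points)), key=lambda i: -scores[i])


if __name__ == "__main__":
    # Four clustered samples and one isolated sample (toy data, not from the paper).
    toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (0.5, 0.5), (10.0, 10.0)]
    print(rank_samples(toy, radius=1.0))
```

Under these assumptions, samples in dense regions receive score from many neighbors and rise to the top of the ranking, which makes them natural candidates for representative samples. The isolated sample receives only the uniform share and ranks last.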

