CAAI Transactions on Intelligent Systems, Vol. 4, No. 3, Jun. 2009
doi: 10.3969/j.issn.1673-4785.2009.03.003

Survey of apprenticeship learning based on reward function learning

JIN Zhuo-jun, QIAN Hui, CHEN Shen-yi, ZHU Miao-liang
(Department of Computer Science, Zhejiang University, Hangzhou 310027, China)

Abstract: This paper surveys apprenticeship learning based on reward function learning, covering both the historical development of the field and a broad selection of current work. Two families of algorithms, apprenticeship learning based on inverse reinforcement learning (IRL) and on the maximum margin planning (MMP) framework, are discussed under the respective assumptions of linear and nonlinear reward functions, and are compared under the linear assumption. The former admits an efficient approximate algorithm but makes a strong assumption that the demonstrations are optimal; the latter takes a form that is easier to extend but is computationally expensive. Finally, open problems and directions for further research are suggested, such as applying apprenticeship learning in partially observable Markov decision process (POMDP) environments and in continuous or high-dimensional spaces, using approximate algorithms such as point-based value iteration (PBVI), or extracting learning features through dimension-reduction methods such as principal component analysis (PCA), so as to alleviate the heavy computation brought by high dimensionality.

Keywords: apprenticeship learning; reward function; inverse reinforcement learning; maximum margin planning

CLC number: TP181    Document code: A    Article ID: 1673-4785(2009)03-0208-05

Received date: 2008-10-08.
Foundation items: Supported by the National Natural Science Foundation of China (90820306) and a major project of the Science and Technology Department of Zhejiang Province (006C13096).
Corresponding author: QIAN Hui. E-mail: qianhui@zju.edu.cn.

Apprenticeship learning, also known as learning from demonstration, imitation learning, or learning by watching, refers to the process in which a learner imitates the behavior or control policy of an expert [1]. Reference [2] reviews the development of apprenticeship learning based on margin maximization, focusing on the formulation, solution, and optimization of the maximum margin planning (MMP) framework. Building on reference [3], this paper gives a more detailed analysis and comparison of the two families of methods, apprenticeship learning based on inverse reinforcement learning and on the MMP framework, and also discusses some of the latest advances in this field in recent years.

In mobile robot control, the planning module takes a reward function as input and computes a decision sequence that maximizes the cumulative reward; this approach has become the core of autonomous mobile robot systems. In practice, however, constructing the reward function is difficult: it usually has to be tuned by hand, the resulting behavior observed, and the function adjusted again, iterating in this way until the reward function is complete. Such a process is often impractical in real-world settings.

Apprenticeship learning based on reward function learning is used to recover the reward function from expert demonstrations.
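As a minimal sketch of the pipeline described above (the grid world, the feature map phi, the weights w, and all function names are illustrative assumptions, not taken from the paper), the following Python fragment feeds a hand-tuned linear reward R(s) = w · φ(s) into a value-iteration planner, which is the "planning module" that turns a reward function into a decision policy:

```python
import numpy as np

# Toy grid-world MDP: states are cells, actions move up/down/left/right.
N = 5                                  # 5x5 grid, states indexed 0..24
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]
GAMMA = 0.95

def phi(s):
    """Feature vector of a state: normalized distance to goal and a hazard flag."""
    r, c = divmod(s, N)
    dist = (abs(r - (N - 1)) + abs(c - (N - 1))) / (2 * (N - 1))
    hazard = 1.0 if (r, c) in {(1, 1), (2, 3)} else 0.0
    return np.array([dist, hazard])

def step(s, a):
    """Deterministic transition; moving into a wall leaves the state unchanged."""
    r, c = divmod(s, N)
    dr, dc = ACTIONS[a]
    nr, nc = min(max(r + dr, 0), N - 1), min(max(c + dc, 0), N - 1)
    return nr * N + nc

def plan(w, iters=200):
    """Value iteration: given reward weights w, return a greedy policy.
    This plays the role of the planning module the reward is fed into."""
    R = np.array([phi(s) @ w for s in range(N * N)])   # linear reward R(s) = w . phi(s)
    V = np.zeros(N * N)
    for _ in range(iters):
        V = np.array([R[s] + GAMMA * max(V[step(s, a)] for a in range(4))
                      for s in range(N * N)])
    return [int(np.argmax([V[step(s, a)] for a in range(4)])) for s in range(N * N)]

# The manual tuning loop the paper describes: pick weights, plan, inspect, adjust.
w_hand_tuned = np.array([-1.0, -5.0])  # penalize distance to goal and hazards
policy = plan(w_hand_tuned)
print(policy[:N])                      # inspect the behavior, then re-tune w by hand
```

Both families of methods surveyed here, IRL-based apprenticeship learning and MMP, can be read as wrapping such a planner in an outer loop that adjusts w automatically so that the planner's output matches the expert demonstrations, replacing the manual tune-observe-adjust cycle.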