improve the convergence rate. Based on the preceding analysis, methods for improving the convergence rate fall into the following classes:
1) Variance reduction. Comparatively speaking, this class of methods has already been studied extensively, but its effect is limited (a minimal sketch of the classic baseline technique is given after this list).
2) Natural gradient methods. These methods still rely on the conventional gradient, and therefore on gradient estimation algorithms, so they cannot avoid the problem of excessive variance in the gradient estimation process (see the second sketch after this list).
3) Exploiting prior knowledge. Using prior knowledge is an effective and important means of moving policy gradient algorithms toward practical use. However, how can prior knowledge be incorporated conveniently and effectively? Does introducing prior knowledge affect the convergence of the algorithm? Prior knowledge is not guaranteed to be correct, so how can the misleading influence of incorrect prior knowledge be eliminated? These three questions require further research.
4) Hierarchical policy gradient methods. For problems that can be decomposed, this is an effective approach, but not every problem is decomposable.
5) Local policy gradient estimation methods. Because only the situations in which the action changes are considered, the amount of computation is greatly reduced. Although this class of methods has not yet received sufficient attention, it is expected to improve algorithm performance substantially.
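As a concrete illustration of point 1), the sketch below shows a likelihood-ratio (REINFORCE-style) policy gradient estimator for a tabular softmax policy, with a running-average baseline subtracted from the episode return. Subtracting a baseline leaves the gradient estimate unbiased while reducing its variance. The environment interface (`env.reset()`, `env.step(a)` returning `(next_state, reward, done)`), the tabular parameterization, and all step sizes are assumptions introduced here for illustration; they are not taken from the surveyed algorithms.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_with_baseline(env, n_states, n_actions,
                            episodes=1000, alpha=0.01, gamma=0.99):
    """Likelihood-ratio policy gradient with a running-average baseline.

    Subtracting a baseline from the return does not bias the gradient
    estimate (since E[grad log pi] = 0) but can greatly reduce its variance.
    """
    theta = np.zeros((n_states, n_actions))  # tabular softmax policy parameters
    baseline = 0.0                            # running average of episode returns

    for _ in range(episodes):
        # Roll out one episode with the current stochastic policy.
        s = env.reset()
        trajectory, done = [], False
        while not done:
            probs = softmax(theta[s])
            a = np.random.choice(n_actions, p=probs)
            s_next, r, done = env.step(a)
            trajectory.append((s, a, r))
            s = s_next

        # Discounted return of the whole episode.
        G = sum(gamma ** t * r for t, (_, _, r) in enumerate(trajectory))

        # REINFORCE update: grad log pi(a_t | s_t) * (G - baseline).
        for s_t, a_t, _ in trajectory:
            probs = softmax(theta[s_t])
            grad_log = -probs
            grad_log[a_t] += 1.0              # gradient of log softmax w.r.t. theta[s_t]
            theta[s_t] += alpha * (G - baseline) * grad_log

        # Move the baseline toward the average observed return.
        baseline += 0.05 * (G - baseline)

    return theta
```

Replacing the constant baseline with a state-dependent value estimate, as actor-critic methods do, is the usual next step when a constant baseline alone does not reduce the variance enough.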
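For point 2), the following sketch shows one way a natural gradient direction can be obtained: precondition an ordinary gradient estimate with the inverse of an empirical Fisher information matrix built from sampled score vectors (grad log pi). The function name, the damping term, and the assumption that score samples are already available are illustrative choices, not a prescribed implementation from the surveyed work.

```python
import numpy as np

def natural_gradient_step(grad, score_samples, step_size=0.1, damping=1e-3):
    """Precondition an ordinary policy gradient with the inverse Fisher matrix.

    grad          : ordinary gradient estimate, shape (d,)
    score_samples : shape (n, d); each row is grad log pi(a|s; theta) for one
                    sampled state-action pair
    Returns step_size * F^{-1} grad, the natural gradient update direction.
    """
    n, d = score_samples.shape
    # Empirical Fisher information: average outer product of the score vectors.
    fisher = score_samples.T @ score_samples / n
    # A small damping term keeps the matrix well conditioned with few samples.
    fisher += damping * np.eye(d)
    return step_size * np.linalg.solve(fisher, grad)
```

Note that the ordinary gradient estimate `grad` is still required as input, which is exactly why, as stated in point 2), natural gradient methods do not by themselves eliminate the variance problem of gradient estimation.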