Multi-agent hierarchical reinforcement learning is essentially the combination of the multi-agent and hierarchical reinforcement learning approaches. However, constrained by its underlying principles, MAHRL still falls short in terms of exploration effectiveness, sample efficiency, and model robustness. Therefore, targeted study of how other methods, such as supervised learning, meta-learning, imitation learning, transfer learning, and incremental learning, can be applied to and combined with MAHRL will be an important direction for MAHRL research and development.

4 Conclusion

This paper has reviewed multi-agent hierarchical reinforcement learning. It first introduced the state of related research on reinforcement learning, semi-Markov decision processes, and multi-agent techniques, and then surveyed MAHRL from a hierarchical perspective, describing the algorithmic principles and research status of four classes of MAHRL methods: option-based, hierarchy-of-abstract-machines-based, value-function-decomposition-based, and end-to-end. It also outlined the current applications of MAHRL in domains such as robot control, game decision-making, and task planning. As a potential approach to cooperative decision-making in large-scale, complex settings, MAHRL still leaves many problems unsolved; nevertheless, it can be foreseen that, as research continues to deepen, multi-agent hierarchical reinforcement learning will become an important method for solving intelligent decision-making problems.