…method to learn a model-free controller, yielding an efficient learning algorithm.

5 Conclusion

This paper has analyzed and summarized current research on applying reinforcement learning algorithms to the locomotion gait control of bionic robots from three aspects: problem formalization, policy representation, and policy learning, and has outlined the problems that remain to be solved and the future directions for reinforcement learning in this field. Overall, unlike in simulation, the gait control of a bionic robot is constrained by the actuation, mechanical structure, and communication of the physical robot system, which makes the application of reinforcement learning in this field highly challenging. In general, the problem needs to be formalized as a constrained Markov decision process; for policy representation, domain-structured representations are preferred; and for policy learning, efficient direct policy search methods perform better. However, reinforcement learning for bionic-robot gait learning and control still suffers from low sample efficiency, ineffective multi-task learning, poor transfer from simulation to physical platforms, and poor learning robustness. Emerging approaches such as model-based reinforcement learning, meta reinforcement learning, and hierarchical reinforcement learning are expected to solve or alleviate these problems.
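As a brief illustration of the constrained-MDP formalization mentioned above, gait learning can be posed as maximizing the expected discounted return subject to expected-cost constraints. The cost terms c_i (for example, joint-torque or body-tilt penalties) and the thresholds d_i below are generic placeholders introduced here only for illustration, not a formulation taken from the surveyed works:

\max_{\pi}\; \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}\, r(s_t,a_t)\right] \quad \text{s.t.} \quad \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}\, c_i(s_t,a_t)\right]\le d_i,\quad i=1,\dots,m

Here r is the locomotion reward (e.g., velocity tracking) and each constraint bounds a discounted cost such as energy consumption or attitude deviation; direct policy search methods typically handle such constraints through a Lagrangian relaxation.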