No. 1  MO Lingfei, et al.: A Survey of Video Prediction Research Based on Deep Learning  ·93·

model video scenes and thereby infer future frames, helping machines make better decisions, but also in that the internal visual representations it learns in an unsupervised manner can speed up or improve weakly supervised and supervised learning. Video prediction has therefore attracted growing attention from researchers and has seen considerable progress. However, existing methods still have several shortcomings:

1) The models proposed so far are architecturally homogeneous. Most are built on autoencoders, recurrent neural networks (including LSTM) and generative adversarial networks. These architectures perform reasonably well, but they still cannot efficiently model the complex dynamics of natural scenes. As a result, current models can predict only a limited horizon of a few to a few dozen frames, and the later predicted frames become blurry or lose semantic information.

2) The loss functions currently used for video prediction in academia are limited in variety. The most common are the mean squared error (MSE) loss, the adversarial loss and the image gradient difference loss. Images carry high-dimensional, complex structural information, and these losses do not fully take it into account, so the predicted frames lack semantic information. In addition, the usual evaluation criteria, peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), do not fully agree with human visual perception. The eye's sensitivity to errors is not absolute, and what it perceives varies with many factors. Image evaluation metrics therefore remain an open research question.

3) In principle, predicting video dynamics has broad application value in robot decision-making, autonomous driving, video surveillance systems and other fields. However, most video prediction research is still carried out in academia and is at an early stage, and industrial applications have yet to get started.
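Item 2 names the three losses in common use. The sketch below writes out the MSE and gradient difference terms directly. It assumes single grayscale frames stored as nested Python lists with intensities in [0, 1]. The gradient difference loss follows the form usually attributed to Mathieu et al. (2016), with a hypothetical exponent parameter `alpha`. The discriminator network is out of scope, so the adversarial term is reduced to a non-saturating generator loss over hypothetical discriminator outputs `d_fake`. The weights in `combined_loss` are illustrative placeholders, not values from this survey.

```python
import math
from typing import List, Sequence

Frame = List[List[float]]  # one grayscale frame: rows of pixel intensities in [0, 1]


def mse_loss(pred: Frame, target: Frame) -> float:
    """Mean squared error over all pixels of the predicted frame."""
    sq = [(p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(sq) / len(sq)


def gradient_difference_loss(pred: Frame, target: Frame, alpha: float = 1.0) -> float:
    """Penalise mismatch between the absolute neighbour differences
    (image gradients) of the prediction and those of the target."""
    h, w = len(target), len(target[0])
    loss = 0.0
    for i in range(h):
        for j in range(w):
            if i > 0:  # vertical neighbour difference
                g_t = abs(target[i][j] - target[i - 1][j])
                g_p = abs(pred[i][j] - pred[i - 1][j])
                loss += abs(g_t - g_p) ** alpha
            if j > 0:  # horizontal neighbour difference
                g_t = abs(target[i][j] - target[i][j - 1])
                g_p = abs(pred[i][j] - pred[i][j - 1])
                loss += abs(g_t - g_p) ** alpha
    return loss


def adversarial_generator_loss(d_fake: Sequence[float], eps: float = 1e-12) -> float:
    """Non-saturating generator loss -log D(G(x)), averaged over a batch.

    `d_fake` holds a discriminator's probabilities that each predicted
    frame is real; the discriminator itself is not modelled here."""
    return sum(-math.log(max(p, eps)) for p in d_fake) / len(d_fake)


def combined_loss(pred: Frame, target: Frame, d_fake: Sequence[float],
                  w_mse: float = 1.0, w_gdl: float = 1.0, w_adv: float = 0.05) -> float:
    """Weighted sum of the three terms; the default weights are illustrative only."""
    return (w_mse * mse_loss(pred, target)
            + w_gdl * gradient_difference_loss(pred, target)
            + w_adv * adversarial_generator_loss(d_fake))


if __name__ == "__main__":
    target = [[0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0]]   # sharp vertical edge
    blurred = [[0.0, 0.5, 0.5, 1.0], [0.0, 0.5, 0.5, 1.0]]  # edge smeared over two pixels
    print(mse_loss(blurred, target))                  # 0.125
    print(gradient_difference_loss(blurred, target))  # 4.0
```

For the smeared edge in the demo, MSE charges 0.125 per pixel on average. The gradient term adds one penalty for each of the three horizontal neighbour differences per row that no longer match the sharp target. This explicit penalty on lost sharpness is the usual motivation for pairing the two terms.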
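The two evaluation criteria named in item 2 are short to compute. The sketch below uses the same nested-list frame assumption and implements PSNR and a simplified SSIM evaluated over a single global window. Standard SSIM averages the same statistic over local windows, typically Gaussian-weighted, using the constants C1 = (0.01L)^2 and C2 = (0.03L)^2 that also appear here. The demo illustrates the item's point that these scores can disagree with perception. A stripe pattern displaced by one pixel still shows the same texture, yet it scores worse than a flat grey frame that has lost the structure entirely.

```python
import math
from typing import List

Frame = List[List[float]]  # grayscale frame, intensities in [0, max_val]


def _flatten(frame: Frame) -> List[float]:
    return [v for row in frame for v in row]


def psnr(pred: Frame, target: Frame, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    p, t = _flatten(pred), _flatten(target)
    mse = sum((a - b) ** 2 for a, b in zip(p, t)) / len(t)
    if mse == 0.0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_val ** 2 / mse)


def global_ssim(pred: Frame, target: Frame, max_val: float = 1.0) -> float:
    """SSIM computed once over the whole frame.

    Standard SSIM averages this statistic over local windows; a single
    global window keeps the sketch short."""
    x, y = _flatten(pred), _flatten(target)
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2  # stabilising constants
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))


if __name__ == "__main__":
    target = [[0.0, 1.0] * 4 for _ in range(4)]   # fine vertical stripes
    shifted = [[1.0, 0.0] * 4 for _ in range(4)]  # same stripes, moved one pixel
    grey = [[0.5] * 8 for _ in range(4)]          # structure lost entirely
    for name, frame in (("shifted", shifted), ("grey", grey)):
        print(name, round(psnr(frame, target), 2), round(global_ssim(frame, target), 4))
```

With intensities in [0, 1], the shifted frame gets 0 dB PSNR and a negative SSIM. The grey frame gets about 6 dB and a small positive SSIM, so both metrics rank the structureless prediction higher.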
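Item 2 attributes the lack of semantic content in predicted frames to losses that ignore image structure, and item 1 notes that later frames turn blurry. One commonly given explanation is that when several futures are plausible, the prediction minimising expected MSE is their pixel-wise average. This is an assumption here; the survey does not spell it out. In the hypothetical 1-D toy scene below, an object moves left or right with equal probability. The averaged "ghost" frame beats either sharp guess.

```python
from typing import Dict, List

Frame = List[float]  # a 1-D "frame": one row of pixel intensities


def mse(a: Frame, b: Frame) -> float:
    """Mean squared error between two frames of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def expected_mse(pred: Frame, futures: List[Frame], probs: List[float]) -> float:
    """Probability-weighted MSE of one prediction against several possible futures."""
    return sum(p * mse(pred, f) for f, p in zip(futures, probs))


# Hypothetical toy scene: a bright object at pixel 2 moves one pixel left or right.
left = [0.0, 1.0, 0.0, 0.0, 0.0]
right = [0.0, 0.0, 0.0, 1.0, 0.0]
futures, probs = [left, right], [0.5, 0.5]

# The expected-MSE minimiser is the pixel-wise mean of the possible futures.
mean_pred = [sum(p * f[i] for f, p in zip(futures, probs)) for i in range(len(left))]

candidates: Dict[str, Frame] = {"sharp left": left, "sharp right": right, "mean": mean_pred}
scores = {name: expected_mse(frame, futures, probs) for name, frame in candidates.items()}
best = min(scores, key=scores.get)

print(mean_pred)  # -> [0.0, 0.5, 0.0, 0.5, 0.0], two half-intensity ghosts
print(scores, "->", best)
```

A sharp guess is perfect half the time but pays the full error otherwise, for an expected MSE of 0.2. The two half-intensity ghosts are never exactly right, yet they score 0.1. A model trained only on MSE is therefore pushed toward such blurred averages.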
Video prediction learning is a powerful means of understanding and modeling the dynamics of natural scenes, and a new and important breakthrough point for unsupervised learning. The field still faces many challenges and open problems. However, cognitive science and deep learning are advancing rapidly, especially in reinforcement learning, semi-supervised learning and unsupervised learning, and available computing power keeps growing. These favorable factors should accelerate progress in video prediction research.