Issue 1    MO Lingfei, et al.: A survey of video prediction based on deep learning    ·93·

... modeling video scenes to infer future video, thereby helping machines make better decisions, but also in the fact that the internal visual representations it learns in an unsupervised manner can accelerate or improve weakly supervised and supervised learning. The field has therefore attracted growing attention from researchers and has made considerable progress. Nevertheless, existing methods still have many shortcomings:

1) The models proposed so far are structurally homogeneous: most are built on autoencoders, recurrent neural networks (including LSTM), and generative adversarial networks. Although these architectures achieve good results, they still cannot efficiently model the complex dynamic structure of the natural world. As a result, current models can predict only a limited number of frames, from a few to a few dozen, and the later predicted frames become blurry or lose semantic information.

2) The loss functions currently used in academia for video prediction are also limited in variety. The common ones are the mean squared error loss, the adversarial loss, and the image gradient difference loss. Images carry high-dimensional, complex structural information that these losses do not fully take into account, so the predicted images lack semantic information. In addition, peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), the usual image quality metrics, are not fully consistent with human visual perception: the sensitivity of human vision to error is not absolute, and the perceived result varies under the influence of many factors. Image evaluation metrics therefore remain an open research question.

3) In theory, predicting video dynamics has broad application value in robot decision-making, autonomous driving, video surveillance systems, and similar fields. However, video prediction research is currently concentrated in academia and is still at an early stage, and concrete industrial applications have not yet begun.

Video prediction learning is a powerful means of understanding and modeling the dynamics of natural scenes, and a new and important breakthrough point for unsupervised learning. Although the field faces many challenges and unsolved problems, cognitive science and deep learning are developing very rapidly, particularly in reinforcement learning, semi-supervised learning, and unsupervised learning, and computing power keeps increasing. These favorable factors will surely accelerate progress in video prediction research.
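
The losses and metrics named in point 2 can be made concrete with a small sketch. The survey text here gives no formulas, so everything below is an assumption for illustration: the function names are hypothetical, frames are plain 2-D lists of intensities in [0, 1], and the gradient difference term uses exponent 1 (the sum of absolute differences between the absolute neighboring-pixel differences of the two frames). It compares a uniformly brightened frame and a blurred frame against the same sharp target.

```python
import math

def mse(pred, target):
    """Mean squared error over all pixels of two equally sized 2-D frames."""
    diffs = [(p - t) ** 2
             for p_row, t_row in zip(pred, target)
             for p, t in zip(p_row, t_row)]
    return sum(diffs) / len(diffs)

def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical frames."""
    err = mse(pred, target)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)

def abs_gradients(frame):
    """Absolute differences between horizontally (dx) and vertically (dy) adjacent pixels."""
    dx = [[abs(row[j + 1] - row[j]) for j in range(len(row) - 1)] for row in frame]
    dy = [[abs(frame[i + 1][j] - frame[i][j]) for j in range(len(frame[0]))]
          for i in range(len(frame) - 1)]
    return dx, dy

def gradient_difference(pred, target):
    """Sum of absolute differences between the gradient magnitudes of two frames."""
    pred_dx, pred_dy = abs_gradients(pred)
    tgt_dx, tgt_dy = abs_gradients(target)
    total = 0.0
    for p_map, t_map in ((pred_dx, tgt_dx), (pred_dy, tgt_dy)):
        for p_row, t_row in zip(p_map, t_map):
            total += sum(abs(p - t) for p, t in zip(p_row, t_row))
    return total

# A sharp vertical edge, a uniformly brightened copy, and a copy with the edge smeared.
target = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
shifted = [[v + 0.1 for v in row] for row in target]
blurred = [[0.0, 1 / 3, 2 / 3, 1.0] for _ in range(4)]

for name, frame in (("shifted", shifted), ("blurred", blurred)):
    print(name, mse(frame, target), psnr(frame, target), gradient_difference(frame, target))
```

In this toy case the brightened frame scores about 20 dB PSNR with a gradient difference of essentially zero, because its edge is intact, while the blurred frame is penalized by both measures. This shows that pixel-wise error and edge-aware terms respond to different defects. It does not show that either one tracks human perception.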

Source: A survey of video prediction based on deep learning (Southeast University: MO Lingfei, JIANG Hongliang, LI Xuanpeng)