
Modeling Long-Term Dependencies from Videos Using Deep Multiplicative Neural Networks


Abstract

Understanding the temporal dependencies in video is fundamental to many vision problems, yet deep learning-based models remain insufficient in this area. In this article, we propose a novel deep multiplicative neural network (DMNN) for learning hierarchical long-term representations from video. The DMNN is built upon a multiplicative block that captures the pairwise transformation between consecutive frames using multiplicative interactions rather than the usual weighted sums. The block is slid over the time steps, updating the network's memory at each frame pair, and a deep architecture is obtained by stacking multiple layers of these sliding blocks. The multiplicative interactions yield exact, rather than approximate, modeling of temporal dependencies; the memory mechanism can retain those dependencies over arbitrarily long spans of time; and the stacked layers output multi-level representations that reflect the multi-timescale structure of video. Moreover, to address the difficulty of training DMNNs, we derive a theoretically sound training method that leads to fast and stable convergence. We demonstrate new state-of-the-art classification performance with the proposed networks on the UCF101 dataset, and show their effectiveness at capturing complicated temporal dependencies on a variety of synthetic datasets.
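
To make the core idea concrete, the following is a minimal NumPy sketch of a multiplicative block in the spirit of the factored gated models of Memisevic, which the description above resembles. It is not the authors' exact DMNN formulation: the function names, the factor matrices U and V, the output matrix W, and the recurrent memory matrix R are all illustrative assumptions. The block projects two consecutive frames onto shared factors, multiplies the projections element-wise so that the hidden code reflects the transformation between the frames rather than either frame alone, and folds the result into a running memory as it slides over the time steps.

```python
import numpy as np

def multiplicative_block(x_t, x_tp1, h_prev, U, V, W, R):
    """One hypothetical multiplicative block (illustrative, not the paper's equations).

    Projects consecutive frames x_t and x_{t+1} onto shared factors and
    multiplies the projections element-wise, so the hidden code responds to
    the transformation between the frames; the previous memory h_prev is
    folded in recurrently.
    """
    factors = (U @ x_t) * (V @ x_tp1)      # multiplicative interaction, shape (f,)
    h = np.tanh(W @ factors + R @ h_prev)  # updated memory / transformation code, shape (m,)
    return h

def slide_over_video(frames, U, V, W, R):
    """Slide the block over the time steps: one memory update per frame pair.

    Stacking a second layer on the returned codes would give the deeper,
    slower-timescale representations the abstract describes.
    """
    h = np.zeros(R.shape[0])
    codes = []
    for t in range(len(frames) - 1):
        h = multiplicative_block(frames[t], frames[t + 1], h, U, V, W, R)
        codes.append(h)
    return codes

# Toy usage with random weights and a random 10-frame "video" of 64-D frames.
rng = np.random.default_rng(0)
d, f, m = 64, 32, 16
U = rng.normal(scale=0.1, size=(f, d))
V = rng.normal(scale=0.1, size=(f, d))
W = rng.normal(scale=0.1, size=(m, f))
R = rng.normal(scale=0.1, size=(m, m))
frames = [rng.normal(size=d) for _ in range(10)]
codes = slide_over_video(frames, U, V, W, R)
print(len(codes), codes[0].shape)  # 9 transformation codes, each of size 16
```

The element-wise product is what makes the interaction multiplicative: a factor responds strongly only where the projections of the two frames agree, which is why such units can encode frame-to-frame relations exactly rather than approximating them with weighted sums.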



• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 2s
  Special Issue on Smart Communications and Networking for Future Video Surveillance and Special Section on Extended MMSYS-NOSSDAV 2019 Best Papers
  April 2020, 291 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3407689

        Copyright © 2020 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 14 July 2020
        • Online AM: 7 May 2020
        • Accepted: 1 August 2019
        • Revised: 1 July 2019
        • Received: 1 May 2019
