Modeling Long-Term Dependencies from Videos Using Deep Multiplicative Neural Networks

Abstract
Understanding the temporal dependencies of videos is fundamental to many vision problems, yet deep learning–based models remain limited in this area. In this article, we propose a novel deep multiplicative neural network (DMNN) for learning hierarchical long-term representations from video. The DMNN is built upon a multiplicative block that remembers the pairwise transformations between consecutive frames using multiplicative interactions rather than the usual weighted sums. The block is slid over the timesteps to update the network's memory of the frame pairs, and a deep architecture is obtained by stacking multiple layers of these sliding blocks. The multiplicative interactions lead to exact, rather than approximate, modeling of temporal dependencies; the memory mechanism can retain these dependencies for an arbitrary length of time; and the multiple layers output multi-level representations that reflect the multi-timescale structure of video. Moreover, to address the difficulty of training DMNNs, we derive a theoretically sound method with fast and stable convergence. We demonstrate new state-of-the-art classification performance with the proposed networks on the UCF101 dataset, as well as their effectiveness at capturing complicated temporal dependencies on a variety of synthetic datasets.
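To make the architecture concrete, the following is a minimal sketch in PyTorch of a factored multiplicative block slid over the time axis and stacked into layers. All class names, layer sizes, and the sigmoid nonlinearity are illustrative assumptions rather than the authors' implementation; the sketch uses a standard factored gated formulation of multiplicative interactions and omits the paper's memory update and convergent training procedure.

# Hypothetical sketch of a DMNN-style multiplicative block; names and
# dimensions are illustrative, not the authors' code.
import torch
import torch.nn as nn

class MultiplicativeBlock(nn.Module):
    """Encodes the transformation between two consecutive frames via
    factored multiplicative interactions (element-wise products of
    projected frames, rather than weighted sums)."""
    def __init__(self, frame_dim, num_factors, num_maps):
        super().__init__()
        self.U = nn.Linear(frame_dim, num_factors, bias=False)  # factors for frame t
        self.V = nn.Linear(frame_dim, num_factors, bias=False)  # factors for frame t+1
        self.W = nn.Linear(num_factors, num_maps)               # mapping (transformation) units

    def forward(self, x_t, x_tp1):
        # Multiplicative interaction: pairwise product of the two
        # projected frames gates which transformation units activate.
        return torch.sigmoid(self.W(self.U(x_t) * self.V(x_tp1)))

class DMNNLayer(nn.Module):
    """Slides one multiplicative block over the time axis."""
    def __init__(self, frame_dim, num_factors, num_maps):
        super().__init__()
        self.block = MultiplicativeBlock(frame_dim, num_factors, num_maps)

    def forward(self, frames):
        # frames: (batch, time, frame_dim) -> (batch, time - 1, num_maps)
        maps = [self.block(frames[:, t], frames[:, t + 1])
                for t in range(frames.size(1) - 1)]
        return torch.stack(maps, dim=1)

if __name__ == "__main__":
    # Stacking layers yields multi-timescale representations: each layer
    # consumes the mapping sequence produced by the layer below it.
    video = torch.randn(8, 16, 1024)   # 8 clips, 16 frames, 1024-d frame features
    layer1 = DMNNLayer(1024, 256, 128)
    layer2 = DMNNLayer(128, 64, 32)
    h1 = layer1(video)                 # (8, 15, 128): frame-pair transformations
    h2 = layer2(h1)                    # (8, 14, 32): transformations of transformations
    print(h1.shape, h2.shape)

In this factored form, the element-wise product lets one frame multiplicatively gate the projection of the next, so the mapping units describe the relation between the frames exactly instead of summing their contributions, which is the distinction the abstract draws between multiplicative interactions and regular weighted-sum ones.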