Survey

Deep Learning–Based Multimedia Analytics: A Review

Published: 24 January 2019

Abstract

The multimedia community has witnessed the rise of deep learning–based techniques for analyzing multimedia content more effectively. Over the past decade, the convergence of deep learning and multimedia analytics has boosted the performance of several traditional tasks, such as classification, detection, and regression, and has fundamentally reshaped the landscape of several relatively new areas, such as semantic segmentation, captioning, and content generation. This article reviews the development of the major tasks in multimedia analytics and looks ahead to future directions. We begin by summarizing the fundamental deep-learning techniques relevant to multimedia analytics, especially in the visual domain, and then review representative high-level tasks powered by recent advances. Moreover, a performance review on popular benchmarks traces the path of technological advancement and helps identify both milestone works and promising future directions.

References

  1. Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic propositional image caption evaluation. In Proceedings of the European Conference on Computer Vision. Springer, 382--398.
  2. Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 6077--6086.
  3. Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein GAN. In arXiv:1701.07875.
  4. Vijay Badrinarayanan, Ankur Handa, and Roberto Cipolla. 2015. SegNet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. In arXiv:1505.07293.
  5. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. In arXiv:1409.0473.
  6. Nicolas Ballas, Li Yao, Chris Pal, and Aaron Courville. 2015. Delving deeper into convolutional networks for learning video representations. In arXiv:1511.06432.
  7. Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Association for Computational Linguistics, 65--72.
  8. Lorenzo Baraldi, Costantino Grana, and Rita Cucchiara. 2017. Hierarchical boundary-aware neural encoder for video captioning. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 3185--3194.
  9. Bharat Singh and Larry S. Davis. 2018. An analysis of scale invariance in object detection -- SNIP. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 3578--3587.
  10. David L. Chen and William B. Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 190--200.
  11. Jingwen Chen, Ting Yao, and Hongyang Chao. 2018. See and chat: Automatically generating viewer-level comments on images. Multimedia Tools and Applications. (In press).
  12. Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. 2014. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In arXiv:1412.7062.
  13. Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. 2018. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 4 (2018), 834--848.
  14. Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. 2017. Rethinking atrous convolution for semantic image segmentation. In arXiv:1706.05587.
  15. Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. In arXiv:1802.02611.
  16. Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. In arXiv:1504.00325.
  17. François Chollet. 2016. Xception: Deep learning with depthwise separable convolutions. In arXiv:1610.02357.
  18. Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. 2015. Fast and accurate deep network learning by exponential linear units (ELUs). In arXiv:1511.07289.
  19. Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. 2016. The Cityscapes dataset for semantic urban scene understanding. In arXiv:1604.01685.
  20. Jifeng Dai, Kaiming He, and Jian Sun. 2015. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 1635--1643.
  21. Emily L. Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. 2015. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems. The MIT Press, 1486--1494.
  22. Jacob Devlin, Hao Cheng, Hao Fang, Saurabh Gupta, Li Deng, Xiaodong He, Geoffrey Zweig, and Margaret Mitchell. 2015. Language models for image captioning: The quirks and what works. In Proceedings of the Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 100--105.
  23. Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2625--2634.
  24. Mark Everingham, S. M. Ali Eslami, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. 2015. The Pascal visual object classes challenge: A retrospective. International Journal of Computer Vision 111, 1 (2015), 98--136.
  25. Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In Proceedings of the European Conference on Computer Vision. Springer, 15--29.
  26. Mostafa Gamal, Mennatullah Siam, and Moemen Abdel-Razek. 2018. ShuffleSeg: Real-time semantic segmentation network. In arXiv:1803.03816.
  27. Ross Girshick. 2015. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 1440--1448.
  28. Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2013. Rich feature hierarchies for accurate object detection and semantic segmentation. In arXiv:1311.2524.
  29. Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (JMLR W&CP), Vol. 9. 249--256.
  30. Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, Vol. 2. 2672--2680.
  31. Alex Graves. 2013. Generating sequences with recurrent neural networks. In arXiv:1308.0850.
  32. Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2013. YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 2712--2719.
  33. Richard H. R. Hahnloser, Rahul Sarpeshkar, Misha Mahowald, Rodney J. Douglas, and H. Sebastian Seung. 2000. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature 405, 6789 (2000), 947--951.
  34. Jun Han and Claudio Moraga. 1995. The influence of the sigmoid function parameters on the speed of backpropagation learning. In Lecture Notes in Computer Science, Vol. 930. Springer, 195--201.
  35. Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. 2017. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 2980--2988.
  36. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2014. Spatial pyramid pooling in deep convolutional networks for visual recognition. In arXiv:1406.4729.
  37. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. In arXiv:1512.03385.
  38. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 1026--1034.
  39. Chiori Hori, Takaaki Hori, Teng-Yok Lee, Ziming Zhang, Bret Harsham, John R. Hershey, Tim K. Marks, and Kazuhiko Sumi. 2017. Attention-based multimodal fusion for video description. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 4203--4212.
  40. Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. In arXiv:1704.04861.
  41. Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. 2018. Relation networks for object detection. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 3588--3597.
  42. Gao Huang, Shichen Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. CondenseNet: An efficient DenseNet using learned group convolutions. In arXiv:1711.09224.
  43. Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2261--2269.
  44. Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. In arXiv:1602.07360.
  45. Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, Vol. 37. Omnipress, 448--456.
  46. Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2016. Image-to-image translation with conditional adversarial networks. In arXiv:1611.07004.
  47. Yu-Gang Jiang, Zuxuan Wu, Jinhui Tang, Zechao Li, Xiangyang Xue, and Shih-Fu Chang. 2018. Modeling multimodal clues in a hybrid deep learning framework for video classification. IEEE Transactions on Multimedia 20, 11 (2018), 3137--3147.
  48. Jie Hu, Li Shen, and Gang Sun. 2018. Squeeze-and-excitation networks. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 7132--7141.
  49. Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2017. Progressive growing of GANs for improved quality, stability, and variation. In arXiv:1710.10196v2.
  50. Alex Kendall, Vijay Badrinarayanan, and Roberto Cipolla. 2015. Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. In arXiv:1511.02680.
  51. Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. 2017. Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the International Conference on Machine Learning, Vol. 70. 1857--1865.
  52. Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In arXiv:1412.6980.
  53. Atsuhiro Kojima, Takeshi Tamura, and Kunio Fukunaga. 2002. Natural language description of human activities from video images based on concept hierarchy of actions. International Journal of Computer Vision 50, 2 (2002), 171--184.
  54. Tao Kong, Fuchun Sun, Anbang Yao, Huaping Liu, Ming Lu, and Yurong Chen. 2017. RON: Reverse connection with objectness prior networks for object detection. In arXiv:1707.01691.
  55. Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. 2017. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 706--715.
  56. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25. 1097--1105.
  57. Girish Kulkarni, Visruth Premraj, Vicente Ordonez, et al. 2013. BabyTalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 12 (2013), 2891--2903.
  58. Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. 2017. Deep Laplacian pyramid networks for fast and accurate super-resolution. In arXiv:1704.03915.
  59. Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. 1989. Backpropagation applied to handwritten zip code recognition. Neural Computation 1, 4 (1989), 541--551.
  60. Dong Li, Zhaofan Qiu, Qi Dai, Ting Yao, and Tao Mei. 2018. Recurrent tubelet proposal and recognition networks for action detection. In Proceedings of the European Conference on Computer Vision. Springer, 306--322.
  61. Dong Li, Ting Yao, Lingyu Duan, Tao Mei, and Yong Rui. 2018. Unified spatio-temporal attention networks for action recognition in videos. IEEE Transactions on Multimedia. (In press).
  62. Qing Li, Zhaofan Qiu, Ting Yao, Tao Mei, Yong Rui, and Jiebo Luo. 2016. Action recognition by learning deep multi-granular spatio-temporal video representation. In Proceedings of the ACM International Conference on Multimedia Retrieval. ACM, 159--166.
  63. Qing Li, Zhaofan Qiu, Ting Yao, Tao Mei, Yong Rui, and Jiebo Luo. 2017. Learning hierarchical video representation for action recognition. International Journal of Multimedia Information Retrieval 6, 1 (2017), 85--98.
  64. Yehao Li, Ting Yao, Tao Mei, Hongyang Chao, and Yong Rui. 2016. Share-and-chat: Achieving human-level video commenting by search and multi-view embedding. In ACM Multimedia. ACM, 928--937.
  65. Yehao Li, Ting Yao, Yingwei Pan, Hongyang Chao, and Tao Mei. 2018. Jointly localizing and describing events for dense video captioning. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 7492--7500.
  66. Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the ACL Workshop on Text Summarization Branches Out. 10 pages.
  67. Guosheng Lin, Anton Milan, Chunhua Shen, and Ian D. Reid. 2016. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In arXiv:1611.06612.
  68. Min Lin, Qiang Chen, and Shuicheng Yan. 2013. Network in network. In arXiv:1312.4400.
  69. Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. 2017. Feature pyramid networks for object detection. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 936--944.
  70. Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 2999--3007.
  71. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision. Springer, 740--755.
  72. Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. 2017. Optimization of image description metrics using policy gradient methods. In arXiv:1612.00370.
  73. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, and Scott E. Reed. 2015. SSD: Single shot MultiBox detector. In arXiv:1512.02325.
  74. Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2014. Fully convolutional networks for semantic segmentation. In arXiv:1411.4038.
  75. Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. 2017. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 3242--3250.
  76. Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. 2018. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision. Springer, 122--138.
  77. Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. 2013. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing, Vol. 28.
  78. Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan L. Yuille. 2014. Explain images with multimodal recurrent neural networks. In arXiv:1410.1090.
  79. Xiao-Jiao Mao, Chunhua Shen, and Yu-Bin Yang. 2016. Image restoration using very deep fully convolutional encoder-decoder networks with symmetric skip connections. In Advances in Neural Information Processing Systems. Curran Associates, 2810--2818.
  80. Mehdi Mirza and Simon Osindero. 2014. Conditional generative adversarial nets. In arXiv:1411.1784.
  81. Margaret Mitchell, Xufeng Han, Amit Goyal, et al. 2012. Midge: Generating image descriptions from computer vision detections. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 747--756.
  82. Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning. Omnipress, 807--814.
  83. Augustus Odena, Christopher Olah, and Jonathon Shlens. 2016. Conditional image synthesis with auxiliary classifier GANs. In arXiv:1610.09585.
  84. Yingwei Pan, Yehao Li, Ting Yao, Tao Mei, Houqiang Li, and Yong Rui. 2016. Learning deep intrinsic video representation by exploring temporal coherence and graph structure. In Proceedings of the International Joint Conference on Artificial Intelligence. AAAI Press, 3832--3838.
  85. Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, and Yong Rui. 2016. Jointly modeling embedding and translation to bridge video and language. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 4594--4602.
  86. Yingwei Pan, Zhaofan Qiu, Ting Yao, Houqiang Li, and Tao Mei. 2017. Seeing bot. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 1341--1344.
  87. Yingwei Pan, Zhaofan Qiu, Ting Yao, Houqiang Li, and Tao Mei. 2017. To create what you tell: Generating videos from captions. In ACM Multimedia. ACM, 1789--1798.
  88. Yingwei Pan, Ting Yao, Houqiang Li, and Tao Mei. 2017. Video captioning with transferred semantic attributes. In arXiv:1611.07675.
  89. Yingwei Pan, Ting Yao, Tao Mei, Houqiang Li, Chong-Wah Ngo, and Yong Rui. 2014. Click-through-based cross-view learning for image search. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 717--726.
  90. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics. 311--318.
  91. Adam Paszke, Abhishek Chaurasia, Sangpil Kim, and Eugenio Culurciello. 2016. ENet: A deep neural network architecture for real-time semantic segmentation. In arXiv:1606.02147.
  92. Pauline Luc, Camille Couprie, Soumith Chintala, and Jakob Verbeek. 2016. Semantic segmentation using adversarial networks. In arXiv:1611.08408.
  93. Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. 2017. Large kernel matters -- improve semantic segmentation by global convolutional network. In arXiv:1703.02719.
  94. Pedro H. O. Pinheiro and Ronan Collobert. 2015. From image-level to pixel-level labeling with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 1713--1721.
  95. B. T. Polyak and A. B. Juditsky. 1992. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization 30, 4 (1992), 838--855.
  96. Zhaofan Qiu, Qing Li, Ting Yao, Tao Mei, and Yong Rui. 2015. MSR Asia MSM at THUMOS Challenge 2015. In THUMOS Challenge Workshop of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society.
  97. Zhaofan Qiu, Ting Yao, and Tao Mei. 2017. Deep quantization: Encoding convolutional activations with deep generative model. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 4085--4094.
  98. Zhaofan Qiu, Ting Yao, and Tao Mei. 2017. Learning spatio-temporal representation with pseudo-3D residual networks. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 4489--4497.
  99. Zhaofan Qiu, Ting Yao, and Tao Mei. 2018. Learning deep spatio-temporal dependence for semantic video segmentation. IEEE Transactions on Multimedia 20, 4 (2018), 939--949.
  100. Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. In arXiv:1511.06434.
  101. Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. 2015. You only look once: Unified, real-time object detection. In arXiv:1506.02640.
  102. Joseph Redmon and Ali Farhadi. 2016. YOLO9000: Better, faster, stronger. In arXiv:1612.08242.
  103. Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016. Generative adversarial text-to-image synthesis. In Proceedings of the International Conference on Machine Learning, Vol. 48. JMLR, 1060--1069.
  104. Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2017. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 6 (2017), 1137--1149.
  105. Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 1179--1195.
  106. Marcus Rohrbach, Wei Qiu, Ivan Titov, Stefan Thater, Manfred Pinkal, and Bernt Schiele. 2013. Translating video content to natural language descriptions. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 433--440.
  107. Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In arXiv:1505.04597.
  108. David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1988. Learning representations by back-propagating errors. In Neurocomputing: Foundations of Research. MIT Press, 696--699.
  109. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, et al. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, 3 (2015), 211--252.
  110. David Saad. 1998. On-line Learning in Neural Networks. Cambridge University Press, New York, NY.
  111. Dominik Scherer, Andreas Müller, and Sven Behnke. 2010. Evaluation of pooling operations in convolutional architectures for object recognition. In Proceedings of the 20th International Conference on Artificial Neural Networks: Part III. 92--101.
  112. Zhiqiang Shen, Zhuang Liu, Jianguo Li, Yu-Gang Jiang, Yurong Chen, and Xiangyang Xue. 2017. DSOD: Learning deeply supervised object detectors from scratch. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 1937--1945.
  113. Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In Proceedings of the International Conference on Neural Information Processing Systems, Vol. 1. MIT Press, Cambridge, 568--576.
  114. Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. In arXiv:1409.1556.
  115. Shuochen Su, Mauricio Delbracio, Jue Wang, Guillermo Sapiro, Wolfgang Heidrich, and Oliver Wang. 2016. Deep video deblurring. In arXiv:1611.08387.
  116. Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems.
  117. Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke. 2016. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In arXiv:1602.07261.
  118. Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2014. Going deeper with convolutions. In arXiv:1409.4842.
  119. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Rethinking the inception architecture for computer vision. In arXiv:1512.00567.
  120. Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 4489--4497.
  121. Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. 2016. Instance normalization: The missing ingredient for fast stylization. In arXiv:1607.08022.
  122. Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 4566--4575.
  123. Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2015. Sequence to sequence -- video to text. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 4534--4542.
  124. Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. 2015. Translating videos to natural language using deep recurrent neural networks. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 1494--1504.
  125. Vincent Dumoulin and Francesco Visin. 2016. A guide to convolution arithmetic for deep learning. In arXiv:1603.07285.
  126. Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 3156--3164.
  127. Qi Wu, Chunhua Shen, Lingqiao Liu, Anthony Dick, and Anton van den Hengel. 2016. What value do explicit high level concepts have in vision to language problems? In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 203--212.
  128. Zuxuan Wu, Yu-Gang Jiang, Xi Wang, Hao Ye, and Xiangyang Xue. 2016. Multi-stream multi-class fusion of deep networks for video classification. In ACM Multimedia. ACM, 791--800.
  129. Zuxuan Wu, Yu-Gang Jiang, Xi Wang, Hao Ye, and Xiangyang Xue. 2015. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In ACM Multimedia. ACM, 461--470.
  130. Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2016. Aggregated residual transformations for deep neural networks. In arXiv:1611.05431.
  131. Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society.
  132. Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, et al. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning. PMLR, 2048--2057. Google ScholarGoogle ScholarDigital LibraryDigital Library
  133. Ning Xu, Brian Price, Scott Cohen, and Thomas Huang. 2017. Deep image matting. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 311--320.Google ScholarGoogle ScholarCross RefCross Ref
  134. Yezhou Yang, Ching Lik Teo, Hal Daumé III, and Yiannis Aloimonos. 2011. Corpus-guided sentence generation of natural images. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 444--454. Google ScholarGoogle ScholarDigital LibraryDigital Library
  135. Zhilin Yang, Ye Yuan, Yuexin Wu, William W. Cohen, and Ruslan R. Salakhutdinov. 2016. Review networks for caption generation. In Proceedings of the Advances in Neural Information Processing Systems. 2361--2369. Google ScholarGoogle ScholarDigital LibraryDigital Library
  136. L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. 2015. Describing videos by exploiting temporal structure. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 4507--4515. Google ScholarGoogle ScholarDigital LibraryDigital Library
  137. Ting Yao, Yehao Li, Zhaofan Qiu, Fuchen Long, Yingwei Pan, Dong Li, and Tao Mei. 2017. MSR Asia MSM at ActivityNet challenge 2017: Trimmed action recognition, temporal action proposals and dense-captioning events in videos. In CVPR ActivityNet Challenge Workshop.Google ScholarGoogle Scholar
  138. Ting Yao, Tao Mei, and Chong-Wah Ngo. 2015. Learning query and image similarities with ranking canonical correlation analysis. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 28--36. Google ScholarGoogle ScholarDigital LibraryDigital Library
  139. Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. 2017. Incorporating copying mechanism in image captioning for learning novel objects. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 5263--5271.Google ScholarGoogle ScholarCross RefCross Ref
  140. Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. 2018. Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision. Springer, 711--727.Google ScholarGoogle ScholarCross RefCross Ref
  141. Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, and Tao Mei. 2017. Boosting image captioning with attributes. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 4904--4912.Google ScholarGoogle ScholarCross RefCross Ref
  142. Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. 2017. DualGAN: Unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 2868--2876.Google ScholarGoogle ScholarCross RefCross Ref
  143. Yibo Yang, Zhisheng Zhong, Tiancheng Shen, and Zhouchen Lin. 2018. Convolutional neural networks with alternately updated clique. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2413--2422.Google ScholarGoogle ScholarCross RefCross Ref
  144. Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 4651--4659.Google ScholarGoogle ScholarCross RefCross Ref
  145. Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. 2018. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision. Springer, 334--349.Google ScholarGoogle ScholarCross RefCross Ref
  146. Fisher Yu and Vladlen Koltun. 2015. Multi-scale context aggregation by dilated convolutions. In arXiv:1511.07122.Google ScholarGoogle Scholar
  147. Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, and Wei Xu. 2016. Video paragraph captioning using hierarchical recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 4584--4593.Google ScholarGoogle ScholarCross RefCross Ref
  148. H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. 2017. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In arXiv:1612.03242.Google ScholarGoogle Scholar
  149. Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. 2017. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 6848--6856.Google ScholarGoogle Scholar
  150. Yiheng Zhang, Zhaofan Qiu, Ting Yao, Dong Liu, and Tao Mei. 2018. Fully convolutional adaptation networks for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 6810--6818.Google ScholarGoogle ScholarCross RefCross Ref
  151. Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. 2016. Pyramid scene parsing network. In arXiv:1612.01105.Google ScholarGoogle Scholar
  152. Nuno Vasconcelos Zhaowei Cai. 2018. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 6154--6162.Google ScholarGoogle Scholar
  153. Luowei Zhou, Chenliang Xu, Parker Koch, and Jason J Corso. 2016. Image caption generation with text-conditional semantic attention. In arXiv:1606.04621.Google ScholarGoogle Scholar
  154. Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 2242--2251Google ScholarGoogle ScholarCross RefCross Ref
  155. Wentao Zhu, Xiang Xiang, Trac D. Tran, and Xiaohui Xie. 2016. Adversarial deep structural networks for mammographic mass segmentation. In arXiv:1612.05970.Google ScholarGoogle Scholar


          • Published in

          ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 15, Issue 1s
          Special Section on Deep Learning for Intelligent Multimedia Analytics and Special Section on Multi-Modal Understanding of Social, Affective and Subjective Attributes of Data
          January 2019, 265 pages
          ISSN: 1551-6857
          EISSN: 1551-6865
          DOI: 10.1145/3309769

          Copyright © 2019 ACM

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 24 January 2019
          • Revised: 1 September 2018
          • Accepted: 1 September 2018
          • Received: 1 June 2018
          Published in TOMM Volume 15, Issue 1s


          Qualifiers

          • survey
          • Research
          • Refereed
