Abstract
The multimedia community has witnessed the rise of deep learning-based techniques for analyzing multimedia content more effectively. In the past decade, the convergence of deep learning and multimedia analytics has boosted the performance of several traditional tasks, such as classification, detection, and regression, and has also fundamentally changed the landscape of several relatively new areas, such as semantic segmentation, captioning, and content generation. This article reviews the development path of the major tasks in multimedia analytics and looks ahead to future directions. We begin by summarizing the fundamental deep learning techniques relevant to multimedia analytics, especially in the visual domain, and then review representative high-level tasks powered by recent advances. Moreover, a review of performance on popular benchmarks traces the trajectory of technological progress and helps identify both milestone works and promising future directions.
- Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic propositional image caption evaluation. In Proceedings of the European Conference on Computer Vision. Springer, 382--398.Google Scholar
Cross Ref
- Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 6077--6086.Google Scholar
Cross Ref
- M. Arjovsky, S. Chintala, and L. Bottou. 2017. Wasserstein GAN. In arXiv:1701.07875.Google Scholar
- Vijay Badrinarayanan, Ankur Handa, and Roberto Cipolla. 2015. SegNet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. In arXiv:1505.07293.Google Scholar
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. In arXiv:1409.0473.Google Scholar
- Nicolas Ballas, Li Yao, Chris Pal, and Aaron Courville. 2015. Delving deeper into convolutional networks for learning video representations. In arXiv:1511.06432.Google Scholar
- Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Association for Computational Linguistics, 65--72.Google Scholar
- Lorenzo Baraldi, Costantino Grana, and Rita Cucchiara. 2017. Hierarchical boundary-aware neural encoder for video captioning. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 3185--3194.Google Scholar
Cross Ref
- Larry S. Davis and Bharat Singh. 2018. An analysis of scale invariance in object detection. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 3578--3587.Google Scholar
- David L. Chen and William B. Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 190--200. Google Scholar
Digital Library
- Jingwen Chen, Ting Yao, and Hongyang Chao. 2018. See and chat: Automatically generating viewer-level comments on images. Multimedia Tools and Applications. (In press).Google Scholar
- Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. 2014. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In arXiv:1412.7062.Google Scholar
- Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. 2018. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 4 (2018), 834--848.Google Scholar
Cross Ref
- Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. 2017. Rethinking atrous convolution for semantic image segmentation. In arXiv: 1706.05587.Google Scholar
- Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. In arXiv:1802.02611.Google Scholar
- Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. In arXiv:1504.00325.Google Scholar
- François Chollet. 2016. Xception: Deep learning with depthwise separable convolutions. In arXiv:1610.02357.Google Scholar
- Djork-Arnãl' Clevert, Thomas Unterthiner, and Sepp Hochreiter. 2015. Fast and accurate deep network learning by exponential linear units (ELUs). In arXiv:1511.07289.Google Scholar
- Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. 2016. The Cityscapes dataset for semantic urban scene understanding. In arXiv:1604.01685.Google Scholar
- Jifeng Dai, Kaiming He, and Jian Sun. 2015. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In IEEE International Conference on Computer Vision. IEEE Computer Society, 1635--1643. Google Scholar
Digital Library
- Emily L. Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. 2015. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems. The MIT Press, 1486--1494. Google Scholar
Digital Library
- Jacob Devlin, Hao Cheng, Hao Fang, Saurabh Gupta, Li Deng, Xiaodong He, Geoffrey Zweig, and Margaret Mitchell. 2015. Language models for image captioning: The quirks and what works. In Proceedings of Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 100--105.Google Scholar
Cross Ref
- Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2625--2634.Google Scholar
Cross Ref
- Mark Everingham, S. M. Eslami, Luc Gool, Christopher K. Williams, John Winn, and Andrew Zisserman. 2015. The Pascal visual object classes challenge: A retrospective. Int. J. Comput. Vision 111, 1 (2015), 98--136. Google Scholar
Digital Library
- Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In Proceedings of the European Conference on Computer Vision. Springer, 15--29. Google Scholar
Digital Library
- Mostafa Gamal, Mennatullah Siam, and Moemen Abdel-Razek. 2018. ShuffleSeg: Real-time semantic segmentation network. In arXiv:1803.03816.Google Scholar
- Ross Girshick. 2015. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision. IEEE Computer Society, 1440--1448. Google Scholar
Digital Library
- Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2013. Rich feature hierarchies for accurate object detection and semantic segmentation. In arXiv:1311.2524. Google Scholar
Digital Library
- Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. JMLR W&CP: Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, Vol. 9. 249--256.Google Scholar
- Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, Vol. 2. 2672--2680. Google Scholar
Digital Library
- Alex Graves. 2013. Generating sequences with recurrent neural networks. In arXiv:1308.0850.Google Scholar
- Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2013. Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In IEEE International Conference on Computer Vision. IEEE Computer Society, 2712--2719. Google Scholar
Digital Library
- Richard H. R. Hahnloser, Rahul Sarpeshkar, Misha Mahowald, Rodney J. Douglas, and H. Sebastian Seung. 2000. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature 405, 6789 (2000), 947--51.Google Scholar
- Jun Han and Claudio Moraga. 1995. The Influence of the Sigmoid Function Parameters on the Speed of Backpropagation Learning, Vol. 930, Lecture Notes in Computer Science. Springer, 195--201. Google Scholar
Digital Library
- Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. 2017. Mask R-CNN. In IEEE International Conference on Computer Vision. IEEE Computer Society, 2980--2988.Google Scholar
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2014. Spatial pyramid pooling in deep convolutional networks for visual recognition. In arXiv:1406.4729.Google Scholar
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. In arXiv:1512.03385.Google Scholar
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 1026--1034. Google Scholar
Digital Library
- Chiori Hori, Takaaki Hori, Teng-Yok Lee, Ziming Zhang, Bret Harsham, John R. Hershey, Tim K. Marks, and Kazuhiko Sumi. 2017. Attention-based multimodal fusion for video description. In IEEE International Conference on Computer Vision. IEEE Computer Society, 4203--4212.Google Scholar
Cross Ref
- Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. In arXiv:1704.04861.Google Scholar
- Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. 2018. Relation networks for object detection. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 3588--3597.Google Scholar
Cross Ref
- Gao Huang, Shichen Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. CondenseNet: An efficient DenseNet using learned group convolutions. In arXiv:1711.09224.Google Scholar
- Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2261--2269.Google Scholar
- Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and 0.5MB model size. In arXiv:1602.07360.Google Scholar
- Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, Vol. 37. Omnipress, 448--456. Google Scholar
Digital Library
- Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2016. Image-to-image translation with conditional adversarial networks. In arXiv:1611.07004.Google Scholar
- Yu-Gang Jiang, Zuxuan Wu, Jinhui Tang, Zechao Li, Xiangyang Xue, and Shih-Fu Chang. 2018. Modeling multimodal clues in a hybrid deep learning framework for video classification. IEEE Transactions on Multimedia 20, 11 (2018), 3137--3147.Google Scholar
Digital Library
- Li Shen Jie Hu and Gang Sun. 2018. Squeeze-and-excitation networks. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 7132--7141.Google Scholar
- Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2017. Progressive growing of GANs for improved quality, stability, and variation. In arXiv:1710.10196v2.Google Scholar
- Alex Kendall, Vijay Badrinarayanan, and Roberto Cipolla. 2015. Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. In arXiv:1511.02680.Google Scholar
- Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. 2017. Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the International Conference on Machine Learning, Vol. 70. 1857--1865. Google Scholar
Digital Library
- Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In arXiv:1412.6980.Google Scholar
- Atsuhiro Kojima, Takeshi Tamura, and Kunio Fukunaga. 2002. Natural language description of human activities from video images based on concept hierarchy of actions. International Journal of Computer Vision 50, 2 (2002), 171--184. Google Scholar
Digital Library
- Tao Kong, Fuchun Sun, Anbang Yao, Huaping Liu, Ming Lu, and Yurong Chen. 2017. RON: Reverse connection with objectness prior networks for object detection. In arXiv:1707.01691.Google Scholar
- Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. 2017. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 706--715.Google Scholar
Cross Ref
- Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25. 1097--1105. Google Scholar
Digital Library
- Girish Kulkarni, Visruth Premraj, Vicente Ordonez, et al. 2013. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 12 (2013), 2891--2903. Google Scholar
Digital Library
- Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. 2017. Deep Laplacian pyramid networks for fast and accurate super-resolution. In arXiv:1704.03915.Google Scholar
- Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. 1989. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1, 4 (1989), 541--551. Google Scholar
Digital Library
- Dong Li, Zhaofan Qiu, Qi Dai, Ting Yao, and Tao Mei. 2018. Recurrent tubelet proposal and recognition networks for action detection. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 306--322.Google Scholar
Cross Ref
- Dong Li, Ting Yao, Lingyu Duan, Tao Mei, and Yong Rui. 2018. Unified spatio-temporal attention networks for action recognition in videos. IEEE Transactions on Multimedia. (In Press).Google Scholar
- Qing Li, Zhaofan Qiu, Ting Yao, Tao Mei, Yong Rui, and Jiebo Luo. 2016. Action recognition by learning deep multi-granular spatio-temporal video representation. In Proceedings of the ACM on International Conference on Multimedia Retrieval. ACM, 159--166. Google Scholar
Digital Library
- Qing Li, Zhaofan Qiu, Ting Yao, Tao Mei, Yong Rui, and Jiebo Luo. 2017. Learning hierarchical video representation for action recognition. International Journal of Multimedia Information Retrieval 6, 1 (2017), 85--98.Google Scholar
Cross Ref
- Yehao Li, Ting Yao, Tao Mei, Hongyang Chao, and Yong Rui. 2016. Share-and-chat: Achieving human-level video commenting by search and multi-view embedding. In ACM Multimedia. ACM, 928--937. Google Scholar
Digital Library
- Yehao Li, Ting Yao, Yingwei Pan, Hongyang Chao, and Tao Mei. 2018. Jointly localizing and describing events for dense video captioning. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 7492--7500.Google Scholar
Cross Ref
- Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Proceedings of the ACL Workshop on Text Summarization Branches Out. 10 pages.Google Scholar
- Guosheng Lin, Anton Milan, Chunhua Shen, and Ian D. Reid. 2016. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In arXiv:1611.06612.Google Scholar
- Min Lin, Qiang Chen, and Shuicheng Yan. 2013. Network in network. In arXiv:1312.4400.Google Scholar
- Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. 2017. Feature pyramid networks for object detection. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 936--944.Google Scholar
- Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 2999--3007.Google Scholar
Cross Ref
- Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 740--755.Google Scholar
- Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. 2017. Optimization of image description metrics using policy gradient methods. In arXiv:1612.00370.Google Scholar
- Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, and Scott E. Reed. 2015. SSD: Single shot MultiBox detector. In arXiv:1512.02325.Google Scholar
- Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2014. Fully convolutional networks for semantic segmentation. In arXiv:1411.4038.Google Scholar
- Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. 2017. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 3242--3250.Google Scholar
Cross Ref
- Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. 2018. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 122--138.Google Scholar
Cross Ref
- Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. 2013. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing, Vol. 28.Google Scholar
- Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan L. Yuille. 2014. Explain images with multimodal recurrent neural networks. In arXiv:arXiv:1410.1090.Google Scholar
- Xiao-Jiao Mao, Chunhua Shen, and Yu-Bin Yang. 2016. Image restoration using very deep fully convolutional encoder-decoder networks with symmetric skip connections. In Advances in Neural Information Processing Systems. Curran Associates, 2810--2818. Google Scholar
Digital Library
- M. Mirza and S. Osindero. 2014. Conditional generative adversarial nets. In arXiv:1411.1784.Google Scholar
- Margaret Mitchell, Xufeng Han, Amit Goyal, et al. 2012. Midge: Generating image descriptions from computer vision detections. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics. The Association for Computer Linguistics, 747--756. Google Scholar
Digital Library
- Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning. Omnipress, 807--814. Google Scholar
Digital Library
- Augustus Odena, Christopher Olah, and Jonathon Shlens. 2016. Conditional image synthesis with auxiliary classifier GANs. In arXiv:1610.09585.Google Scholar
- Yingwei Pan, Yehao Li, Ting Yao, Tao Mei, Houqiang Li, and Yong Rui. 2016. Learning deep intrinsic video representation by exploring temporal coherence and graph structure. In Proceedings of the International Joint Conference on Artificial Intelligence. AAAI Press, 3832--3838. Google Scholar
Digital Library
- Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, and Yong Rui. 2016. Jointly modeling embedding and translation to bridge video and language. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 4594--4602.Google Scholar
Cross Ref
- Yingwei Pan, Zhaofan Qiu, Ting Yao, Houqiang Li, and Tao Mei. 2017. Seeing bot. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 1341--1344. Google Scholar
Digital Library
- Yingwei Pan, Zhaofan Qiu, Ting Yao, Houqiang Li, and Tao Mei. 2017. To create what you tell: Generating videos from captions. In ACM Multimedia. ACM, 1789--1798. Google Scholar
Digital Library
- Yingwei Pan, Ting Yao, Houqiang Li, and Tao Mei. 2017. Video captioning with transferred semantic attributes. In arXiv:1611.07675.Google Scholar
- Yingwei Pan, Ting Yao, Tao Mei, Houqiang Li, Chong-Wah Ngo, and Yong Rui. 2014. Click-through-based cross-view learning for image search. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval. 717--726. Google Scholar
Digital Library
- Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the Annual Meeting on Association for Computational Linguistics. 311--318 Google Scholar
Digital Library
- Adam Paszke, Abhishek Chaurasia, Sangpil Kim, and Eugenio Culurciello. 2016. ENet: A deep neural network architecture for real-time semantic segmentation. In arXiv:1606.02147.Google Scholar
- Pauline Luc, Camille Couprie, Soumith Chintala, and Jakob Verbeek. 2016. Semantic segmentation using adversarial networks. In arXiv:1611.08408.Google Scholar
- Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. 2017. Large kernel matters - improve semantic segmentation by global convolutional network. In arXiv:1703.02719.Google Scholar
- Pedro H. O. Pinheiro and Ronan Collobert. 2015. From image-level to pixel-level labeling with Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 1713--1721.Google Scholar
- B. T. Polyak and A. B. Juditsky. 1992. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30, 4 (July 1992), 838--855. Google Scholar
Digital Library
- Zhaofan Qiu, Qing Li, Ting Yao, Tao Mei, and Yong Rui. 2015. MSR Asia MSM at THUMOS Challenge 2015. In THUMOS Challenge Workshop of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society.Google Scholar
- Zhaofan Qiu, Ting Yao, and Tao Mei. 2017. Deep quantization: Encoding convolutional activations with deep generative model. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 4085--4094.Google Scholar
Cross Ref
- Zhaofan Qiu, Ting Yao, and Tao Mei. 2017. Learning spatio-temporal representation with pseudo-3D residual networks. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 4489--4497.Google Scholar
Cross Ref
- Zhaofan Qiu, Ting Yao, and Tao Mei. 2018. Learning deep spatio-temporal dependence for semantic video segmentation. IEEE Transactions on Multimedia 20, 4 (2018), 939--949. Google Scholar
Digital Library
- Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. In arXiv:1511.06434.Google Scholar
- Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. 2015. You only look once: Unified, real-time object detection. In arXiv:1506.02640.Google Scholar
- Joseph Redmon and Ali Farhadi. 2016. YOLO9000: Better, faster, stronger. In arXiv:1612.08242.Google Scholar
- Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016. Generative adversarial text-to-image synthesis. In Proceedings of the International Conference on International Conference on Machine Learning, Vol. 48. JMLR, 1060--1069. Google Scholar
Digital Library
- Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2017. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39, 6 (2017), 1137--1149. Google Scholar
Digital Library
- Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 1179--1195.Google Scholar
Cross Ref
- Marcus Rohrbach, Wei Qiu, Ivan Titov, Stefan Thater, Manfred Pinkal, and Bernt Schiele. 2013. Translating video content to natural language descriptions. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 433--440. Google Scholar
Digital Library
- Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In arXiv:1505.04597.Google Scholar
- David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1988. Learning Representations by Back-propagating Errors. In Neurocomputing: Foundations of Research. MIT Press, 696--699. Google Scholar
Digital Library
- Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, et al. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115, 3 (2015), 211--252. Google Scholar
Digital Library
- David Saad. 1998. On-line Learning in Neural Networks. Cambridge University Press, New York, NY. Google Scholar
Digital Library
- Dominik Scherer, Andreas Müller, and Sven Behnke. 2010. Evaluation of pooling operations in convolutional architectures for object recognition. In Proceedings of the 20th International Conference on Artificial Neural Networks: Part III. 92--101. Google Scholar
Digital Library
- Zhiqiang Shen, Zhuang Liu, Jianguo Li, Yu-Gang Jiang, Yurong Chen, and Xiangyang Xue. 2017. DSOD: Learning deeply supervised object detectors from scratch. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 1937--1945.Google Scholar
Cross Ref
- Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In Proceedings of the International Conference on Neural Information Processing Systems, Vol 1. MIT Press, Cambridge, 568--576. Google Scholar
Digital Library
- K. Simonyan and A. Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. In arXiv:1409.1556.Google Scholar
- Shuochen Su, Mauricio Delbracio, Jue Wang, Guillermo Sapiro, Wolfgang Heidrich, and Oliver Wang. 2016. Deep video deblurring. In arXiv:1611.08387.Google Scholar
- Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS. Google Scholar
Digital Library
- Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke. 2016. Inception-v4, inception-ResNet and the impact of residual connections on learning. In arXiv:1602.07261.Google Scholar
- Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2014. Going deeper with convolutions. In arXiv:1409.4842.Google Scholar
- Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Rethinking the inception architecture for computer vision. In arXiv:1512.00567.Google Scholar
- Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 4489--4497. Google Scholar
Digital Library
- Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. 2016. Instance normalization: The missing ingredient for fast stylization. In arXiv:1607.08022.Google Scholar
Digital Library
- Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 4566--4575.Google Scholar
Cross Ref
- S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. 2015. Sequence to sequence - video to text. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 4534--4542. Google Scholar
Digital Library
- Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. 2015. Translating videos to natural language using deep recurrent neural networks. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 1494--1504.Google Scholar
Cross Ref
- Dumoulin Vincent and Visin Francesco. 2016. A guide to convolution arithmetic for deep learning. In arXiv:1603.07285.Google Scholar
- Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 3156--3164.Google Scholar
Cross Ref
- Qi Wu, Chunhua Shen, Lingqiao Liu, Anthony Dick, and Anton van den Hengel. 2016. What value do explicit high level concepts have in vision to language problems?. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 203--212.Google Scholar
- Zuxuan Wu, Yu-Gang Jiang, Xi Wang, Hao Ye, and Xiangyang Xue. 2016. Multi-stream multi-class fusion of deep networks for video classification. ACM Multimedia (2016). ACM, 791--800. Google Scholar
Digital Library
- Zuxuan Wu, Yu-Gang Jiang, Xi Wang, Hao Ye, and Xiangyang Xue. 2015. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. Proceedings of ACM Multimedia (2015). ACM, 461--470. Google Scholar
Digital Library
- Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2016. Aggregated residual transformations for deep neural networks. In arXiv:1611.05431.Google Scholar
- Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 1063--6919.Google Scholar
Cross Ref
- Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, et al. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning. PMLR, 2048--2057. Google Scholar
Digital Library
- Ning Xu, Brian Price, Scott Cohen, and Thomas Huang. 2017. Deep image matting. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 311--320.Google Scholar
Cross Ref
- Yezhou Yang, Ching Lik Teo, Hal Daumé III, and Yiannis Aloimonos. 2011. Corpus-guided sentence generation of natural images. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 444--454.
- Zhilin Yang, Ye Yuan, Yuexin Wu, William W. Cohen, and Ruslan R. Salakhutdinov. 2016. Review networks for caption generation. In Proceedings of the Advances in Neural Information Processing Systems. 2361--2369.
- Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. 2015. Describing videos by exploiting temporal structure. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 4507--4515.
- Ting Yao, Yehao Li, Zhaofan Qiu, Fuchen Long, Yingwei Pan, Dong Li, and Tao Mei. 2017. MSR Asia MSM at ActivityNet challenge 2017: Trimmed action recognition, temporal action proposals and dense-captioning events in videos. In CVPR ActivityNet Challenge Workshop.
- Ting Yao, Tao Mei, and Chong-Wah Ngo. 2015. Learning query and image similarities with ranking canonical correlation analysis. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 28--36.
- Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. 2017. Incorporating copying mechanism in image captioning for learning novel objects. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 5263--5271.
- Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. 2018. Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision. Springer, 711--727.
- Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, and Tao Mei. 2017. Boosting image captioning with attributes. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 4904--4912.
- Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. 2017. DualGAN: Unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 2868--2876.
- Yibo Yang, Zhisheng Zhong, Tiancheng Shen, and Zhouchen Lin. 2018. Convolutional neural networks with alternately updated clique. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2413--2422.
- Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. 2016. Image captioning with semantic attention. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 4651--4659.
- Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. 2018. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision. Springer, 334--349.
- Fisher Yu and Vladlen Koltun. 2015. Multi-scale context aggregation by dilated convolutions. In arXiv:1511.07122.
- Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, and Wei Xu. 2016. Video paragraph captioning using hierarchical recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 4584--4593.
- Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas. 2017. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In arXiv:1612.03242.
- Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. 2018. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 6848--6856.
- Yiheng Zhang, Zhaofan Qiu, Ting Yao, Dong Liu, and Tao Mei. 2018. Fully convolutional adaptation networks for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 6810--6818.
- Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. 2016. Pyramid scene parsing network. In arXiv:1612.01105.
- Zhaowei Cai and Nuno Vasconcelos. 2018. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 6154--6162.
- Luowei Zhou, Chenliang Xu, Parker Koch, and Jason J. Corso. 2016. Image caption generation with text-conditional semantic attention. In arXiv:1606.04621.
- Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision. IEEE Computer Society, 2242--2251.
- Wentao Zhu, Xiang Xiang, Trac D. Tran, and Xiaohui Xie. 2016. Adversarial deep structural networks for mammographic mass segmentation. In arXiv:1612.05970.
Deep Learning–Based Multimedia Analytics: A Review