Abstract
Large-scale image datasets and deep convolutional neural networks (DCNNs) are the two primary driving forces behind the rapid recent progress in generic object recognition. While many network architectures have been designed in pursuit of lower error rates, few efforts have been devoted to enlarging existing datasets, owing to high labeling costs and concerns about unfair comparison. In this article, we aim to achieve lower error rates by augmenting existing datasets automatically. Our method leverages both the web and DCNNs: the web provides massive numbers of images with rich contextual information, and a DCNN replaces human annotators by labeling images automatically under the guidance of that web contextual information. Experiments show that our method can significantly scale up existing datasets, drawing on billions of web pages, with high accuracy. Performance on object recognition and transfer learning tasks improves significantly when the automatically augmented datasets are used, which demonstrates that more supervisory information has been gathered automatically from the web. Both the dataset and the models trained on it have been made publicly available.
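The idea of a DCNN labeling web images under the guidance of web context can be illustrated with a minimal sketch. The abstract does not specify the actual decision rule, so everything here is an assumption made for illustration: the function name `label_web_image`, the representation of context as a set of candidate class names extracted from the surrounding page text, and the `min_confidence` threshold are all hypothetical.

```python
from typing import Dict, Optional, Set


def label_web_image(
    class_probs: Dict[str, float],
    context_labels: Set[str],
    min_confidence: float = 0.9,
) -> Optional[str]:
    """Return a label only when the classifier and the web context agree.

    class_probs: class name -> probability from a DCNN (assumed given).
    context_labels: class names mentioned in the text around the image
        (hypothetical representation of "web contextual information").
    min_confidence: hypothetical acceptance threshold.
    """
    # Keep only the classes that the surrounding web text also suggests.
    candidates = {c: p for c, p in class_probs.items() if c in context_labels}
    if not candidates:
        return None  # context gives no usable guidance; skip the image

    # Pick the most probable class among the context-supported ones.
    best = max(candidates, key=candidates.get)

    # Accept the image only if the network is confident enough.
    return best if candidates[best] >= min_confidence else None


if __name__ == "__main__":
    probs = {"cat": 0.95, "dog": 0.03, "fox": 0.02}
    print(label_web_image(probs, {"cat", "pet"}))
    print(label_web_image(probs, {"dog"}))
```

In this sketch an image is added to the dataset only when both signals point to the same class, which is one plausible way to keep label noise low without human annotators; the published method may combine the two signals differently.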
Automatic Data Augmentation from Massive Web Images for Deep Visual Recognition