DOI: 10.1145/2964284.2964316

Transform-Invariant Convolutional Neural Networks for Image Classification and Search

Published: 01 October 2016

Abstract

Convolutional neural networks (CNNs) have achieved state-of-the-art results on many visual recognition tasks. However, current CNN models remain poorly invariant to spatial transformations of images. Intuitively, with sufficient layers and parameters, hierarchical combinations of convolution (matrix multiplication and non-linear activation) and pooling operations should be able to learn a robust mapping from transformed input images to transform-invariant representations. In this paper, we propose randomly transforming (rotating, scaling, and translating) the feature maps of CNNs during the training stage. This prevents CNN models from forming complex dependencies on the specific rotation, scale, and translation levels of the training images. Instead, each convolutional kernel learns to detect a feature that is generally helpful for producing the transform-invariant answer, given the combinatorially large variety of transform levels of its input feature maps. In this way, we require no extra training supervision and no modification to the optimization process or the training images. We show that random transformation yields significant improvements for CNNs on many benchmark tasks, including small-scale image recognition, large-scale image recognition, and image retrieval.
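The abstract page carries no code listing, so the following is a minimal PyTorch-style sketch of the idea as described above: a layer that applies a random rotation, scale, and translation to intermediate feature maps during training and acts as the identity at test time. The module name (RandomTransform), the transform ranges, and the use of affine_grid/grid_sample are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch (not the authors' code): randomly transform feature maps
# during training so downstream kernels see many transform levels.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class RandomTransform(nn.Module):
    """Applies a random per-sample rotation/scale/translation to feature
    maps during training; passes features through unchanged at eval time.
    All ranges below are illustrative defaults, not values from the paper."""
    def __init__(self, max_angle=math.pi / 6, scale_range=(0.8, 1.2), max_shift=0.1):
        super().__init__()
        self.max_angle = max_angle       # rotation sampled in [-max_angle, max_angle]
        self.scale_range = scale_range   # isotropic scale sampled uniformly
        self.max_shift = max_shift       # translation as a fraction of map size

    def forward(self, x):
        if not self.training:
            return x  # identity at test time
        n = x.size(0)
        angle = (torch.rand(n, device=x.device) * 2 - 1) * self.max_angle
        scale = torch.empty(n, device=x.device).uniform_(*self.scale_range)
        shift = (torch.rand(n, 2, device=x.device) * 2 - 1) * self.max_shift
        cos, sin = torch.cos(angle) * scale, torch.sin(angle) * scale
        # Build a batch of 2x3 affine sampling matrices (rotation and scale,
        # plus translation) for the normalized sampling grid.
        theta = torch.zeros(n, 2, 3, device=x.device)
        theta[:, 0, 0], theta[:, 0, 1], theta[:, 0, 2] = cos, -sin, shift[:, 0]
        theta[:, 1, 0], theta[:, 1, 1], theta[:, 1, 2] = sin, cos, shift[:, 1]
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)
```

In use, such a layer would sit between convolutional stages, e.g. nn.Sequential(conv1, RandomTransform(), conv2, ...), so that each downstream kernel sees its input features at many rotation, scale, and translation levels over the course of training.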




Published In

MM '16: Proceedings of the 24th ACM international conference on Multimedia
October 2016
1542 pages
ISBN: 9781450336031
DOI: 10.1145/2964284

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. convolutional neural networks
  2. transform invariance

Qualifiers

  • Research-article

Conference

MM '16: ACM Multimedia Conference
October 15 - 19, 2016
Amsterdam, The Netherlands

Acceptance Rates

MM '16 paper acceptance rate: 52 of 237 submissions, 22%
Overall acceptance rate: 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (last 12 months): 23
  • Downloads (last 6 weeks): 3

Reflects downloads up to 21 Jan 2025

Cited By

  • (2024) Learning Gaussian Data Augmentation in Feature Space for One-shot Object Detection in Manga. Proceedings of the 6th ACM International Conference on Multimedia in Asia, pp. 1-8. DOI: 10.1145/3696409.3700174. Online: 3-Dec-2024.
  • (2024) OneDConv: Generalized Convolution for Transform-Invariant Representation. 2024 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 2238-2243. DOI: 10.1109/SMC54092.2024.10831051. Online: 6-Oct-2024.
  • (2023) Restore Translation Using Equivariant Neural Networks. Neural Information Processing, pp. 583-603. DOI: 10.1007/978-981-99-8132-8_44. Online: 26-Nov-2023.
  • (2022) Convolutional Neural Network and Histogram of Oriented Gradient Based Invariant Handwritten MODI Character Recognition. Pattern Recognition and Image Analysis, 32(2), pp. 402-418. DOI: 10.1134/S1054661822020109. Online: 6-Jul-2022.
  • (2022) CNN-based network has Network Anisotropy -work harder to learn rotated feature than non-rotated feature. 2022 IEEE Applied Imagery Pattern Recognition Workshop (AIPR), pp. 1-5. DOI: 10.1109/AIPR57179.2022.10092224. Online: 11-Oct-2022.
  • (2022) Data augmentation: A comprehensive survey of modern approaches. Array, 16, article 100258. DOI: 10.1016/j.array.2022.100258. Online: Dec-2022.
  • (2022) Generating unrestricted adversarial examples via three parameteres. Multimedia Tools and Applications, 81(15), pp. 21919-21938. DOI: 10.1007/s11042-022-12007-x. Online: 17-Mar-2022.
  • (2021) Deep Convolutional Neural Network for Object Classification. Handbook of Research on Deep Learning-Based Image Analysis Under Constrained and Unconstrained Environments, pp. 317-343. DOI: 10.4018/978-1-7998-6690-9.ch016. Online: 2021.
  • (2021) Spatial Assembly Networks for Image Representation Learning. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13871-13880. DOI: 10.1109/CVPR46437.2021.01366. Online: Jun-2021.
  • (2021) The analysis of constructing and evaluating tensor operation paralleling algorithms. Journal of Physics: Conference Series, 1999(1), article 012079. DOI: 10.1088/1742-6596/1999/1/012079. Online: 1-Sep-2021.
