Correspondence Autoencoders for Cross-Modal Retrieval

Published: 21 October 2015

Abstract

This article considers the problem of cross-modal retrieval, such as using a text query to search for images and vice versa. Several novel models based on different autoencoders are proposed for solving this problem. These models are constructed by correlating the hidden representations of a pair of autoencoders. A novel objective, which minimizes a linear combination of the representation learning error for each modality and the correlation learning error between the hidden representations of the two modalities, is used to train each model as a whole. Minimizing the correlation learning error forces the model to learn hidden representations containing only the information common to the two modalities, while minimizing the representation learning error makes the hidden representations good enough to reconstruct the inputs of each modality. A specific parameter in the models balances these two kinds of errors. Furthermore, the models are divided into two groups according to the modalities they attempt to reconstruct. One group of three models, named multimodal reconstruction correspondence autoencoders, reconstructs both modalities; the other group of two models, named unimodal reconstruction correspondence autoencoders, reconstructs a single modality. The proposed models are evaluated on three publicly available datasets, and the experiments demonstrate that the proposed correspondence autoencoders perform significantly better than three canonical-correlation-analysis-based models and two popular multimodal deep models on cross-modal retrieval tasks.
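The objective described in the abstract can be sketched numerically as follows. This is a minimal illustration, not the paper's exact formulation: the variable names, layer sizes, tanh encoders, and squared-error form of each term are all assumptions, and the weights here are random and untrained, so the snippet only shows how the reconstruction and correlation terms are combined by the balance parameter alpha.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy paired features for two modalities (e.g., image and text descriptors).
X = rng.standard_normal((8, 16))   # modality 1 (image) features
Y = rng.standard_normal((8, 32))   # modality 2 (text) features

# One-layer encoder/decoder per modality (hypothetical sizes).
d_hidden = 4
We_x = rng.standard_normal((16, d_hidden)) * 0.1
Wd_x = rng.standard_normal((d_hidden, 16)) * 0.1
We_y = rng.standard_normal((32, d_hidden)) * 0.1
Wd_y = rng.standard_normal((d_hidden, 32)) * 0.1

def corr_ae_loss(alpha):
    """Linear combination of representation and correlation learning errors.

    alpha = 0 trains two independent autoencoders; alpha = 1 only ties the
    hidden representations together.
    """
    Hx = np.tanh(X @ We_x)                   # hidden representation, modality 1
    Hy = np.tanh(Y @ We_y)                   # hidden representation, modality 2
    rec_x = np.mean((Hx @ Wd_x - X) ** 2)    # representation learning error, modality 1
    rec_y = np.mean((Hy @ Wd_y - Y) ** 2)    # representation learning error, modality 2
    corr = np.mean((Hx - Hy) ** 2)           # correlation learning error between hidden codes
    return (1 - alpha) * (rec_x + rec_y) + alpha * corr

print(corr_ae_loss(alpha=0.2))
```

In training, this combined loss would be minimized over the encoder and decoder weights of both branches; the balance parameter alpha trades off faithful per-modality reconstruction against agreement between the two hidden representations.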



Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 12, Issue 1s
Special Issue on Smartphone-Based Interactive Technologies, Systems, and Applications and Special Issue on Extended Best Papers from ACM Multimedia 2014
October 2015, 317 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/2837676

Copyright © 2015 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 January 2015
• Revised: 1 May 2015
• Accepted: 1 July 2015
• Published: 21 October 2015
