Abstract
This article considers the problem of cross-modal retrieval, such as using a text query to search for images and vice versa. Several novel models based on different autoencoders are proposed for solving this problem. These models are constructed by correlating the hidden representations of a pair of autoencoders. A novel objective, which minimizes a linear combination of the representation learning error for each modality and the correlation learning error between the hidden representations of the two modalities, is used to train each model as a whole. Minimizing the correlation learning error forces the model to learn hidden representations that capture only the information common to both modalities, while minimizing the representation learning error makes the hidden representations good enough to reconstruct the inputs of each modality. A specific parameter in our models balances these two kinds of errors. Furthermore, the models are divided into two groups according to the modalities they attempt to reconstruct. One group, comprising three models, is named multimodal reconstruction correspondence autoencoder since it reconstructs both modalities. The other group, comprising two models, is named unimodal reconstruction correspondence autoencoder since it reconstructs a single modality. The proposed models are evaluated on three publicly available datasets. Experiments demonstrate that the proposed correspondence autoencoders perform significantly better than three canonical-correlation-analysis-based models and two popular multimodal deep models on cross-modal retrieval tasks.
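The objective described above can be sketched in code. The following is a minimal, illustrative example only, not the authors' implementation: it assumes one single-hidden-layer autoencoder per modality, mean-squared reconstruction errors, a squared-distance term standing in for the correlation learning error, and a hypothetical balance parameter `alpha`; all names and dimensions are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ModalAutoencoder:
    """Single-hidden-layer autoencoder for one modality (tied weights, illustrative only)."""
    def __init__(self, n_in, n_hidden):
        self.W = rng.normal(scale=0.1, size=(n_in, n_hidden))
        self.b = np.zeros(n_hidden)   # hidden bias
        self.c = np.zeros(n_in)       # reconstruction bias

    def encode(self, x):
        return sigmoid(x @ self.W + self.b)

    def decode(self, h):
        return sigmoid(h @ self.W.T + self.c)

def corr_ae_loss(ae_img, ae_txt, x_img, x_txt, alpha=0.8):
    """Linear combination of the per-modality reconstruction errors and
    the correlation error between the two hidden representations."""
    h_img = ae_img.encode(x_img)
    h_txt = ae_txt.encode(x_txt)
    # representation learning error: reconstruct each modality's own input
    rec = (np.mean((ae_img.decode(h_img) - x_img) ** 2)
           + np.mean((ae_txt.decode(h_txt) - x_txt) ** 2))
    # correlation learning error: hidden codes of paired inputs should agree
    corr = np.mean((h_img - h_txt) ** 2)
    # alpha trades off reconstruction quality against cross-modal agreement
    return alpha * rec + (1.0 - alpha) * corr

# Toy paired data: 4 image/text pairs with made-up feature dimensions.
x_img = rng.random((4, 64))
x_txt = rng.random((4, 32))
loss = corr_ae_loss(ModalAutoencoder(64, 16), ModalAutoencoder(32, 16),
                    x_img, x_txt)
```

Setting `alpha` close to 1 emphasizes faithful per-modality reconstruction, while a smaller `alpha` pushes the paired hidden representations toward each other, which is what makes retrieval across modalities possible in the shared hidden space.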
Correspondence Autoencoders for Cross-Modal Retrieval