Abstract
We investigate the use of word embeddings for query translation to improve precision in cross-language information retrieval (CLIR). Word vectors represent words in a distributional space such that syntactically or semantically similar words lie close to each other. Multilingual word embeddings are constructed so that similar words across languages have similar vector representations. We explore how bilingual and multilingual word embeddings learned from comparable corpora of Indic languages can be used effectively for CLIR.
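As a rough illustration of what "similar words across languages have similar vector representations" means in practice, the sketch below looks up the nearest target-language word to a query term by cosine similarity. The vectors are made-up toy values, the Hindi words are transliterated, and the names `SHARED_SPACE` and `nearest_in_language` are hypothetical; this is not the article's implementation.

```python
from math import sqrt

# Toy shared embedding space (made-up values, for illustration only).
# Keys are (language, word) pairs; Hindi words are transliterated.
SHARED_SPACE = {
    ("en", "water"): (0.90, 0.10, 0.05),
    ("en", "book"): (0.05, 0.95, 0.10),
    ("hi", "paani"): (0.88, 0.15, 0.02),
    ("hi", "kitaab"): (0.10, 0.90, 0.05),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_in_language(query, target_lang, space, k=1):
    """Return the k target-language words closest to the query term."""
    query_vec = space[query]
    scored = [
        (cosine(query_vec, vec), word)
        for (lang, word), vec in space.items()
        if lang == target_lang
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [word for _, word in scored[:k]]


print(nearest_in_language(("en", "water"), "hi", SHARED_SPACE))
```

With real embeddings the candidate set would be the full target-language vocabulary, so an approximate nearest-neighbour index would likely replace the linear scan.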
We propose a clustering method based on multilingual word vectors to group similar words across languages. To do this, we construct a graph whose nodes are words from multiple languages and whose edges connect words with similar vectors. We then use the Louvain method for community detection to find communities in this graph. We show that choosing target-language words as query translations from the clusters (communities) containing the query terms improves CLIR. We also find that better-quality query translations are obtained when words from more languages are used for the clustering, even when the additional languages are neither the source nor the target language. This is probably because having more similar words across multiple languages helps form well-defined, dense subclusters, which in turn yield more precise query translations.
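The pipeline described above (similarity graph, community detection, translation lookup within the query term's community) can be sketched roughly as follows. Several things here are assumptions rather than details from the article: the toy vectors, the transliterated Hindi and Bengali words, the 0.8 similarity threshold, and all function names. The community step implements only the local-moving phase of Louvain (no aggregation pass), which is enough for a graph this small; a full Louvain implementation from a graph library would be the natural substitute.

```python
from collections import defaultdict
from math import sqrt

# Toy multilingual vectors (made-up values); keys are (language, word).
VECTORS = {
    ("en", "water"): (0.90, 0.10, 0.05),
    ("hi", "paani"): (0.88, 0.15, 0.02),
    ("bn", "jol"): (0.92, 0.08, 0.10),
    ("en", "book"): (0.05, 0.95, 0.10),
    ("hi", "kitaab"): (0.10, 0.90, 0.05),
    ("bn", "boi"): (0.02, 0.93, 0.12),
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def build_graph(vectors, threshold=0.8):
    """Connect words whose cosine similarity meets the (assumed) threshold."""
    adj = {node: {} for node in vectors}
    nodes = list(vectors)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            sim = cosine(vectors[u], vectors[v])
            if sim >= threshold:
                adj[u][v] = sim
                adj[v][u] = sim
    return adj


def local_moving(adj):
    """Local-moving phase of Louvain: greedily move nodes to the
    neighbouring community with the largest modularity gain."""
    degree = {n: sum(nbrs.values()) for n, nbrs in adj.items()}
    two_m = sum(degree.values())  # twice the total edge weight
    community = {n: n for n in adj}  # start with singleton communities
    if two_m == 0:
        return community
    total = dict(degree)  # summed degree per community label
    moved = True
    while moved:
        moved = False
        for node in sorted(adj):
            current = community[node]
            total[current] -= degree[node]  # take the node out temporarily
            links = defaultdict(float)  # edge weight from node to each community
            for nbr, weight in adj[node].items():
                links[community[nbr]] += weight
            best = current
            best_gain = links.get(current, 0.0) - total[current] * degree[node] / two_m
            for label, k_in in links.items():
                gain = k_in - total[label] * degree[node] / two_m
                if gain > best_gain:
                    best, best_gain = label, gain
            community[node] = best
            total[best] += degree[node]
            if best != current:
                moved = True
    return community


def translate(term, target_lang, community):
    """Target-language words that share a community with the query term."""
    label = community[term]
    return sorted(
        word
        for (lang, word), c in community.items()
        if c == label and lang == target_lang
    )


graph = build_graph(VECTORS)
communities = local_moving(graph)
print(translate(("en", "water"), "hi", communities))
```

Note that the Bengali words take part in the clustering even when the query is English and the target is Hindi, which mirrors the observation that additional languages can help shape the communities.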
In this article, we demonstrate the use of multilingual word embeddings and word clusters for CLIR involving Indic languages. We also make available a tool for obtaining related words and visualizations of the multilingual word vectors for English, Hindi, Bengali, Marathi, Gujarati, and Tamil.
Using Communities of Words Derived from Multilingual Word Vectors for Cross-Language Information Retrieval in Indian Languages