Abstract
With Web 2.0, the number of Web users and the volume of Web content have grown exponentially. Most of these users are not only consumers of information but also generators of it. Many express themselves in colloquial languages written in Roman script (transliteration). These texts are mostly informal and casual, and therefore seldom follow grammar rules. Moreover, no prescribed set of spelling rules exists for transliterated text. This freedom leads to large-scale spelling variation, which is a major challenge in mixed-script information processing. This article studies existing phonetic algorithms for handling spelling variation, points out their limitations, and proposes a novel phonetic encoding approach in two flavors, in the context of Hindi transliteration. Experiments on Hindi song lyrics retrieval in the mixed-script domain with three different retrieval models show that the proposed approaches outperform the existing techniques in a majority of cases (sometimes statistically significantly) on a number of metrics, including three cutoff-based measures of the form metric@k, MAP, MRR, and Recall.
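As a rough illustration of the kind of existing phonetic algorithm the article examines, the sketch below implements the classic American Soundex code in Python. It is not the encoding proposed in the article. The function name `soundex` and the romanized Hindi example words are assumptions chosen only for illustration.

```python
# Minimal sketch of classic American Soundex (an existing phonetic algorithm).
# Not the article's proposed encoding; names and examples are illustrative.

# Consonant groups that share a Soundex digit.
_GROUPS = (
    ("bfpv", "1"),
    ("cgjkqsxz", "2"),
    ("dt", "3"),
    ("l", "4"),
    ("mn", "5"),
    ("r", "6"),
)
CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex(word: str) -> str:
    """Return the 4-character Soundex code of a Roman-script word."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    out = [first.upper()]
    prev = CODES.get(first, "")  # the first letter's code also blocks repeats
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")  # vowels and y have no code
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    return "".join(out)[:4].ljust(4, "0")


if __name__ == "__main__":
    for w in ("mohabbat", "muhabbat", "mohabat", "zindagi", "jindagi"):
        print(w, soundex(w))
```

In this sketch the spelling variants `mohabbat`, `muhabbat`, and `mohabat` collapse to one code, while `zindagi` and `jindagi` do not, because Soundex always keeps the first letter verbatim. That is one example of the kind of limitation a transliteration-aware encoding would need to address.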
Index Terms
Query Expansion for Transliterated Text Retrieval