research-article

Query Expansion for Transliterated Text Retrieval

Published: 20 July 2021

Abstract

With Web 2.0, there has been exponential growth in the number of Web users and the volume of Web content. Most of these users are not only consumers of information but also generators of it. People express themselves here in colloquial languages but write in Roman script (transliteration). These texts are mostly informal and casual, and therefore seldom follow grammar rules. Moreover, no prescribed set of spelling rules exists for transliterated text. This freedom leads to large-scale spelling variation, which is a major challenge in mixed-script information processing. This article studies existing phonetic algorithms for handling spelling variation, points out their limitations, and proposes a novel phonetic encoding approach, in two flavors, in the light of Hindi transliteration. Experiments on Hindi song lyrics retrieval in the mixed-script domain with three different retrieval models show that the proposed approaches outperform the existing techniques in a majority of cases (sometimes statistically significantly) on a number of metrics, including three rank-cutoff (@k) measures, MAP, MRR, and Recall.
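To make the spelling-variation problem concrete, the sketch below uses classic American Soundex as a stand-in for the kind of existing phonetic algorithm the abstract refers to. It is not the encoding proposed in the article; the abstract does not name the algorithms studied, so the choice of Soundex, the helper names `soundex` and `expand_query`, and the sample vocabulary are all assumptions made for illustration.

```python
from collections import defaultdict

# Classic American Soundex digit classes (illustrative baseline only,
# not the article's proposed phonetic encoding).
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return the 4-character Soundex code of a Roman-script word."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()        # the first letter is kept literally
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue                    # h and w do not break a run of equal codes
        code = CODES.get(ch, "")        # vowels and y map to "" and reset the run
        if code and code != prev:
            encoded += code
        prev = code
    return (encoded + "000")[:4]        # pad or truncate to four characters


def expand_query(query: str, vocabulary: list[str]) -> list[list[str]]:
    """Expand each query term with vocabulary terms sharing its code (hypothetical helper)."""
    groups = defaultdict(set)
    for term in vocabulary:
        groups[soundex(term)].add(term)
    return [sorted(groups[soundex(t)] | {t}) for t in query.split()]


# Hypothetical transliterated vocabulary with spelling variants.
vocab = ["pyaar", "pyar", "piyar", "zindagi", "jindagi"]
print(expand_query("pyar zindagi", vocab))
```

In this toy vocabulary the three spellings of "pyar" collapse to one code, so the query term is expanded with all of them. "zindagi" and "jindagi" do not match, because Soundex keeps the first letter verbatim; this is one well-known weakness of the algorithm and is shown here only as an example of why a generic phonetic code can miss transliteration variants.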

Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 20, Issue 4
July 2021
419 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3465463

Copyright © 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 20 July 2021
• Accepted: 1 January 2021
• Revised: 1 November 2020
• Received: 1 April 2020

Qualifiers

• research-article
• Refereed