research-article

Filtering and Extended Vocabulary based Translation for Low-resource Language Pair of Sanskrit-Hindi

Published: 12 April 2023

Abstract

Neural Machine Translation (NMT) is widely employed for language translation because it outperforms conventional statistical and phrase-based approaches. However, NMT poses its own challenges: it requires a large, clean corpus of parallel data, handles rare words poorly, and must be fast enough for real-time applications. Little NMT work has addressed Sanskrit, one of the oldest and richest languages known to the world, whose morphological richness and limited multilingual parallel corpora make translation especially difficult. Comparable data between Sanskrit and other languages is scarce; hence, no application exists so far that can translate Sanskrit to or from other languages. This study presents an in-depth analysis of these challenges using the low-resource Sanskrit-Hindi language pair. We employ novel training-corpus filtering with an extended vocabulary in a zero-shot transformer architecture. The structure of the Sanskrit language is thoroughly investigated to justify each step. Furthermore, the proposed method is analyzed under variations in sentence length and is also applied to a high-resource language pair to demonstrate its efficacy.
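The abstract names training-corpus filtering but does not specify the criteria used. As a generic illustration only (not the paper's actual method), parallel-corpus cleaning commonly combines exact-duplicate removal, length bounds, and a source/target length-ratio check to catch misaligned pairs. The function name and thresholds below are hypothetical:

```python
def filter_parallel_corpus(pairs, min_len=1, max_len=80, max_ratio=3.0):
    """Generic heuristics for cleaning a parallel corpus:
    drop empty/overlong sentences, pairs with extreme length
    ratios (a frequent sign of misalignment), and exact duplicates."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        src_toks, tgt_toks = src.split(), tgt.split()
        if not (min_len <= len(src_toks) <= max_len):
            continue
        if not (min_len <= len(tgt_toks) <= max_len):
            continue
        ratio = max(len(src_toks), len(tgt_toks)) / max(1, min(len(src_toks), len(tgt_toks)))
        if ratio > max_ratio:           # likely misaligned pair
            continue
        if (src, tgt) in seen:          # exact-duplicate removal
            continue
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept


corpus = [
    ("rama gacchati", "ram jata hai"),
    ("rama gacchati", "ram jata hai"),           # duplicate, dropped
    ("a", "one two three four five six seven"),  # ratio 7 > 3, dropped
]
print(filter_parallel_corpus(corpus))  # [('rama gacchati', 'ram jata hai')]
```

Real filtering pipelines for low-resource pairs often add language identification and alignment-score thresholds on top of these basic checks.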



• Published in

  ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 4
  April 2023, 682 pages
  ISSN: 2375-4699
  EISSN: 2375-4702
  DOI: 10.1145/3588902

          Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 12 April 2023
          • Online AM: 19 January 2023
          • Accepted: 5 January 2023
          • Revised: 29 November 2022
          • Received: 2 May 2022

