Abstract
Neural Machine Translation (NMT) is widely employed for language translation tasks because it performs better than conventional statistical and phrase-based approaches. However, NMT techniques face several challenges: they require a large, clean corpus of parallel data, they handle rare words poorly, and they need to be faster for real-time applications. Further NMT research is needed to address the challenges of translating Sanskrit, one of the oldest and richest languages in the world, which is morphologically rich and has only limited multilingual parallel corpora. Parallel data is usually unavailable for language pairs involving Sanskrit; hence, no application exists so far that can translate Sanskrit to or from other languages. This study presents an in-depth analysis that addresses these challenges using the low-resource Sanskrit-Hindi language pair. We employ a novel training-corpus filtering technique with an extended vocabulary in a zero-shot transformer architecture. The structure of the Sanskrit language is thoroughly investigated to justify each step. Furthermore, the proposed method is analyzed with respect to variations in sentence length and is also applied to a high-resource language pair to demonstrate its efficacy.
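As a rough illustration of what training-corpus filtering can look like, the sketch below applies common length-based heuristics to a list of sentence pairs. It is not the filtering method proposed in this study, whose details are not given in the abstract; the function name `filter_parallel_corpus`, the whitespace tokenization, and all threshold values are assumptions chosen for illustration only.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def filter_parallel_corpus(
    pairs: Iterable[Pair],
    min_tokens: int = 1,
    max_tokens: int = 100,
    max_length_ratio: float = 3.0,
) -> List[Pair]:
    """Keep sentence pairs that pass simple length-based sanity checks.

    Hypothetical helper: thresholds are illustrative, not taken from the paper.
    """
    kept: List[Pair] = []
    seen = set()
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        # Whitespace tokenization is an assumption; a real pipeline would
        # likely use a subword or language-aware tokenizer.
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Drop pairs with an empty side (also avoids division by zero below).
        if n_src == 0 or n_tgt == 0:
            continue
        # Drop pairs where either side is too short or too long.
        if not (min_tokens <= n_src <= max_tokens):
            continue
        if not (min_tokens <= n_tgt <= max_tokens):
            continue
        # Drop pairs whose lengths differ too much (likely misaligned).
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_length_ratio:
            continue
        # Drop exact duplicate pairs, keeping the first occurrence.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept
```

Because Sanskrit compounds and sandhi can pack several Hindi words into one token, a length-ratio threshold for this pair would probably need to be tuned on held-out data rather than fixed in advance.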
Index Terms
Filtering and Extended Vocabulary based Translation for Low-resource Language Pair of Sanskrit-Hindi