Abstract
A machine translation system (MTS) serves as an effective tool for communication by translating text or speech from one language to another. Recently, neural machine translation (NMT) has become popular for its performance and cost-effectiveness. However, NMT systems are limited in translating low-resource languages because a large quantity of data is required to learn useful mappings across languages. The need for an efficient translation system is evident in a large multilingual environment like India. Indian languages (ILs) are still treated as low-resource languages because of the unavailability of corpora. To address this asymmetry, multilingual neural machine translation (MNMT) has emerged as an ideal approach. An MNMT system translates many languages using a single model, which simplifies training and lowers online maintenance costs. It also helps improve low-resource translation. In this article, we propose an MNMT system to address the issues related to low-resource language translation. Our model comprises two MNMT systems, one for English-Indic (one-to-many) and one for Indic-English (many-to-one), with a shared encoder-decoder covering 15 language pairs (30 translation directions). Since most IL pairs have only a scant amount of parallel corpora, insufficient for training a machine translation model, we explore various augmentation strategies to improve overall translation quality with the proposed model. A state-of-the-art transformer architecture is used to realize the proposed model. In addition, the article examines the use of language relationships (in terms of dialect, script, etc.), particularly the role of high-resource languages of the same family in boosting the performance of low-resource languages.
Moreover, the experimental results show the advantage of back-translation and domain adaptation for ILs in enhancing the translation quality of both source and target languages. Using all these key approaches, our proposed model proves more efficient than the baseline model in terms of the evaluation metric, i.e., the BLEU (BiLingual Evaluation Understudy) score, for a set of ILs.
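The one-to-many and many-to-one setup described above can be illustrated with a minimal sketch. It assumes a target-language token is prepended to each source sentence, which is one common way to steer a shared encoder-decoder; the abstract does not state that the proposed model does this. The function names, the `<2xx>` token format, and the language codes are all hypothetical.

```python
def tag_source(sentence: str, target_lang: str) -> str:
    """Prefix a source sentence with a target-language token, e.g. <2hi>.

    Assumption: the token format is illustrative, not taken from the article.
    """
    return f"<2{target_lang}> {sentence}"


def build_directions(indic_langs):
    """Return the (source, target) directions for the two systems.

    English-Indic is one-to-many; Indic-English is many-to-one.
    """
    one_to_many = [("en", lang) for lang in indic_langs]
    many_to_one = [(lang, "en") for lang in indic_langs]
    return one_to_many, many_to_one


# Illustrative codes only; the abstract does not list the 15 languages.
indic_langs = ["hi", "bn", "ta"]
one_to_many, many_to_one = build_directions(indic_langs)

# Each language pair contributes two translation directions.
total_directions = len(one_to_many) + len(many_to_one)
tagged = tag_source("How are you?", "hi")
```

With 15 Indic languages in place of the three placeholders, the same construction yields the 30 translation directions mentioned in the abstract.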
Improving Multilingual Neural Machine Translation System for Indic Languages