
Improving Multilingual Neural Machine Translation System for Indic Languages

Published: 16 June 2023

Abstract

The Machine Translation System (MTS) serves as an effective tool for communication by translating text or speech from one language to another. Recently, neural machine translation (NMT) has become popular for its performance and cost-effectiveness. However, NMT systems are limited in translating low-resource languages, as a huge quantity of data is required to learn useful mappings across languages. The need for an efficient translation system becomes obvious in a large multilingual environment such as India. Indian languages (ILs) are still treated as low-resource languages due to the unavailability of corpora. To address this asymmetry, the multilingual neural machine translation (MNMT) system emerges as an ideal approach. MNMT translates many languages using a single model, which is extremely useful for simplifying the training process and lowering online maintenance costs; it is also helpful for improving low-resource translation. In this article, we propose an MNMT system to address the issues related to low-resource language translation. Our model comprises two MNMT systems, i.e., English-Indic (one-to-many) and Indic-English (many-to-one), with a shared encoder-decoder covering 15 language pairs (30 translation directions). Since most IL pairs have a scant amount of parallel corpora, insufficient for training any machine translation model, we explore various augmentation strategies to improve overall translation quality through the proposed model. A state-of-the-art transformer architecture is used to realize the proposed model. In addition, the article addresses the use of language relationships (in terms of dialect, script, etc.), particularly the role of high-resource languages of the same family in boosting the performance of low-resource languages.
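As a concrete illustration of how one shared encoder-decoder can serve many translation directions, a common convention (popularized by Google's multilingual NMT work) is to prepend a target-language token to each source sentence, so the model itself learns to route the direction. The sketch below assumes that convention; the `<2xx>` token format and the `add_target_tag` helper are illustrative, not necessarily the paper's actual preprocessing.

```python
def add_target_tag(src_sentence: str, tgt_lang: str) -> str:
    """Prepend a target-language token so a single shared model can learn
    the translation direction (one-to-many / many-to-one setups).
    The <2xx> format is a hypothetical choice for illustration."""
    return f"<2{tgt_lang}> {src_sentence}"


# Mixed-direction training data for one shared model:
pairs = [
    ("I am reading a book.", "hi"),   # English -> Hindi direction
    ("A letter was written.", "bn"),  # English -> Bengali direction
]
tagged = [add_target_tag(src, tgt) for src, tgt in pairs]
# tagged[0] == "<2hi> I am reading a book."
```

With this tagging, one-to-many and many-to-one corpora can simply be concatenated and shuffled before training a single transformer.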
Moreover, the experimental results show the advantage of back-translation and domain adaptation for ILs in enhancing the translation quality of both source and target languages. Using all these key approaches, the proposed model proves more efficient than the baseline model in terms of the BLEU (BiLingual Evaluation Understudy) score for a set of ILs.
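For reference, the BLEU metric named above combines modified n-gram precision with a brevity penalty. The following is a minimal, self-contained sentence-level sketch; real evaluations typically use corpus-level tooling such as sacreBLEU, and the zero-precision floor used here is an illustrative smoothing choice, not the standard one.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Sentence-level BLEU with uniform n-gram weights and brevity penalty.
    A tiny floor (1e-9) stands in for proper smoothing when an n-gram
    order has no matches."""
    cand, ref = candidate.split(), reference.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        # Clipped (modified) n-gram counts:
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        prec = overlap / total if overlap > 0 else 1e-9
        log_prec_sum += math.log(prec) / max_n
    # Brevity penalty discourages overly short candidates:
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec_sum)
```

A perfect match scores 1.0, while a short or partially matching hypothesis is penalized by both the clipped precisions and the brevity penalty.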



Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 6 (June 2023), 635 pages.
ISSN: 2375-4699; EISSN: 2375-4702
DOI: 10.1145/3604597


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 14 November 2022
• Accepted: 2 March 2023
• Online AM: 14 March 2023
• Published: 16 June 2023

