Target-Oriented Knowledge Distillation with Language-Family-Based Grouping for Multilingual NMT

Published: 23 March 2023

Abstract

Multilingual NMT has developed rapidly, but its performance still degrades under linguistic diversity and limited model capacity. To reach competitive translation accuracy despite these limitations, knowledge distillation, which improves a student network by matching the teacher network's output, has been applied and has shown gains by focusing on the important parts of the teacher distribution. However, existing knowledge distillation methods for multilingual NMT rarely account for the distilled knowledge itself, which serves as the student model's training target. In this article, we propose two distillation strategies that use this knowledge effectively to improve the accuracy of multilingual NMT. First, we introduce a language-family-based approach that guides the selection of appropriate knowledge for each language pair: by distilling from multilingual teachers that each handle a group of languages classified by language family, the multilingual model overcomes the accuracy degradation caused by linguistic diversity. Second, we propose target-oriented knowledge distillation, which concentrates on the ground-truth target within the distilled knowledge through a penalty strategy; it penalizes samples that miss the actual target while additionally emphasizing the ground-truth targets. Experiments on TED Talks datasets demonstrate the effectiveness of our method with BLEU score improvements. Analyses of the distilled knowledge and further observations of the methods also validate our results.
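The abstract describes the language-family-based grouping only at a high level. As a rough illustration of the idea, the sketch below maps target languages to families and selects the corresponding group teacher for each language pair; the family assignments, language codes, and the names `LANGUAGE_FAMILIES` and `teacher_for_pair` are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the paper's code) of selecting a family-specific teacher
# for each language pair; the family labels here are illustrative examples only.
LANGUAGE_FAMILIES = {
    "de": "germanic", "nl": "germanic",
    "fr": "romance", "it": "romance", "es": "romance",
    "ru": "slavic", "pl": "slavic",
}

def teacher_for_pair(tgt_lang, family_teachers):
    """Return the multilingual teacher trained on the target language's family group."""
    family = LANGUAGE_FAMILIES[tgt_lang]   # group languages by linguistic family
    return family_teachers[family]         # one group teacher per family
```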

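Likewise, the target-oriented distillation loss is not spelled out in the abstract. The following is a minimal, hypothetical PyTorch sketch of how word-level distillation could be combined with an extra term that emphasizes the ground-truth target; it is a crude stand-in for the paper's penalty strategy, and all names and weightings (`target_oriented_kd_loss`, `alpha`, `beta`, `temperature`) are assumptions rather than the actual formulation.

```python
# Hypothetical sketch of a target-oriented word-level KD loss for NMT.
# NOT the paper's exact formulation; it only illustrates combining standard KD
# with an additional term focused on the ground-truth target token.
import torch.nn.functional as F

def target_oriented_kd_loss(student_logits, teacher_logits, gold_ids,
                            alpha=0.5, beta=0.1, temperature=1.0, pad_id=0):
    """student_logits, teacher_logits: (batch, len, vocab); gold_ids: (batch, len)."""
    mask = (gold_ids != pad_id).float()  # ignore padding positions

    # Ordinary cross-entropy against the ground-truth references.
    ce = F.cross_entropy(student_logits.transpose(1, 2), gold_ids, reduction="none")

    # Word-level distillation: match the teacher's softened output distribution.
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    kd = -(t_probs * s_logp).sum(dim=-1)

    # Target-oriented term: the student's log-probability of the gold token,
    # weighted by the teacher's probability on that token, so positions where
    # the teacher's distribution contains the actual target are emphasized.
    s_logp_gold = s_logp.gather(-1, gold_ids.unsqueeze(-1)).squeeze(-1)
    t_prob_gold = t_probs.gather(-1, gold_ids.unsqueeze(-1)).squeeze(-1)
    target_term = -(t_prob_gold * s_logp_gold)

    loss = (1 - alpha) * ce + alpha * kd + beta * target_term
    return (loss * mask).sum() / mask.sum()
```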


• Published in

  ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 2 (February 2023), 624 pages
  ISSN: 2375-4699
  EISSN: 2375-4702
  DOI: 10.1145/3572719


Publisher

      Association for Computing Machinery, New York, NY, United States

      Publication History

      • Published: 23 March 2023
      • Online AM: 30 June 2022
      • Accepted: 24 June 2022
      • Revised: 20 June 2022
      • Received: 30 August 2021
Published in TALLIP Volume 22, Issue 2
