Target-Oriented Knowledge Distillation with Language-Family-Based Grouping for Multilingual NMT

Abstract
Multilingual neural machine translation (NMT) has developed rapidly, but it still suffers from performance degradation caused by linguistic diversity and model capacity constraints. To achieve competitive accuracy despite these limitations, knowledge distillation, which improves a student network by matching the teacher network’s output, has been applied to multilingual NMT and has shown gains by focusing on the important parts of the teacher distribution. However, existing knowledge distillation methods for multilingual NMT rarely consider this distilled knowledge, which serves an important function as the student model’s target, during training. In this article, we propose two distillation strategies that use the knowledge effectively to improve the accuracy of multilingual NMT. First, we introduce a language-family-based approach that guides the selection of appropriate knowledge for each language pair. By distilling the knowledge of multilingual teachers, each of which handles a group of languages classified by language family, the multilingual model overcomes the accuracy degradation caused by linguistic diversity. Second, we propose target-oriented knowledge distillation, which focuses intensively on the ground-truth target within the knowledge through a penalty strategy. Our method yields a sensible distillation by penalizing samples whose distilled knowledge lacks the actual target, while additionally emphasizing the ground-truth targets. Experiments on TED Talks datasets demonstrate the effectiveness of our method through BLEU score improvements. Analyses of the distilled knowledge and further observations of the methods also validate our results.
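The abstract describes the two strategies only at a high level, so the PyTorch fragment below is a minimal illustrative sketch, not the paper’s published implementation. In particular, the family mapping FAMILY_OF, the use of the teacher’s top-k distribution as “the knowledge,” and the names select_teacher, target_oriented_kd_loss, penalty_weight, and topk are all assumptions introduced here for illustration.

import torch
import torch.nn.functional as F

# Assumed language-to-family routing: one pretrained teacher per family group.
FAMILY_OF = {
    "fr": "romance", "es": "romance", "it": "romance",
    "de": "germanic", "nl": "germanic",
}
TEACHERS = {}  # family name -> pretrained family-level teacher model (assumed)

def select_teacher(lang: str):
    # Route each language pair to the teacher trained on its language family.
    return TEACHERS[FAMILY_OF[lang]]

def target_oriented_kd_loss(student_logits, teacher_logits, gold_ids,
                            temperature=1.0, penalty_weight=1.0, topk=8):
    # student_logits, teacher_logits: (batch, seq_len, vocab)
    # gold_ids: (batch, seq_len) ground-truth target token ids
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)

    # Standard token-level distillation term: KL(teacher || student).
    kd = F.kl_div(s_logp, t_probs, reduction="none").sum(-1)        # (B, T)

    # Does the teacher's top-k ("the knowledge") contain the gold token?
    topk_ids = t_probs.topk(topk, dim=-1).indices                   # (B, T, k)
    gold_in_topk = (topk_ids == gold_ids.unsqueeze(-1)).any(-1)     # (B, T)

    # Extra emphasis on the ground-truth target (a cross-entropy term).
    target_term = -s_logp.gather(-1, gold_ids.unsqueeze(-1)).squeeze(-1)

    # Penalty: upweight positions where the knowledge misses the actual target.
    weights = torch.where(gold_in_topk,
                          torch.ones_like(kd),
                          torch.full_like(kd, 1.0 + penalty_weight))
    return (kd + weights * target_term).mean()

Under these assumptions, the penalty term realizes the abstract’s idea of “penalizing samples without actual targets”: positions where the distilled distribution omits the ground-truth token contribute a larger ground-truth loss, while all positions retain the ordinary distillation signal.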