
Using Sub-character Level Information for Neural Machine Translation of Logographic Languages

Published: 15 April 2021

Abstract

Logographic and alphabetic languages (e.g., Chinese vs. English) use linguistically different writing systems. Languages that belong to the same writing system usually share more information, which can be exploited to facilitate natural language processing tasks such as neural machine translation (NMT). This article takes advantage of the logographic characters in Chinese and Japanese by decomposing them into smaller units, thereby making better use of the information these characters share during both the encoding and decoding processes of NMT training. Experiments show that the proposed method robustly improves NMT performance for both a “logographic” language pair (JA–ZH) and “logographic + alphabetic” language pairs (JA–EN and ZH–EN), in both supervised and unsupervised NMT scenarios. Moreover, because the decomposed sequences are usually very long, extra position features for the Transformer encoder help with modeling these long sequences. The results also indicate that, in principle, linguistic features can be manipulated to obtain higher shared token rates and thereby further improve the performance of natural language processing systems.
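The core idea described above can be sketched in a few lines. The snippet below is a toy illustration, not the authors' implementation: it uses a tiny hypothetical decomposition table (a real system would draw on an ideograph/stroke database covering the full CJK range) to show how decomposing logographic characters into sub-character units can raise the rate of tokens shared between the vocabularies of two languages.

```python
# Toy decomposition table: character -> sub-character unit sequence.
# These two entries are illustrative; real systems use a full
# ideographic-description or stroke database for CJK characters.
DECOMP = {
    "語": ["言", "吾"],  # Japanese kanji
    "语": ["讠", "吾"],  # simplified Chinese hanzi
}

def decompose(sentence):
    """Replace each character by its sub-character units (identity fallback)."""
    units = []
    for ch in sentence:
        units.extend(DECOMP.get(ch, [ch]))
    return units

def shared_token_rate(vocab_a, vocab_b):
    """Fraction of the combined vocabulary that both sides share."""
    union = vocab_a | vocab_b
    return len(vocab_a & vocab_b) / len(union) if union else 0.0

# Character level: 語 and 语 look unrelated, so only 言 is shared.
ja, zh = "言語", "语言"
char_rate = shared_token_rate(set(ja), set(zh))
# Sub-character level: both decompositions contain 吾, raising the overlap.
sub_rate = shared_token_rate(set(decompose(ja)), set(decompose(zh)))
```

Here the shared rate rises from 1/3 at the character level to 2/3 at the sub-character level, which mirrors the paper's motivation for training NMT on decomposed sequences.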



• Published in

  ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 20, Issue 2
  March 2021, 313 pages
  ISSN: 2375-4699
  EISSN: 2375-4702
  DOI: 10.1145/3454116

Copyright © 2021 held by the owner/author(s).

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Accepted: 1 October 2020
• Revised: 1 September 2020
• Received: 1 November 2019


        Qualifiers

        • research-article
        • Refereed
