research-article

Normalization of Transliterated Mongolian Words Using Seq2Seq Model with Limited Data

Published: 01 September 2021

Abstract

The huge increase in social media use in recent years has produced new forms of social interaction, changing our daily lives. Growing contact between people from different cultures, driven by globalization, has also increased the use of the Latin alphabet, and as a result a large amount of transliterated text now appears on social media. In this study, we propose a variety of character-level sequence-to-sequence (seq2seq) models for normalizing noisy transliterated text written in Latin script into Mongolian Cyrillic script in scenarios where only a limited amount of training data is available. We applied performance-enhancement methods, including various beam search strategies, N-gram-based context adoption, edit-distance-based correction, and dictionary-based checking, in novel ways to two basic seq2seq models. We experimentally evaluated these two basic models as well as fourteen enhanced seq2seq models, and compared their noisy-text normalization performance with that of a transliteration model and a conventional statistical machine translation (SMT) model. The proposed seq2seq models improved the robustness of the basic seq2seq models when normalizing out-of-vocabulary (OOV) words, and most of our models achieved higher normalization performance than the conventional method. On the test data in our text normalization experiment, our proposed method that checks each hypothesis during inference achieved the lowest word error rate (WER = 13.41%), 4.51 points lower than that of the conventional SMT method.
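Two of the enhancement methods named above, edit-distance-based correction and dictionary-based checking, are standard post-processing techniques. The sketch below illustrates the general idea, not the authors' implementation: an output hypothesis that is missing from a vocabulary is replaced by the closest in-vocabulary word, provided it lies within a maximum Levenshtein distance. The function names, the `max_dist` threshold, and the toy dictionary are all illustrative assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, via dynamic programming
    over a rolling row of the edit matrix."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def correct(hypothesis: str, vocabulary: set[str], max_dist: int = 2) -> str:
    """Dictionary-based check plus edit-distance correction: keep an
    in-vocabulary hypothesis as-is; otherwise substitute the closest
    vocabulary word within max_dist edits, or keep the hypothesis."""
    if hypothesis in vocabulary:
        return hypothesis
    best, best_d = hypothesis, max_dist + 1
    for word in vocabulary:
        d = edit_distance(hypothesis, word)
        if d < best_d:
            best, best_d = word, d
    return best
```

In practice such a check would run over each decoded word (or, as in the hypothesis-checking variant described above, over each beam hypothesis during inference) against a Cyrillic word list; a linear scan like this is only practical for modest vocabularies.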

