Abstract
The huge increase in social media use in recent years has given rise to new forms of social interaction, changing our daily lives. With growing contact between people from different cultures as a result of globalization, use of the Latin alphabet has also increased, and as a result a large amount of transliterated text appears on social media. In this study, we propose a variety of character-level sequence-to-sequence (seq2seq) models for normalizing noisy, transliterated text written in Latin script into Mongolian Cyrillic script, for scenarios in which only a limited amount of training data is available. We applied performance-enhancement methods, including various beam search strategies, N-gram-based context adoption, edit-distance-based correction, and dictionary-based checking, in novel ways to two basic seq2seq models. We experimentally evaluated these two basic models as well as fourteen enhanced seq2seq models, and compared their noisy-text normalization performance with that of a transliteration model and a conventional statistical machine translation (SMT) model. The proposed enhancements improved the robustness of the basic seq2seq models for normalizing out-of-vocabulary (OOV) words, and most of our models achieved higher normalization performance than the conventional method. On the test data, our proposed method, which checks each hypothesis during inference, achieved the lowest word error rate (WER = 13.41%), with 4.51% fewer errors than the conventional SMT method.
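To make the edit-distance-based correction and dictionary-based checking steps concrete, here is a minimal sketch, not the paper's implementation: a seq2seq hypothesis is accepted if it appears in a target-side Cyrillic dictionary, and otherwise replaced by the nearest dictionary entry within a small Levenshtein distance. The dictionary entries and the `max_dist` threshold below are illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (Levenshtein 1966)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def dictionary_correct(hypothesis: str, dictionary: set, max_dist: int = 2) -> str:
    """Keep an in-dictionary hypothesis; otherwise snap to the nearest
    dictionary word, but only if it is within max_dist edits."""
    if hypothesis in dictionary:
        return hypothesis
    best = min(dictionary, key=lambda w: levenshtein(hypothesis, w))
    return best if levenshtein(hypothesis, best) <= max_dist else hypothesis

# Hypothetical Cyrillic dictionary; "саин" is a noisy form of "сайн" (good).
cyrillic_dict = {"сайн", "байна", "монгол"}
print(dictionary_correct("саин", cyrillic_dict))  # -> сайн
```

In the paper's enhanced models this kind of check is applied per hypothesis during beam-search inference, so low-probability but dictionary-valid candidates can displace noisy ones before the final output is chosen.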
Normalization of Transliterated Mongolian Words Using Seq2Seq Model with Limited Data