Abstract
This article presents a comprehensive study of two primary tasks in Burmese (Myanmar) morphological analysis: tokenization and part-of-speech (POS) tagging. Twenty thousand Burmese newswire sentences are annotated with two-layer tokenization and POS-tagging information, as one component of the Asian Language Treebank Project. The annotated corpus has been released under a CC BY-NC-SA license, and it was the largest open-access database of annotated Burmese at the time this manuscript was prepared in 2017. The first half of the article describes in detail the preparation, refinement, and features of the annotated corpus. The second half presents experiment-based investigations facilitated by the corpus, in which the standard sequence-labeling approach of conditional random fields and a long short-term memory (LSTM)-based recurrent neural network (RNN) are applied and discussed. We obtained several general conclusions, covering the effect of joint tokenization and POS-tagging and the importance of ensembling for stabilizing the performance of the LSTM-based RNN. This study provides a solid basis for further studies on Burmese processing.
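To make the two ideas named in the abstract more concrete, the following is a minimal sketch of how joint tokenization and POS-tagging can be cast as per-syllable sequence labeling, and how several models' outputs might be combined by majority vote. The B/I label prefixes, the tag names `n` and `v`, the placeholder syllables, and the function names `to_joint_labels` and `majority_vote` are all assumptions made for illustration; they are not taken from the released corpus or from the article's experimental setup.

```python
from collections import Counter


def to_joint_labels(tokens):
    """Turn (syllables, pos) tokens into one joint label per syllable.

    The first syllable of a token gets "B-<pos>" and the rest get "I-<pos>",
    so token boundaries and POS tags are predicted together.
    This B/I scheme is an illustrative assumption.
    """
    labels = []
    for syllables, pos in tokens:
        for i, _ in enumerate(syllables):
            prefix = "B" if i == 0 else "I"
            labels.append(f"{prefix}-{pos}")
    return labels


def majority_vote(predictions):
    """Combine equal-length label sequences from several models.

    For each position, the most frequent label across models is kept.
    """
    voted = []
    for position_labels in zip(*predictions):
        voted.append(Counter(position_labels).most_common(1)[0][0])
    return voted


# Hypothetical sentence: a two-syllable noun followed by a one-syllable verb.
sentence = [(["s1", "s2"], "n"), (["s3"], "v")]
gold = to_joint_labels(sentence)

# Hypothetical outputs of three independently trained models.
model_outputs = [
    ["B-n", "I-n", "B-v"],
    ["B-n", "B-v", "B-v"],
    ["B-n", "I-n", "B-v"],
]
ensemble = majority_vote(model_outputs)
```

In this sketch the second model's error on the middle syllable is outvoted by the other two, which is the kind of stabilizing effect an ensemble is meant to provide.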
Towards Burmese (Myanmar) Morphological Analysis: Syllable-based Tokenization and Part-of-speech Tagging