Research article (Open Access)

Towards Burmese (Myanmar) Morphological Analysis: Syllable-based Tokenization and Part-of-speech Tagging

Published: 31 May 2019

Abstract

This article presents a comprehensive study on two primary tasks in Burmese (Myanmar) morphological analysis: tokenization and part-of-speech (POS) tagging. Twenty thousand Burmese newswire sentences are annotated with two-layer tokenization and POS-tagging information, as one component of the Asian Language Treebank Project. The annotated corpus has been released under a CC BY-NC-SA license; at the time this manuscript was prepared in 2017, it was the largest open-access database of annotated Burmese. Detailed descriptions of the preparation, refinement, and features of the annotated corpus are provided in the first half of the article. Facilitated by the annotated corpus, experiment-based investigations are presented in the second half, wherein the standard sequence-labeling approach of conditional random fields and a long short-term memory (LSTM)-based recurrent neural network (RNN) are applied and discussed. Several general conclusions are obtained, covering the effect of joint tokenization and POS tagging, and the importance of ensembling for stabilizing the performance of the LSTM-based RNN. This study provides a solid basis for further studies on Burmese processing.
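The joint formulation described above can be illustrated with a minimal sketch (not the authors' code): each syllable receives a composite label such as "B-NOUN" or "I-NOUN", so a single sequence labeler (e.g., a CRF or an LSTM-based RNN) predicts word boundaries and POS tags in one pass. The syllables and tag names below are hypothetical placeholders, not real Burmese data or the paper's tagset.

```python
# Sketch of joint tokenization + POS tagging as one sequence-labeling task:
# every syllable of a word gets "B-TAG" (word-initial) or "I-TAG" (word-internal).

def encode(words):
    """Turn [(syllable_list, pos), ...] into per-syllable composite labels."""
    labels = []
    for syllables, pos in words:
        labels.append(f"B-{pos}")                    # word-initial syllable
        labels.extend(f"I-{pos}" for _ in syllables[1:])  # remaining syllables
    return labels

def decode(syllables, labels):
    """Recover (word, pos) pairs from per-syllable composite labels."""
    words = []
    for syl, label in zip(syllables, labels):
        boundary, pos = label.split("-", 1)
        if boundary == "B" or not words:             # start a new word
            words.append(([syl], pos))
        else:                                        # extend the current word
            words[-1][0].append(syl)
    return [("".join(syls), pos) for syls, pos in words]

# Toy example with placeholder syllables and tags:
sylls = ["ka", "la", "ma", "na"]
labels = encode([(["ka", "la"], "NOUN"), (["ma"], "VERB"), (["na"], "PART")])
print(labels)                  # ['B-NOUN', 'I-NOUN', 'B-VERB', 'B-PART']
print(decode(sylls, labels))   # [('kala', 'NOUN'), ('ma', 'VERB'), ('na', 'PART')]
```

A tagger trained on these composite labels solves both tasks jointly, which is one way the effect of joint modeling noted in the abstract can be evaluated against a pipeline of separate tokenization and tagging steps.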

