Abstract
This note reports and discusses experiments on word segmentation approaches for the Burmese language. Specifically, dictionary-based, statistical, and machine learning approaches are tested. The experimental results demonstrate that statistical and machine learning approaches perform significantly better than dictionary-based approaches. We believe that this note, based on a relatively large annotated corpus (approximately half a million words), is the first systematic comparison of word segmentation approaches for Burmese. This work aims to identify the properties of, and suitable approaches to, Burmese text processing, and to promote further research on this understudied language.
Index Terms
Word Segmentation for Burmese (Myanmar)