Abstract
Burmese is an isolating language in which the syllable is the smallest unit. Word segmentation methods based on syllable matching have performance that depends on the quality of the underlying syllable segmentation. This article proposes a word segmentation method based on dual-layer conditional random fields (CRFs) that incorporate syllable features. It combines syllable segmentation and word segmentation into one process, thus reducing the impact of syllable segmentation errors on Burmese word segmentation. In the first CRF layer, Burmese characters serve as atomic features and are combined with features derived from the Backus normal form (BNF) description of Burmese syllable structure to produce the syllable sequence tags. In the second CRF layer, the tagged syllables are taken as input, and word sequence tagging is performed with a feature template that uses syllables as atomic features. The experimental results show that the proposed method performs better than the method based on syllable matching.
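The two-layer flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the B/I tag scheme, the one-unit context window, the helper names `window_features` and `merge_by_tags`, and the Latin placeholder characters are all assumptions made for illustration. No CRF is trained here and the BNF-derived features are not reproduced; the tags are supplied by hand to show how the output of the first layer becomes the input of the second.

```python
from typing import Dict, List


def window_features(units: List[str], i: int, size: int = 1) -> Dict[str, str]:
    """Hypothetical feature template: the unit at position i and its
    neighbours within `size` positions, padded at the sequence edges."""
    feats = {}
    for off in range(-size, size + 1):
        j = i + off
        feats[f"u[{off}]"] = units[j] if 0 <= j < len(units) else "<PAD>"
    return feats


def merge_by_tags(units: List[str], tags: List[str]) -> List[str]:
    """Group units into segments from B (begin) / I (inside) tags.
    The B/I scheme is an assumption, not the paper's tag set."""
    segments: List[str] = []
    for unit, tag in zip(units, tags):
        if tag == "B" or not segments:
            segments.append(unit)   # start a new segment
        else:
            segments[-1] += unit    # extend the current segment
    return segments


# Layer 1: characters are the atomic units; tags mark syllable boundaries.
chars = list("abcdef")                      # placeholder characters
char_tags = ["B", "I", "B", "I", "B", "I"]  # hand-written, stands in for CRF output
syllables = merge_by_tags(chars, char_tags)

# Layer 2: syllables are the atomic units; tags mark word boundaries.
syll_feats = [window_features(syllables, i) for i in range(len(syllables))]
syll_tags = ["B", "I", "B"]                 # hand-written, stands in for CRF output
words = merge_by_tags(syllables, syll_tags)

print(syllables)  # ['ab', 'cd', 'ef']
print(words)      # ['abcd', 'ef']
```

In a full system, each layer's feature dictionaries would be passed to a trained CRF, which would predict the tags that are hard-coded above.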
Index Terms
Word Segmentation for Burmese Based on Dual-Layer CRFs