Abstract
As a highly analytic language, Khmer presents considerable ambiguities in tokenization and part-of-speech (POS) tagging. This study investigates both tasks. Specifically, a 20,000-sentence Khmer corpus with manual tokenization and POS-tagging annotation is released, the result of work carried out over the last 4 years. It is the largest morphologically annotated Khmer dataset as of 2020, when this article was prepared. Based on the annotated data, experiments were conducted to establish a comprehensive benchmark for automatic tokenization and POS tagging of Khmer. Four approaches were investigated and discussed: a support vector machine, a conditional random field (CRF), a long short-term memory (LSTM)-based recurrent neural network, and an integrated LSTM-CRF model. The primary conclusion is that processing at the morpheme level is satisfactory for the provided data. However, it is intrinsically difficult to identify further grammatical constituents of compounds or phrases because of the complex analytic features of the language. Syntactic annotation and automatic parsing for Khmer are planned for the near future.
Towards Tokenization and Part-of-Speech Tagging for Khmer: Data and Discussion