research-article
Open Access

Towards Tokenization and Part-of-Speech Tagging for Khmer: Data and Discussion

Published: 13 September 2021

Abstract

As a highly analytic language, Khmer presents considerable ambiguity in tokenization and part-of-speech (POS) tagging, which this study investigates. Specifically, a 20,000-sentence Khmer corpus with manual tokenization and POS-tagging annotation is released, the result of work carried out over the last 4 years. This is the largest morphologically annotated Khmer dataset as of 2020, when this article was prepared. Based on the annotated data, experiments were conducted to establish a comprehensive benchmark for automatic tokenization and POS tagging of Khmer. In particular, a support vector machine, a conditional random field (CRF), a long short-term memory (LSTM)-based recurrent neural network, and an integrated LSTM-CRF model were investigated and discussed. The primary conclusion is that morpheme-level processing is satisfactory on the provided data. However, it is intrinsically difficult to identify further grammatical constituents of compounds or phrases because of the complex analytic features of the language. Syntactic annotation and automatic parsing for Khmer are planned for the near future.

    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 20, Issue 6
      November 2021
      439 pages
      ISSN:2375-4699
      EISSN:2375-4702
      DOI:10.1145/3476127

      Copyright © 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 13 September 2021
      • Accepted: 1 April 2021
      • Revised: 1 February 2021
      • Received: 1 July 2020

      Qualifiers

      • research-article
      • Refereed
