Abstract
A 20,000-sentence Burmese (Myanmar) treebank of news articles has been released under a CC BY-NC-SA license. Complete phrase structure annotation was developed for each sentence from the morphologically annotated data prepared in the previous work of Ding et al. [1]. As the final result of the Burmese component of the Asian Language Treebank Project, this is the first large-scale, open-access treebank for the Burmese language. The annotation details and features of this treebank are presented.
References
- Chenchen Ding, Hnin Thu Zar Aye, Win Pa Pa, Khin Thandar Nwet, Khin Mar Soe, Masao Utiyama, and Eiichiro Sumita. 2019. Towards Burmese (Myanmar) morphological analysis: Syllable-based tokenization and part-of-speech tagging. ACM Trans. Asian Low-Resource Lang. Inf. Process. 19, 1 (2019), 5.
- Chenchen Ding, Masao Utiyama, and Eiichiro Sumita. 2018. NOVA: A feasible and flexible annotation system for joint tokenization and part-of-speech tagging. ACM Trans. Asian Low-Resource Lang. Inf. Process. 18, 2 (2018), 17.
- Chenchen Ding, Masao Utiyama, and Eiichiro Sumita. 2019. Burmese (Myanmar) Treebank of Asian Language Treebank Project. Retrieved from DOI:https://doi.org/10.5281/zenodo.3463010
- Chenchen Ding, Ye Kyaw Thu, Masao Utiyama, and Eiichiro Sumita. 2016. Word segmentation for Burmese (Myanmar). ACM Trans. Asian Low-Resource Lang. Inf. Process. 15, 4 (2016), 22.
- Daisuke Kawahara, Sadao Kurohashi, and Kôiti Hasida. 2002. Construction of a Japanese relevance-tagged corpus. In Proceedings of the Annual Language Resources and Evaluation Conference (LREC’02). 2008–2013.
- Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Comput. Linguist. 19, 2 (1993), 313–330.
- Toshiaki Nakazawa, Katsuhito Sudoh, Shohei Higashiyama, Chenchen Ding, Raj Dabre, Hideya Mino, Isao Goto, Win Pa Pa, Anoop Kunchukuttan, and Sadao Kurohashi. 2018. Overview of the 5th workshop on Asian translation. In Proceedings of the 5th Workshop on Asian Translation (WAT’18). 1–41.
- John Okell and Anna Allott. 2001. Burmese/Myanmar Dictionary of Grammatical Forms. Routledge.
- Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the Annual Conference of the Association for Computational Linguistics (ACL’06). 433–440.
- Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proceedings of the Annual Language Resources and Evaluation Conference (LREC’12). 2089–2096.
- Hammam Riza, Michael Purwoadi, Teduh Uliniansyah, Aw Ai Ti, Sharifah Mahani Aljunied, Luong Chi Mai, Vu Tat Thang, Nguyen Phuong Thai, Vichet Chea, Rapid Sun, Sethserey Sam, Sopheap Seng, Khin Mar Soe, Khin Thandar Nwet, Masao Utiyama, and Chenchen Ding. 2016. Introduction of the Asian language treebank. In Proceedings of the Oriental International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques Conference (O-COCOSDA’16). 1–6.
- Sann Su Su Yee, Chenchen Ding, Khin Mar Soe, Masao Utiyama, and Eiichiro Sumita. 2019. Modifying NOVA-annotated Myanmar data to universal part-of-speech tagset. In Proceedings of the International Conference on Computing Advancements (ICCA’19). 230–237.
- Soe Lai Phyue and Aye Thida. 2013. Unknown word detection via syntax analyze. IAES Int. J. Artif. Intell. 2, 3 (2013), 107–116.
- Win Win Thant, Tin Myat Htwe, and Ni Lar Thein. 2012. Parsing of Myanmar sentences with function tagging. Int. J. Nat. Lang. Comput. 1, 1 (2012), 9–27.
- Naiwen Xue, Fei Xia, Fu-Dong Chiou, and Marta Palmer. 2005. The Penn Chinese TreeBank: Phrase structure annotation of a large corpus. Nat. Lang. Eng. 11, 2 (2005), 207–238.
A Burmese (Myanmar) Treebank: Guideline and Analysis