Abstract
A feasible and flexible annotation system is designed for joint tokenization and part-of-speech (POS) tagging of languages that lack a natural definition of words. The design is motivated by the fact that word separators are not used in many highly analytic East and Southeast Asian languages. Although several of these languages are well studied, e.g., Chinese and Japanese, many are understudied and low-resourced, e.g., Burmese (Myanmar) and Khmer. In the first part of the article, the proposed annotation system, named nova, is introduced. nova contains only four basic tags (n, v, a, and o); these tags can be further modified and combined to accommodate complex linguistic phenomena in tokenization and POS tagging. In the second part of the article, the feasibility and flexibility of nova are illustrated through annotation practice on Burmese and Khmer. The relation between nova and two universal POS tagsets is discussed in the final part of the article.
NOVA: A Feasible and Flexible Annotation System for Joint Tokenization and Part-of-Speech Tagging