NOVA: A Feasible and Flexible Annotation System for Joint Tokenization and Part-of-Speech Tagging

Published: 17 December 2018

Abstract

A feasible and flexible annotation system is designed for joint tokenization and part-of-speech (POS) tagging, to annotate languages that lack a natural definition of words. The design is motivated by the fact that many highly analytic East and Southeast Asian languages do not use word separators. Although several of these languages are well studied, e.g., Chinese and Japanese, many are understudied and low-resourced, e.g., Burmese (Myanmar) and Khmer. The first part of the article introduces the proposed annotation system, named nova. nova contains only four basic tags (n, v, a, and o); these tags can be further modified and combined to adapt to complex linguistic phenomena in tokenization and POS tagging. The second part of the article illustrates the feasibility and flexibility of nova through annotation practice on Burmese and Khmer. The final part of the article discusses the relation between nova and two universal POS tagsets.


Published in

ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), Volume 18, Issue 2
June 2019
208 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3300146

Copyright © 2018 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 17 December 2018
• Accepted: 1 September 2018
• Revised: 1 August 2018
• Received: 1 May 2016

Qualifiers

• research-article
• Research
• Refereed
