research-article
Open Access

Optimisation of the Largest Annotated Tibetan Corpus Combining Rule-based, Memory-based, and Deep-learning Methods

Published: 10 March 2021

Abstract

This article presents a pipeline that converts collections of Tibetan documents in plain text or XML into a fully segmented and POS-tagged corpus. We apply the pipeline to the large extant collection of the Buddhist Digital Resource Center. The semi-supervised methods presented here not only result in a new and improved version of the largest annotated Tibetan corpus to date; the integration of rule-based, memory-based, and neural-network methods also serves as a good example of how to overcome the challenges of under-researched languages. The end-to-end accuracy of our entire automatic pipeline, 91.99%, is high enough to make the resulting corpus a useful resource for both linguists and scholars of Tibetan studies.

    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 20, Issue 1
      Special issue on Deep Learning for Low-Resource Natural Language Processing, Part 1 and Regular Papers
      January 2021
      332 pages
      ISSN:2375-4699
      EISSN:2375-4702
      DOI:10.1145/3439335

      Copyright © 2021 Owner/Author

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 10 March 2021
      • Accepted: 1 July 2020
      • Revised: 1 May 2020
      • Received: 1 December 2019

      Qualifiers

      • research-article
      • Research
      • Refereed
