Abstract
This article presents a pipeline that converts collections of Tibetan documents in plain text or XML into a fully segmented and POS-tagged corpus. We apply the pipeline to the large extant collection of the Buddhist Digital Resource Center. The semi-supervised methods presented here not only result in a new and improved version of the largest annotated Tibetan corpus to date; their integration of rule-based, memory-based, and neural-network methods also serves as a good example of how to overcome the challenges of under-researched languages. At 91.99%, the end-to-end accuracy of our entire automatic pipeline is high enough to make the resulting corpus a useful resource for both linguists and scholars of Tibetan studies.
Optimisation of the Largest Annotated Tibetan Corpus Combining Rule-based, Memory-based, and Deep-learning Methods