Abstract
Tibetan word segmentation and POS tagging are the primary tasks of Tibetan natural language processing. Most existing methods for Tibetan word segmentation and POS tagging are based on rules and statistics and require manually constructed features. In addition, joint models have shown stronger capabilities for word segmentation and POS tagging and have received great interest. In this paper, we propose Bi-LSTM+IDCNN+CRF, a simple yet effective end-to-end neural network model, for joint Tibetan word segmentation and POS tagging. We conduct step-by-step and joint experiments on Tibetan datasets. The results demonstrate that the Bi-LSTM+IDCNN+CRF model performs best in both the step-by-step and the joint mode. In the joint tagging mode, we obtain state-of-the-art performance: the F1 score reaches 92.31% on the word segmentation task and 81.26% on the POS tagging task.
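To make the joint mode concrete, the sketch below shows one common way to cast word segmentation and POS tagging as a single sequence-labeling problem: each unit (character or syllable) receives one label that combines a word-boundary position with the POS of its word. The B/M/E/S position scheme, the function names `to_joint_tags` and `from_joint_tags`, the placeholder units `s1`..`s6`, and the POS labels `n` and `v` are all illustrative assumptions; the abstract does not specify the tag scheme actually used.

```python
from typing import List, Tuple

Word = Tuple[List[str], str]   # (units of one word, POS of that word)
Tagged = Tuple[str, str]       # (unit, joint tag such as "B-n")


def to_joint_tags(words: List[Word]) -> List[Tagged]:
    """Flatten segmented, POS-tagged words into per-unit joint tags.

    Assumed scheme: B = begin, M = middle, E = end, S = single-unit word.
    """
    tagged: List[Tagged] = []
    for units, pos in words:
        if len(units) == 1:
            positions = ["S"]
        else:
            positions = ["B"] + ["M"] * (len(units) - 2) + ["E"]
        for unit, position in zip(units, positions):
            tagged.append((unit, f"{position}-{pos}"))
    return tagged


def from_joint_tags(tagged: List[Tagged]) -> List[Word]:
    """Recover words and their POS from a well-formed joint tag sequence.

    A word is closed whenever an E or S tag is seen; units left over
    after the last E or S tag are dropped.
    """
    words: List[Word] = []
    current: List[str] = []
    for unit, tag in tagged:
        position, pos = tag.split("-", 1)
        current.append(unit)
        if position in ("E", "S"):
            words.append((current, pos))
            current = []
    return words


# Hypothetical sentence: three words made of placeholder units.
sentence = [(["s1", "s2"], "n"), (["s3"], "v"), (["s4", "s5", "s6"], "n")]
joint = to_joint_tags(sentence)
```

Under this encoding a single tagger, such as the Bi-LSTM+IDCNN+CRF model described above, predicts one label per unit, and both the segmentation and the POS sequence are read off the same output with `from_joint_tags`.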
Index Terms
Character-based Joint Word Segmentation and Part-of-Speech Tagging for Tibetan Based on Deep Learning