Character-based Joint Word Segmentation and Part-of-Speech Tagging for Tibetan Based on Deep Learning

Published: 23 November 2022

Abstract

Tibetan word segmentation and part-of-speech (POS) tagging are fundamental tasks in Tibetan natural language processing. Most existing methods for Tibetan word segmentation and POS tagging are based on rules and statistics and require manually constructed features. In addition, joint models have shown stronger capabilities for word segmentation and POS tagging and have attracted great interest. In this paper, we propose a Bi-LSTM+IDCNN+CRF structure, a simple yet effective end-to-end neural network model, for joint Tibetan word segmentation and POS tagging. We conduct step-by-step and joint experiments on Tibetan datasets. The results demonstrate that the Bi-LSTM+IDCNN+CRF model performs best in both the step-by-step and the joint mode. We obtain state-of-the-art performance in the joint tagging mode: the F1 score of the word segmentation task reaches 92.31%, and the F1 score of the POS tagging task reaches 81.26%.
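The abstract does not spell out the joint label set or the scoring procedure, so the following is only a minimal sketch of how character-based joint tagging is commonly framed. It assumes a BMES boundary scheme fused with the POS tag (for example `B-n`) and span-level F1 in which a word counts for POS only when both its boundaries and its tag match. The function names, the tag set, and the placeholder units `u1` to `u5` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, pos), end exclusive


def encode_joint(words: List[Tuple[List[str], str]]) -> List[Tuple[str, str]]:
    """Turn (units, pos) words into per-unit joint labels such as 'B-n'.

    Assumed scheme: B/M/E for multi-unit words, S for single-unit words.
    """
    labelled = []
    for units, pos in words:
        last = len(units) - 1
        for i, unit in enumerate(units):
            if last == 0:
                boundary = "S"
            elif i == 0:
                boundary = "B"
            elif i == last:
                boundary = "E"
            else:
                boundary = "M"
            labelled.append((unit, f"{boundary}-{pos}"))
    return labelled


def decode_spans(labels: List[str]) -> List[Span]:
    """Recover word spans and POS tags from a joint label sequence."""
    spans = []
    start = 0
    for i, label in enumerate(labels):
        boundary, pos = label.split("-", 1)
        if boundary in ("B", "S"):
            start = i
        if boundary in ("E", "S"):
            spans.append((start, i + 1, pos))
    return spans


def span_f1(gold: List[Span], pred: List[Span], with_pos: bool) -> float:
    """Span-level F1; with_pos=False scores segmentation boundaries only."""
    def key(span: Span):
        return span if with_pos else span[:2]

    gold_set = {key(s) for s in gold}
    pred_set = {key(s) for s in pred}
    true_pos = len(gold_set & pred_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Placeholder sentence: three words over five units (not real Tibetan data).
gold_words = [(["u1", "u2"], "n"), (["u3"], "v"), (["u4", "u5"], "n")]
gold_labels = [label for _, label in encode_joint(gold_words)]
# A hypothetical prediction with correct boundaries but one wrong POS tag.
pred_labels = ["B-n", "E-n", "S-n", "B-n", "E-n"]

gold_spans = decode_spans(gold_labels)
pred_spans = decode_spans(pred_labels)
seg_f1 = span_f1(gold_spans, pred_spans, with_pos=False)
pos_f1 = span_f1(gold_spans, pred_spans, with_pos=True)
```

Under this assumed metric the POS score can never exceed the segmentation score, because every correctly tagged word must first be correctly segmented. That is consistent with the gap between the two F1 figures reported above, although the paper's exact evaluation protocol may differ.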


    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 5
      September 2022
      486 pages
      ISSN: 2375-4699
      EISSN: 2375-4702
      DOI: 10.1145/3533669

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 23 November 2022
      • Online AM: 31 August 2022
      • Accepted: 12 January 2022
      • Revised: 10 January 2022
      • Received: 29 December 2020

      Qualifiers

      • research-article
      • Refereed
