research-article

Domain-Aware Word Segmentation for Chinese Language: A Document-Level Context-Aware Model

Published: 03 November 2021

Abstract

Word segmentation is an essential and challenging task in natural language processing, especially for Chinese because of its high linguistic complexity. Existing methods for Chinese word segmentation, both statistical machine learning methods and neural network methods, usually perform well within specific knowledge domains. As interdisciplinary and cross-domain studies grow in importance, a central challenge in cross-domain word segmentation is handling out-of-vocabulary (OOV) words, and existing methods fall short of practical standards on this problem. To this end, we propose a document-level context-aware model that automatically perceives and identifies OOV words from different domains. Our method jointly applies a word-based model and a character-based model and then processes their results with a newly proposed reconstruction model. We evaluate the method through comprehensive experiments on two real-world datasets (e.g., news from different domains). The results demonstrate the superiority of our method over state-of-the-art models in handling texts from different domains. Importantly, in the cross-domain scenario, our proposed method improves the recognition of OOV words.
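
To make the notion of OOV word recognition concrete, the following minimal Python sketch computes OOV recall, a measure commonly reported in Chinese word segmentation evaluations: the fraction of gold-standard words absent from the training vocabulary that the segmenter recovers with exactly the right boundaries. The abstract does not state which metric the article uses, so this is an illustrative assumption. The function names `word_spans` and `oov_recall`, the toy vocabulary, and the example sentence are all hypothetical and are not taken from the article.

```python
def word_spans(words):
    """Map a segmented sentence to (start, end, word) character spans."""
    spans = []
    start = 0
    for w in words:
        spans.append((start, start + len(w), w))
        start += len(w)
    return spans


def oov_recall(gold_words, pred_words, train_vocab):
    """Fraction of gold OOV words whose exact span appears in the prediction.

    Assumption: a word is OOV if it is not in `train_vocab`.
    Returns None when the gold segmentation contains no OOV word.
    """
    if "".join(gold_words) != "".join(pred_words):
        raise ValueError("gold and predicted segmentations cover different text")
    pred = {(s, e) for s, e, _ in word_spans(pred_words)}
    oov = [(s, e) for s, e, w in word_spans(gold_words) if w not in train_vocab]
    if not oov:
        return None
    hits = sum(1 for span in oov if span in pred)
    return hits / len(oov)


# Hypothetical toy data: "区块链" and "技术" are unseen in training.
train_vocab = {"我", "喜欢", "学习"}
gold = ["我", "喜欢", "区块链", "技术"]
pred = ["我", "喜欢", "区块", "链", "技术"]
print(oov_recall(gold, pred, train_vocab))  # 0.5
```

In this toy case the segmenter splits the unseen word "区块链" incorrectly but recovers "技术", so one of the two OOV words is recognized.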



    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 2
      March 2022
      413 pages
      ISSN: 2375-4699
      EISSN: 2375-4702
      DOI: 10.1145/3494070


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 3 November 2021
      • Accepted: 1 August 2021
      • Revised: 1 May 2021
      • Received: 1 November 2019


      Qualifiers

      • research-article
      • Refereed
