Abstract
Word segmentation is an essential and challenging task in natural language processing, especially for Chinese because of its high linguistic complexity. Existing methods for Chinese word segmentation, including statistical machine learning methods and neural network methods, usually perform well within specific knowledge domains. Given the increasing importance of interdisciplinary and cross-domain studies, a key challenge in cross-domain word segmentation is handling out-of-vocabulary (OOV) words, and existing methods fall short of practical standards on this problem. To this end, we propose a document-level context-aware model that can automatically perceive and identify OOV words from different domains. Our method combines a word-based model and a character-based model and then processes their results with a newly proposed reconstruction model. We evaluate the method through comprehensive experiments on two real-world datasets (e.g., news from different domains). The results demonstrate the superiority of our method over state-of-the-art models in handling texts from different domains. Importantly, in the cross-domain scenario, the proposed method improves OOV word recognition.
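To make the OOV claim concrete, the following is a minimal sketch of OOV recall, a metric commonly reported in Chinese word segmentation evaluations: the fraction of gold-standard words missing from the training vocabulary that a segmenter recovers with exactly the right boundaries. It is an assumption that the paper measures OOV recognition this way; the function names, the toy vocabulary, and the example sentence are all hypothetical and are not taken from the paper.

```python
def spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    out, start = set(), 0
    for w in words:
        out.add((start, start + len(w)))
        start += len(w)
    return out


def oov_recall(gold, pred, vocab):
    """Fraction of gold words absent from `vocab` that `pred` recovers exactly."""
    pred_spans = spans(pred)
    total = hit = 0
    start = 0
    for w in gold:
        end = start + len(w)
        if w not in vocab:  # this gold word is out-of-vocabulary
            total += 1
            if (start, end) in pred_spans:  # both boundaries match
                hit += 1
        start = end
    return hit / total if total else 0.0


# Hypothetical toy data: the training vocabulary lacks the domain term "区块链".
vocab = {"我们", "研究", "技术"}
gold = ["我们", "研究", "区块链", "技术"]
pred_good = ["我们", "研究", "区块链", "技术"]        # OOV word kept whole
pred_bad = ["我们", "研究", "区块", "链", "技术"]     # OOV word split in two
```

In this toy case the first prediction recovers the single OOV word and the second does not, which is the kind of error a cross-domain segmenter has to avoid.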
Index Terms
Domain-Aware Word Segmentation for Chinese Language: A Document-Level Context-Aware Model