Abstract
This article proposes a unified, character-based, generative model that incorporates additional resources to address the out-of-vocabulary (OOV) problem in Chinese word segmentation. Within this model, different types of additional information can be utilized independently in corresponding submodels. The article mainly addresses three types of OOV words: unseen dictionary words, named entities, and suffix-derived words, none of which are handled well by current approaches. The results show that our approach effectively improves performance on the first two types, with positive interaction in F-score. We also analyze why suffix information is not helpful. After integrating the proposed generative model with the corresponding discriminative approach, our evaluation on various corpora (including SIGHAN-2005, CIPS-SIGHAN-2010, and the Chinese Treebank (CTB)) shows that the integrated approach achieves the best performance reported in the literature on all test sets when additional information and resources are allowed.
A Unified Model for Solving the OOV Problem of Chinese Word Segmentation