A Unified Model for Solving the OOV Problem of Chinese Word Segmentation

Published: 12 June 2015

Abstract

This article proposes a unified, character-based, generative model that incorporates additional resources to solve the out-of-vocabulary (OOV) problem of Chinese word segmentation. Within this model, different types of additional information can be utilized independently in corresponding submodels. The article mainly addresses three types of OOV words: unseen dictionary words, named entities, and suffix-derived words, none of which are handled well by current approaches. The results show that our approach effectively improves performance on the first two types, which also interact positively in F-score. We also analyze why suffix information is not helpful. After integrating the proposed generative model with the corresponding discriminative approach, our evaluation on various corpora, including SIGHAN-2005, CIPS-SIGHAN-2010, and the Chinese Treebank (CTB), shows that the integrated approach achieves the best performance reported in the literature on all test sets when additional information and resources are allowed.
