A Seed-Based Method for Generating Chinese Confusion Sets

Published: 22 July 2016

Abstract

In natural language, people often misuse one word (called a “confused word”) in place of other words (called “confusing words”). Many approaches to detecting and correcting spelling errors rely on a simple notion called a “confusion set”: the confusion set of a confused word consists of its confusing words. In this article, we propose a new method of building Chinese character confusion sets.

Our method has two major phases. In the first phase, we build seed confusion sets for each Chinese character, based on similarity in pinyin or similarity in shape. All confusion sets in this phase are constructed manually and organized into a graph, called a “seed confusion graph” (SCG), in which vertices denote characters and edges are pairs of the form (confused character, confusing character).
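
As a rough illustration of the graph described above, the following minimal sketch stores an SCG as a directed adjacency map. The class name `SeedConfusionGraph`, its methods, and the example pairs are assumptions made for illustration; they are not taken from the article.

```python
from collections import defaultdict


class SeedConfusionGraph:
    """Directed graph whose edges are (confused, confusing) character pairs.

    Hypothetical structure for illustration; the article does not
    prescribe a concrete representation.
    """

    def __init__(self):
        # Maps a confused character to the set of its confusing characters.
        self._edges = defaultdict(set)

    def add_pair(self, confused, confusing):
        if confused == confusing:
            raise ValueError("a character cannot be confused with itself")
        self._edges[confused].add(confusing)

    def confusion_set(self, confused):
        # Return a copy so callers cannot mutate the graph by accident.
        return set(self._edges.get(confused, ()))

    def edges(self):
        return sorted((c, t) for c, targets in self._edges.items() for t in targets)


scg = SeedConfusionGraph()
# Illustrative seed pairs (assumed, not from the article):
scg.add_pair("的", "地")  # similar pinyin: both read "de" as particles
scg.add_pair("的", "得")  # similar pinyin
scg.add_pair("己", "已")  # similar shape

print(scg.confusion_set("的"))
```

Keeping the edges directed preserves the distinction between the confused character and its confusing characters; a symmetric relation would need both directions added explicitly.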

In the second phase, we extend the SCG by acquiring more (confused character, confusing character) pairs from a large Chinese corpus. To do so, we use several word patterns to generate new confusion pairs and then verify the pairs before adding them to the SCG. Comprehensive experiments show that our method of extending confusion sets is effective. We also apply the confusion sets to Chinese spelling correction to demonstrate the utility of our method.
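
The abstract does not spell out the word patterns or the verification step, so the sketch below only illustrates the generate-then-verify shape of the second phase under stated assumptions. It assumes a pre-segmented corpus, a hypothetical pattern (an out-of-lexicon word that differs from a lexicon word in exactly one character), and a hypothetical verification rule (a minimum occurrence count). The names `candidate_pairs`, `extend_graph`, and `min_count` are invented for this example.

```python
from collections import Counter, defaultdict


def candidate_pairs(corpus_words, lexicon):
    """Count (confused, confusing) candidates.

    Assumed pattern: a corpus word that is not in the lexicon and differs
    from a lexicon word of the same length in exactly one position.
    """
    counts = Counter()
    for word in corpus_words:
        if word in lexicon:
            continue  # correctly written words produce no candidates
        for known in lexicon:
            if len(known) != len(word):
                continue
            diffs = [(a, b) for a, b in zip(word, known) if a != b]
            if len(diffs) == 1:
                counts[diffs[0]] += 1
    return counts


def extend_graph(graph, corpus_words, lexicon, min_count=2):
    """Add candidates that pass the assumed frequency check to the graph."""
    for (confused, confusing), n in candidate_pairs(corpus_words, lexicon).items():
        if n >= min_count:
            graph[confused].add(confusing)
    return graph


# Toy data, assumed for illustration only.
lexicon = {"已经", "自己"}
corpus_words = ["己经", "己经", "自已", "已经"]

graph = defaultdict(set)
graph["的"].add("地")  # an existing seed edge
extend_graph(graph, corpus_words, lexicon)
print(dict(graph))
```

In this toy run the pair (己, 已) is seen twice and is added, while (已, 己) is seen once and is rejected by the threshold. A real verification step would likely use stronger evidence than a raw count.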

Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 16, Issue 1
TALLIP Notes and Regular Papers
March 2017
133 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/2961867

Copyright © 2016 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 22 July 2016
• Revised: 1 May 2016
• Accepted: 1 May 2016
• Received: 1 March 2015

Qualifiers

• research-article
• Research
• Refereed