Abstract
In natural language, people often misuse a word (called a “confused word”) in place of other words (called “confusing words”). In misspelling correction, many approaches to detecting and correcting errors are based on a simple notion called a “confusion set”: the set of confusing words associated with a given confused word. In this article, we propose a new method of building Chinese character confusion sets.
Our method is composed of two major phases. In the first phase, we build seed confusion sets for each Chinese character based on similarity in character pinyin or in character shape. In this phase, all confusion sets are constructed manually and organized into a graph, called a “seed confusion graph” (SCG), in which vertices denote characters and edges are pairs of the form (confused character, confusing character).
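As a minimal sketch of the graph described in the first phase, an SCG can be held as a directed adjacency map from each confused character to its confusing characters. The class name `SeedConfusionGraph`, its methods, and the example pairs are illustrative assumptions, not taken from the article.

```python
from collections import defaultdict


class SeedConfusionGraph:
    """Hypothetical container: directed edges (confused char -> confusing char)."""

    def __init__(self):
        # Maps a confused character to the set of its confusing characters.
        self._edges = defaultdict(set)

    def add_pair(self, confused, confusing):
        """Add one edge of the form (confused character, confusing character)."""
        self._edges[confused].add(confusing)

    def confusion_set(self, confused):
        """Return a copy of the confusion set of a character (empty if unknown)."""
        return set(self._edges.get(confused, set()))

    def has_pair(self, confused, confusing):
        return confusing in self._edges.get(confused, set())

    def num_pairs(self):
        return sum(len(targets) for targets in self._edges.values())


# Illustrative seed pairs (assumed examples, not data from the article):
scg = SeedConfusionGraph()
scg.add_pair("在", "再")  # similar pinyin
scg.add_pair("己", "已")  # similar shape
scg.add_pair("己", "巳")  # similar shape
```

Storing edges as directed pairs keeps the distinction between the confused and the confusing character; a symmetric relation would need both directions added explicitly.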
In the second phase, we extend the SCG by acquiring more (confused character, confusing character) pairs from a large Chinese corpus. To do so, we use several word patterns to generate new confusion pairs and then verify the pairs before adding them to the SCG. Comprehensive experiments show that our method of extending confusion sets is effective. We also apply the confusion sets to Chinese misspelling correction to show the utility of our method.
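The abstract does not specify the word patterns or the verification step, so the following is only a rough sketch under stated assumptions: a "pattern" is taken to be a lexicon word matched against a corpus token that differs from it in exactly one character, and "verification" is a simple frequency threshold. The functions `generate_candidates` and `extend_graph`, the `min_count` parameter, and the toy data are all hypothetical.

```python
from collections import Counter


def generate_candidates(corpus_tokens, lexicon):
    """Count (written char, intended char) pairs from one-character mismatches.

    Assumption: a token outside the lexicon that differs from a lexicon word
    in exactly one position is treated as a possible misspelling of that word.
    """
    candidates = Counter()
    for token in corpus_tokens:
        if token in lexicon:
            continue
        for word in lexicon:
            if len(word) != len(token):
                continue
            diffs = [(t, w) for t, w in zip(token, word) if t != w]
            if len(diffs) == 1:
                candidates[diffs[0]] += 1
    return candidates


def extend_graph(graph, candidates, min_count=2):
    """Add candidate pairs seen at least min_count times (assumed verification)."""
    added = []
    for (confused, confusing), count in candidates.items():
        if count >= min_count and confusing not in graph.get(confused, set()):
            graph.setdefault(confused, set()).add(confusing)
            added.append((confused, confusing))
    return added


# Toy data (illustrative only): seed graph as a dict of sets.
graph = {"在": {"再"}}
lexicon = {"已经", "再见"}
corpus_tokens = ["己经", "己经", "在见", "已经", "天气"]

candidates = generate_candidates(corpus_tokens, lexicon)
added = extend_graph(graph, candidates, min_count=2)
```

A real system would replace the frequency threshold with whatever verification the full article describes; the threshold here only marks where that step sits in the pipeline.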
A Seed-Based Method for Generating Chinese Confusion Sets