
Chinese Spelling Checker Based on an Inverted Index List with a Rescoring Mechanism

Published: 11 November 2015

Abstract

This article proposes an approach to Chinese spelling error detection and correction that uses an inverted index list with a rescoring mechanism. The inverted index list maps words to the desired sentence and represents the nodes of lattices constructed through character expansion, in which input characters are expanded according to predefined sets of phonologically and visually similar characters. Pruning based on a contextual dependency confidence measure markedly reduces the search space and computational complexity. The mapping between the original input and the desired input is obtained using a scoring mechanism composed of a class-based language model and a maximum entropy correction model containing character, word, and contextual features. The proposed method was evaluated on the data sets provided by the SIGHAN-7 bake-off. The experimental results show that the method achieved acceptable performance in terms of recall or precision in error sentence detection and error location detection, and that it outperformed other approaches in error location detection and correction.
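The pipeline summarized in the abstract (character expansion into a lattice, an inverted index over the expanded characters, then scoring of candidate sentences) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the confusion set, the lexicon, the toy log-scores, and the function names are hypothetical, and a unigram word score stands in for the class-based language model and maximum entropy model. The sketch also omits pruning and enumerates every lattice path, which is only feasible for very short inputs.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy confusion sets (similar-sounding or similar-looking characters).
CONFUSION = {"在": {"再"}, "再": {"在"}}

# Hypothetical lexicon with made-up log-scores; a stand-in for the paper's models.
LEXICON = {"再見": -1.0, "在": -2.0, "再": -2.5, "見": -3.0}


def build_inverted_index(sentence, confusion):
    """Map each candidate character to the input positions where it may appear."""
    index = defaultdict(set)
    for pos, ch in enumerate(sentence):
        index[ch].add(pos)
        for alt in confusion.get(ch, ()):
            index[alt].add(pos)
    return index


def lattice_columns(sentence, index):
    """Turn the inverted index back into one list of candidates per position."""
    columns = [[] for _ in sentence]
    for ch, positions in index.items():
        for pos in positions:
            columns[pos].append(ch)
    return [sorted(col) for col in columns]


def score(candidate, lexicon, oov=-10.0, max_word_len=4):
    """Best word-segmentation score of a candidate, by dynamic programming."""
    n = len(candidate)
    best = [float("-inf")] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = candidate[start:end]
            if word in lexicon:
                word_score = lexicon[word]
            elif end - start == 1:
                word_score = oov  # unknown single character
            else:
                continue  # unknown multi-character strings are not words
            best[end] = max(best[end], best[start] + word_score)
    return best[n]


def correct(sentence, confusion, lexicon):
    """Return the highest-scoring path through the expanded lattice."""
    index = build_inverted_index(sentence, confusion)
    columns = lattice_columns(sentence, index)
    candidates = ("".join(chars) for chars in product(*columns))
    return max(candidates, key=lambda c: score(c, lexicon))


if __name__ == "__main__":
    print(correct("在見", CONFUSION, LEXICON))
```

In this toy setup the misspelled input 在見 is expanded to the candidates 再見 and 在見, and the lexicon score favors 再見. The exhaustive enumeration grows exponentially with sentence length, which is the cost that the pruning step described in the abstract is meant to reduce.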

