research-article

A Probabilistic Framework for Chinese Spelling Check

Published: 11 November 2015

Abstract

Chinese spelling check (CSC) remains an unsolved problem because Chinese contains many homonymous or homomorphous characters. A growing number of CSC systems have been proposed in recent years. To the best of our knowledge, language modeling is one of the major components of these systems because of its simplicity and moderately good predictive power. A close analysis of this line of research shows that most systems employ only conventional n-gram language models. The contributions of this article are threefold. First, we propose a novel probabilistic framework for CSC that naturally combines several important components, such as the substitution model and the language model, to inherit their individual merits and to overcome their limitations. Second, we incorporate topic language models into the CSC system in an unsupervised fashion. Topic language models can capture long-span semantic information from a word (character) string, whereas conventional n-gram language models preserve only local regularity information. Third, we further integrate Web resources into the proposed framework to enhance overall performance. Our rigorous empirical experiments demonstrate the consistent performance and the utility of the proposed framework on the CSC task.
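
As an illustration of how a substitution model and a language model might be combined, the following is a minimal sketch in the noisy-channel style. The abstract does not give the actual formulation, so everything here is an assumption: the confusion sets, all probability values, the linear interpolation weight `lam`, and the function names `substitution_prob`, `lm_prob`, `score`, and `correct` are hypothetical. The topic model is stood in for by a fixed per-character distribution, which is far simpler than a real topic language model.

```python
import math
from itertools import product

# Hypothetical confusion sets: characters each observed character may stand for.
CONFUSION = {"在": ["在", "再"], "見": ["見"]}

# Hypothetical bigram language model P(cur | prev); toy values only.
BIGRAM = {
    ("<s>", "再"): 0.02, ("再", "見"): 0.30,
    ("<s>", "在"): 0.05, ("在", "見"): 0.001,
}
# Hypothetical stand-in for a topic language model P(char | topics of the text).
TOPIC = {"再": 0.04, "見": 0.03, "在": 0.02}
FLOOR = 1e-6  # probability assigned to unseen events


def substitution_prob(observed, intended):
    """Toy substitution model P(observed | intended); not normalized."""
    if observed == intended:
        return 0.9
    return 0.1 if intended in CONFUSION.get(observed, []) else 0.0


def lm_prob(prev, cur, lam=0.7):
    """Linear interpolation of the bigram and topic estimates (an assumption)."""
    return lam * BIGRAM.get((prev, cur), FLOOR) + (1 - lam) * TOPIC.get(cur, FLOOR)


def score(observed, candidate):
    """Log of substitution probability times language model probability."""
    logp, prev = 0.0, "<s>"
    for o, c in zip(observed, candidate):
        sub = substitution_prob(o, c)
        if sub == 0.0:
            return -math.inf
        logp += math.log(sub) + math.log(lm_prob(prev, c))
        prev = c
    return logp


def correct(observed):
    """Return the highest-scoring candidate built from the confusion sets."""
    options = [CONFUSION.get(ch, [ch]) for ch in observed]
    candidates = ("".join(chars) for chars in product(*options))
    return max(candidates, key=lambda cand: score(observed, cand))


print(correct("在見"))
```

With these toy numbers the language model evidence for the candidate 再見 outweighs the penalty from the substitution model, so the misspelled input 在見 is corrected. Exhaustive enumeration of candidates is only practical for very short strings; a real system would need a more efficient search.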

    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 14, Issue 4
      Special Issue on Chinese Spell Checking
      October 2015
      92 pages
      ISSN: 2375-4699
      EISSN: 2375-4702
      DOI: 10.1145/2845556

      Copyright © 2015 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 11 November 2015
      • Accepted: 1 April 2015
      • Revised: 1 January 2015
      • Received: 1 August 2014

      Qualifiers

      • research-article
      • Research
      • Refereed
