Abstract
Chinese spelling check (CSC) remains an unsolved problem because Chinese has many characters that sound alike (homophones) or look alike. A growing number of CSC systems have been proposed in recent years. To the best of our knowledge, language modeling is a major component of these systems because of its simplicity and moderately good predictive power. A close analysis of this line of research shows, however, that most systems employ only conventional n-gram language models. The contributions of this article are threefold. First, we propose a novel probabilistic framework for CSC that naturally combines several important components, such as the substitution model and the language model, so as to inherit their individual merits and overcome their limitations. Second, we incorporate topic language models into the CSC system in an unsupervised fashion. Topic language models can capture long-span semantic information from a word (character) string, whereas conventional n-gram language models preserve only local regularity information. Third, we further integrate Web resources into the proposed framework to enhance overall performance. Our rigorous empirical experiments demonstrate the consistent performance and utility of the proposed framework in the CSC task.
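The abstract does not give the framework's equations, so the following is only a minimal sketch of how a substitution model and a language model might be combined, assuming a noisy-channel style score: log P(observed | candidate) + log P(candidate). The confusion sets, the toy corpus, the add-one smoothed character bigram model, and the `keep_prob` value are all hypothetical stand-ins chosen for illustration, not the article's actual components.

```python
import itertools
import math
from collections import Counter

# Hypothetical, symmetric confusion sets: characters easily mistaken for each key.
CONFUSION = {"再": ["在"], "在": ["再"]}

# Tiny toy corpus standing in for real language model training text.
CORPUS = ["我在家", "他在家", "你在家", "再見"]


def train_bigram(corpus):
    """Count character unigrams and bigrams, with a sentence-start symbol."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        chars = ["<s>"] + list(sentence)
        unigrams.update(chars)
        bigrams.update(zip(chars, chars[1:]))
    return unigrams, bigrams


def lm_logprob(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a character sequence."""
    vocab = len(unigrams)
    chars = ["<s>"] + list(sentence)
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
        for a, b in zip(chars, chars[1:])
    )


def substitution_logprob(observed, intended, keep_prob=0.8):
    """log P(observed | intended); keep_prob is an arbitrary toy value."""
    if observed == intended:
        return math.log(keep_prob)
    alternatives = CONFUSION.get(intended, [])
    if observed in alternatives:
        return math.log((1.0 - keep_prob) / len(alternatives))
    return float("-inf")


def correct(observed, unigrams, bigrams):
    """Return the candidate string with the best combined log score."""
    # Because the confusion sets are symmetric, the candidates for an observed
    # character are the character itself plus its confusion set.
    options = [[ch] + CONFUSION.get(ch, []) for ch in observed]
    best, best_score = observed, float("-inf")
    for candidate in itertools.product(*options):
        score = lm_logprob(candidate, unigrams, bigrams) + sum(
            substitution_logprob(o, c) for o, c in zip(observed, candidate)
        )
        if score > best_score:
            best, best_score = "".join(candidate), score
    return best
```

With this toy data, `correct("我再家", *train_bigram(CORPUS))` returns `"我在家"`, while the already correct `"再見"` is left unchanged. The exhaustive search over candidates is only practical for short strings; a real system would more plausibly use dynamic programming such as Viterbi decoding, and could interpolate a topic language model into the language model term.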
A Probabilistic Framework for Chinese Spelling Check