Abstract
The lack of labeled data is one of the significant bottlenecks for Chinese Spelling Check. Existing research expands the supervised corpus by automatically generating training data from unlabeled text. However, there is a large gap between real input scenarios and the automatically generated corpus. We therefore develop a competitive general speller, ECSpell, which adopts an error-consistent masking strategy to create data for pretraining. This strategy constrains the error types of the automatically generated sentences to be consistent with those found in real scenarios. Experimental results indicate that our model outperforms previous state-of-the-art models on the general benchmark.
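One way to picture the idea of keeping generated error types consistent with real ones is the sketch below. It is an illustration under stated assumptions, not the procedure used by ECSpell: the confusion sets, the two error-type names, the `corrupt` function, and the `rate` parameter are all hypothetical, and the abstract does not specify how the masking is actually carried out.

```python
import random
from typing import Dict, List

# Hypothetical confusion sets, keyed by error type (assumption for illustration).
CONFUSION: Dict[str, Dict[str, List[str]]] = {
    "phonetic": {"林": ["淋", "临"]},  # characters with the same pronunciation
    "visual": {"林": ["材"]},          # characters with a similar shape
}


def corrupt(
    sentence: str,
    type_probs: Dict[str, float],
    rng: random.Random,
    rate: float = 0.15,
) -> str:
    """Replace characters with confusable ones, sampling the error type
    from `type_probs` (assumed to be estimated from real labeled errors)."""
    types = list(type_probs)
    weights = [type_probs[t] for t in types]
    out = []
    for ch in sentence:
        if rng.random() < rate:
            # Choose the error type according to the real-world distribution.
            err_type = rng.choices(types, weights=weights, k=1)[0]
            options = CONFUSION[err_type].get(ch)
            if options:
                out.append(rng.choice(options))
                continue
        # Keep the character when it is not selected or has no confusable.
        out.append(ch)
    return "".join(out)
```

With `type_probs` set to the proportions observed in real annotated errors, the synthetic corpus contains phonetic and visual errors in roughly the same mix as real input.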
Moreover, in real life spellers often work within a particular domain. Experiments on the domain-specific datasets we built show that general models perform poorly because of the many uncommon domain terms. Inspired by the common practice of input methods, we propose adding an alterable user dictionary to handle the zero-shot domain-adaptation problem. Specifically, we attach a User Dictionary guided inference module (UD) to a general token classification-based speller. Our experiments demonstrate that ECSpellUD, namely ECSpell combined with UD, surpasses all the other baselines broadly, even approaching the performance on the general benchmark.
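The following is a minimal sketch of how a user dictionary might steer the output of a token classification-based speller. It is an assumption-laden illustration, not the UD module itself: the `Candidates` structure, the `decode_with_user_dict` function, the additive `bonus`, and the greedy longest-term-first matching are all hypothetical choices made for the example.

```python
from typing import Dict, List, Set

# Hypothetical input: for each position, the speller's candidate characters
# and their probabilities.
Candidates = List[Dict[str, float]]


def decode_with_user_dict(
    candidates: Candidates, user_dict: Set[str], bonus: float = 0.5
) -> str:
    """Greedy decoding that prefers user-dictionary terms when every
    character of the term is among the speller's candidates."""
    n = len(candidates)
    # Start from the speller's own top-1 prediction at each position.
    output = [max(c, key=c.get) for c in candidates]
    locked = [False] * n
    # Try longer terms first; break ties alphabetically for determinism.
    for term in sorted(user_dict, key=lambda t: (-len(t), t)):
        m = len(term)
        for start in range(n - m + 1):
            span = range(start, start + m)
            if any(locked[i] for i in span):
                continue
            if not all(term[k] in candidates[start + k] for k in range(m)):
                continue
            term_score = bonus + sum(
                candidates[start + k][term[k]] for k in range(m)
            )
            current_score = sum(candidates[i][output[i]] for i in span)
            # Accept the dictionary term only if, with the bonus, it scores
            # at least as well as the current prediction for the span.
            if term_score >= current_score:
                for k in range(m):
                    output[start + k] = term[k]
                    locked[start + k] = True
    return "".join(output)
```

For instance, if the speller slightly prefers 淋 over 林 in the last position of 阿司匹林 (aspirin), adding that term to `user_dict` flips the decision, while an empty dictionary leaves the speller's prediction unchanged. Because the dictionary is consulted only at inference time, it can be swapped per domain without retraining.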
General and Domain-adaptive Chinese Spelling Check with Error-consistent Pretraining