research-article

General and Domain-adaptive Chinese Spelling Check with Error-consistent Pretraining

Published: 09 May 2023

Abstract

The lack of labeled data is one of the significant bottlenecks for Chinese Spelling Check. Existing research expands the supervised corpus by automatically generating training examples from unlabeled data. However, there is a large gap between real input scenarios and automatically generated corpora. We therefore develop a competitive general speller, ECSpell, which adopts an error-consistent masking strategy to create data for pretraining. This strategy constrains the error types of automatically generated sentences to be consistent with those found in real input. Experimental results indicate that our model outperforms previous state-of-the-art models on the general benchmark.
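As a rough illustration of generating errors whose types follow an assumed real-world mix, here is a minimal sketch. It is not the authors' implementation: the confusion sets, the error-type weights, and the function name `corrupt` are all hypothetical, chosen only to make the idea concrete.

```python
import random

# Toy confusion sets (hypothetical; a real system would load far larger ones).
PHONETIC = {"在": ["再"], "做": ["作"], "的": ["得", "地"]}
VISUAL = {"己": ["已"], "未": ["末"], "日": ["曰"]}

# Assumed share of each error type in real input; not a figure from the paper.
ERROR_TYPE_WEIGHTS = {"phonetic": 0.7, "visual": 0.3}


def corrupt(sentence, rate=0.15, rng=None):
    """Return (noisy_sentence, labels), where labels[i] is the original character."""
    rng = rng or random.Random()
    tables = {"phonetic": PHONETIC, "visual": VISUAL}
    types = list(ERROR_TYPE_WEIGHTS)
    weights = [ERROR_TYPE_WEIGHTS[t] for t in types]
    noisy = []
    for ch in sentence:
        if rng.random() < rate:
            # Choose the error type first, so the generated mix of error
            # types follows the assumed real-world distribution.
            kind = rng.choices(types, weights=weights, k=1)[0]
            candidates = tables[kind].get(ch)
            if candidates:
                noisy.append(rng.choice(candidates))
                continue
        # Not selected, or no confusable character of that type: keep as is.
        noisy.append(ch)
    return "".join(noisy), list(sentence)
```

Calling `corrupt("我在做自己的事", rate=0.3)` yields a noisy sentence of the same length plus the original characters as per-position labels, which is the shape of data a token classification-based speller would train on.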

Moreover, spellers often work within a particular domain in practice. Experiments on the domain-specific datasets we built show that general models perform poorly there, largely because of the many uncommon domain terms. Inspired by the common practice of input methods, we propose adding an alterable user dictionary to handle the zero-shot domain-adaptation problem. Specifically, we attach a User Dictionary guided inference module (UD) to a general token classification-based speller. Our experiments demonstrate that ECSpellUD, namely ECSpell combined with UD, surpasses all other baselines across the board, even approaching the performance on the general benchmark.
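One way to picture dictionary-guided inference on top of a token classifier is the sketch below. It is an assumption-laden toy, not the UD module from the paper: the function `ud_guided_decode`, the additive `bonus`, and the candidate probabilities are all hypothetical. Each position carries the speller's candidate characters with scores, and a span is rewritten to a user-dictionary term when the term's total score plus the bonus beats the current choice.

```python
def ud_guided_decode(candidates, user_dict, bonus=0.5):
    """Pick one character per position, favoring spans that spell a dictionary term.

    candidates[i] maps each candidate character at position i to the
    (hypothetical) probability the base speller assigned to it.
    """
    # Start from the speller's own top-1 choice at each position.
    output = [max(c, key=c.get) for c in candidates]
    scores = [c[ch] for c, ch in zip(candidates, output)]
    # Try longer terms first so they are not pre-empted by shorter ones.
    for term in sorted(user_dict, key=len, reverse=True):
        n = len(term)
        for start in range(len(candidates) - n + 1):
            window = candidates[start:start + n]
            # The term is reachable only if every character is a candidate.
            if not all(ch in c for ch, c in zip(term, window)):
                continue
            term_scores = [c[ch] for ch, c in zip(term, window)]
            if sum(term_scores) + bonus > sum(scores[start:start + n]):
                output[start:start + n] = list(term)
                scores[start:start + n] = term_scores
    return "".join(output)


# Hypothetical speller output for the input "心机梗塞".
example = [
    {"心": 0.9},
    {"机": 0.6, "肌": 0.3},
    {"梗": 0.9},
    {"塞": 0.9},
]
```

With an empty dictionary, `ud_guided_decode(example, set())` keeps the speller's top choices; adding the medical term `"心肌梗塞"` to the dictionary lets the lower-ranked candidate `肌` win without retraining the model, which is the zero-shot behavior the abstract describes.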


    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 5
      May 2023
      653 pages
      ISSN:2375-4699
      EISSN:2375-4702
      DOI:10.1145/3596451

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 9 May 2023
      • Online AM: 21 September 2022
      • Accepted: 14 September 2022
      • Revised: 21 July 2022
      • Received: 16 March 2022