short-paper
Open Access

Chinese Grammatical Error Correction Using Pre-trained Models and Pseudo Data

Published: 10 March 2023


Abstract

In recent studies, pre-trained models and pseudo data have been key factors in improving the performance of the English grammatical error correction (GEC) task. However, few studies have examined the role of pre-trained models and pseudo data in the Chinese GEC task. Therefore, we develop Chinese GEC models based on three pre-trained models: Chinese BERT, Chinese T5, and Chinese BART, and then incorporate these models with pseudo data to determine the best configuration for the Chinese GEC task. On the natural language processing and Chinese computing (NLPCC) 2018 GEC shared task test set, all our single models outperform the ensemble models developed by the top team of the shared task. Chinese BART achieves an F0.5 score of 37.15, which is a state-of-the-art result. We then combine our Chinese GEC models with three kinds of pseudo data: Lang-8 (MaskGEC), Wiki (MaskGEC), and Wiki (Backtranslation). We find that most models benefit from pseudo data and that BART+Lang-8 (MaskGEC) is the ideal setting in terms of accuracy and training efficiency. The experimental results demonstrate the effectiveness of pre-trained models and pseudo data on the Chinese GEC task and provide an easily reproducible and adaptable baseline for future work. Finally, we annotate the error types of the development data; the results show that word-level errors dominate all error types and that word selection errors must be addressed even when using pre-trained models and pseudo data. Our code is available at https://github.com/wang136906578/BERT-encoder-ChineseGEC.


1 INTRODUCTION

Grammatical error correction (GEC) is the task of correcting a variety of grammatical errors in text, typically written by non-native speakers. To date, many models based on the encoder–decoder (EncDec) architecture have been proposed for GEC and have achieved human-parity performance, particularly on several benchmark datasets for English [3, 9]. The key factors behind this performance improvement are the use of pre-trained models [11, 12, 23] and pseudo data [14, 34], because EncDec models require a large amount of training data, yet GEC has far less available data than machine translation (MT).

In contrast to the rapid progress of research on English GEC, few studies on Chinese GEC, where available data is even more limited than in English, have investigated the methodologies for incorporating pre-trained models and pseudo data into GEC models. For pre-trained models, a limited number of studies have used BERT [1, 6, 16], although new Chinese pre-trained models are developed and released continuously [28, 29]. Moreover, Wang et al. [33] is one of the few studies that incorporated pseudo data into Chinese GEC models. Although they combined both rule-based and backtranslation methods to generate the pseudo data for Chinese GEC, they used non-public data to generate the pseudo data. Therefore, it is difficult to analyze the contribution of pseudo data to the final performance. It has also been reported that suitable settings for pseudo data utilization in GEC vary depending on language [20], suggesting that the best practices in English GEC cannot directly be applied to Chinese GEC.

This study comprehensively investigates methodologies for utilizing pre-trained models and pseudo data in Chinese GEC and provides the Chinese GEC community with an improved understanding of the incorporation of pre-trained models and pseudo data. Through extensive experiments with three large-scale pre-trained models (Chinese BERT [4], Chinese T5 [29], Chinese BART [28]), and three types of pseudo data (Lang-8 (MaskGEC), Wiki (MaskGEC), and Wiki (Backtranslation)), we show that BART offers the best performance, and BART+Lang-8 (MaskGEC) is the ideal setting in terms of accuracy and training efficiency. Additionally, we annotate the error types of the development data; the results show that word-level errors dominate all error types, and word selection errors must be addressed even when incorporating pre-trained models and pseudo data.


2 RELATED WORK

2.1 English GEC Using Pre-trained Model and Pseudo Data

For English GEC tasks, BERT [5] is primarily used as a pre-trained model to improve the performance. Additionally, large-scale pseudo data are shown to contribute to the accuracy. In this subsection, we summarize some details about the works that attempted to incorporate BERT and pseudo data into their correction models.

Pre-trained model as a feature. Kaneko et al. [10] first fine-tuned BERT on a learner corpus and then employed the word probabilities provided by BERT as re-ranking features. Using BERT for re-ranking, they obtained an improvement of approximately 0.7 points in the \(\mathrm{F_{0.5}}\) score. By contrast, Kaneko et al. [11] first fine-tuned BERT on a grammatical error diagnosis task and then incorporated the fine-tuned BERT into the correction model using a fusion method. They showed the effectiveness of BERT on the English GEC task and achieved comparatively high scores.

Pre-trained model in a pipeline. Kantor et al. [12] solved the GEC task by iteratively querying BERT as a black-box language model. They added a [MASK] token to source sentences and predicted the word represented by the [MASK] token. If the word probability predicted by BERT exceeded a threshold, the word was output as a correction candidate. Using BERT, they obtained an improvement of 0.27 points in the \(\mathrm{F_{0.5}}\) score. Omelianchuk et al. [23] treated GEC as a sequence editing problem. They used a BERT-based pre-trained model to predict edit operations for an erroneous sentence, and the predicted edit operations were then applied to correct it.
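The iterative querying idea can be sketched as follows; the `predict_fn` callback, the threshold value, and the function name are our own illustrative stand-ins, not the implementation of Kantor et al. [12]:

```python
def propose_corrections(tokens, predict_fn, threshold=0.9):
    """Sketch of black-box querying: mask each position in turn, ask a
    masked language model for its best (word, probability) guess, and
    keep the guess as a correction candidate only if the model is
    confident and the guess differs from the original token."""
    candidates = []
    for i in range(len(tokens)):
        masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        word, prob = predict_fn(masked, i)  # query the LM at position i
        if prob >= threshold and word != tokens[i]:
            candidates.append((i, word))
    return candidates
```

In the real pipeline, `predict_fn` would be a BERT forward pass over the masked sentence.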

Generating Pseudo Data for GEC. Xie et al. [34] proposed a method for generating pseudo data based on backtranslation. They first trained a backtranslation model on GEC data and then applied it to a clean monolingual corpus to acquire pseudo data. They also adopted noising methods to generate diverse pseudo data. Kiyono et al. [14] conducted experiments focusing on three aspects: generation methods, selection of the seed corpus, and optimization settings for English GEC pseudo data. They showed how pseudo data should be generated and used for the English GEC task and achieved state-of-the-art performance at the time of publication.

2.2 Chinese GEC

In this subsection, we first describe the NLPCC 2018 Chinese GEC dataset and then provide details about five methods that have been evaluated on it.

Given the success of the shared tasks on English GEC at the Conference on Natural Language Learning (CoNLL) [21, 22], a Chinese GEC shared task was introduced at NLPCC 2018. In this task, approximately one million sentences from the language learning website Lang-8 were used as training data, and two thousand sentences from the PKU Chinese Learner Corpus [38] were used as test data. Two types of supervised GEC models have been trained on this dataset: simple and complex models. Simple models are easy to understand and use but less effective, whereas complex models achieve high accuracy but are hard to maintain. Our approaches, which use pre-trained models and pseudo data, offer the best of both worlds: they are simple and effective.

Simple models. Ren et al. [26] utilized a convolutional neural network (CNN) similar to that of Chollampatt and Ng [3]. However, because the structure of the CNN differs from that of BERT, it cannot be initialized with the weights learned by BERT. Zhao and Wang [39] proposed a dynamic masking method that replaces tokens in the source sentences of the NLPCC 2018 GEC shared task training data with other tokens (e.g., the [PAD] token). They achieved comparatively high scores on the shared task without using any extra knowledge. This data augmentation method can be combined with methods that utilize pre-trained models.

Complex models. Fu et al. [7] combined a 5-gram language-model-based spell checker with subword-level and character-level encoder–decoder models using Transformer to obtain five types of outputs, which they then re-ranked using the language model. Although they reported high performance, several models were required, and their method of combining these models was complex. Hinson et al. [8] proposed a heterogeneous approach for Chinese GEC in which an erroneous sentence is corrected by a spell checker model, a sequence editing model, and a sequence-to-sequence model over multiple rounds. Before their work, only sequence-to-sequence models were used for recycle generation in Chinese GEC [24]. They also used an automatic annotator to label four error types and evaluated model performance on them. However, they only used character-level edit operations as the error types, which may not appropriately reflect the nature of Chinese grammatical errors. Chen et al. [2] proposed a method comprising a sequence tagging error detection model and a sequence-to-sequence error correction model: the tagging model identifies erroneous text spans in the source sentence, and the sequence-to-sequence model corrects the detected spans. Their method performs comparably to conventional sequence-to-sequence methods at less than 50% of the inference time cost. Sun et al. [30] proposed a shallow aggressive decoding method to improve the online inference speed of GEC models, offering a 12.0\(\times\) online inference speedup over the baseline model on the Chinese GEC task.


3 CHINESE PRE-TRAINED MODELS

We adopt three pre-trained models to construct our Chinese GEC models: Chinese BERT built by Cui et al. [4], Chinese T5 built by Su [29], and Chinese BART built by Shao et al. [28]. These models were originally proposed by Devlin et al. [5], Raffel et al. [25], and Lewis et al. [15], respectively. The main details and differences among the three pre-trained models are summarized in Table 1; all details pertain to the Chinese variants [4, 28, 29] rather than the originals [5, 15, 25]. From the table, we can observe that the three pre-trained models differ in the following aspects:

Table 1. Summary of Pre-trained Models Used in Our Study

         BERT                       T5                         BART
Arch.    Transformer Encoder,       Full Transformer,          Full Transformer,
         12-layer, 768-hidden,      12-layer, 768-hidden,      12-layer, 1024-hidden,
         12-head                    12-head                    16-head
Param.   110M                       275M                       406M
Tok.     Character                  Word/Character             Character
Vocab.   21,128                     50,000                     21,128
Mask     Whole Word Masking         -                          Token Infilling
Task     Masked Language Model      Summarization              Denoising Auto-Encoding
Data     5.4B Tokens                30GB                       200GB

The architecture, number of parameters, tokenization, vocabulary size, masking strategy, pre-training task, and pre-training data size are presented.

Architecture and Number of Parameters. BERT has the fewest parameters because it uses only a Transformer encoder. Note that we initialize the encoder side of a full Transformer with BERT in the next experimental step; hence, the total number of parameters is larger than 110M. T5 and BART adopt the full Transformer architecture, and BART has the most parameters because it has the largest hidden size and number of attention heads.

Tokenization. The tokenization of BERT and BART is character-based: all Chinese strings are divided into characters. The tokenization of T5 is word/character-based; it uses a vocabulary containing the 50,000 most frequent words, and any out-of-vocabulary word is divided into characters.

Masking Strategy. BERT adopts whole word masking (WWM): when a Chinese character is masked, the other characters belonging to the same word are also masked. BART adopts token infilling, in which a whole word is replaced by a single [MASK] token when one of its characters is masked. T5 does not employ a masking strategy.
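The difference between the two strategies can be illustrated with a toy sketch over a pre-segmented sentence; the segmentation, masking probability, and function names are our own illustration, not the actual pre-training code:

```python
import random

def whole_word_mask(words, p=0.15, seed=0):
    """BERT-style WWM (sketch): when a word is chosen, every character
    of that word is replaced by its own [MASK] token."""
    rng = random.Random(seed)
    out = []
    for word in words:  # each word is a string of one or more characters
        if rng.random() < p:
            out.extend(["[MASK]"] * len(word))  # one [MASK] per character
        else:
            out.extend(list(word))
    return out

def token_infilling(words, p=0.15, seed=0):
    """BART-style token infilling (sketch): a chosen word is replaced
    by a single [MASK] token regardless of its length."""
    rng = random.Random(seed)
    out = []
    for word in words:
        if rng.random() < p:
            out.append("[MASK]")  # the whole word collapses to one token
        else:
            out.extend(list(word))
    return out
```

For the two-character word 天气, WWM emits two [MASK] tokens whereas token infilling emits one, so the infilling model must also recover the word length.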

Pre-training Task. For BERT, Cui et al. [4] removed the next sentence prediction task and used only the masked language model task, following Liu et al. [17]. T5 adopts summarization as its pre-training task, following Zhang et al. [37]; the input is a document, and the output is its summary. BART employs denoising autoencoding (DAE), in which the model reconstructs the original document from a corrupted input.

Pre-training Data. BERT uses Chinese Wikipedia (0.4B tokens) and an extended corpus (5.0B tokens) consisting of Baidu Baike (a Chinese encyclopedia) and question-and-answer data. T5 uses 30 GB of pre-training data collected from the internet. BART adopts pre-training data containing 200 GB of text from Chinese Wikipedia and part of WuDaoCorpora [36].


4 GENERATING PSEUDO DATA FOR CHINESE GEC

Although pseudo data generation methods for English GEC have been extensively studied [14], few studies have examined the effect of pseudo data on the performance of a seq2seq model for Chinese GEC. Wang et al. [33] combined rule-based and backtranslation methods to generate pseudo data. However, they mixed the pseudo data generated by the two methods and used non-public data for generation; hence, it is difficult to analyze the contribution of the pseudo data to the final performance. Therefore, we conduct a series of thorough experiments to investigate the effects of pseudo data generated by rule-based and backtranslation methods when combined with pre-trained models.

4.1 Rule-based Method (MaskGEC)

We use the rule-based method MaskGEC [39] to generate rule-based pseudo data. The pseudo data are generated by replacing tokens in the original sentence using one of four strategies: (1) the selected token is substituted with a padding symbol; (2) the selected token is substituted with a random token from the vocabulary; (3) the selected token is substituted with a token from the vocabulary sampled according to frequency; and (4) the selected token is substituted with a homophone sampled according to frequency. For every sentence in the corpus used to generate pseudo data, one of the four strategies is randomly applied; every character in the sentence is selected as a replacement candidate with probability \(\delta\). To help readers understand this algorithm, pseudo code is included in the supplementary materials. Dynamic and static masking strategies are both possible: in dynamic masking, the pseudo data are regenerated every epoch, so each training instance may be seen with a different mask in different epochs; in static masking, the pseudo data are generated only once, so each training instance remains unchanged across epochs. We adopt the static masking strategy in this work for simplicity, as our experimental results showed no apparent difference between the two strategies.
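The per-sentence noising step can be sketched as follows; the vocabulary, frequency list, and homophone table below are toy stand-ins for the resources the method assumes, and the function name is our own:

```python
import random

VOCAB = list("天气预报很好我们学习汉语")        # toy character vocabulary
FREQ_VOCAB = ["的", "是", "了", "我", "不"]     # toy frequency-ranked tokens
HOMOPHONES = {"气": ["汽", "器"], "好": ["号"]}  # toy homophone table

def noise_sentence(chars, delta=0.1, rng=None):
    """Apply one randomly chosen replacement strategy to a sentence;
    each character is selected for replacement with probability delta
    (static masking: this runs once per sentence, before training)."""
    rng = rng or random.Random()
    strategy = rng.randrange(4)  # one of the four strategies per sentence
    out = []
    for ch in chars:
        if rng.random() >= delta:
            out.append(ch)                                # keep the character
        elif strategy == 0:
            out.append("[PAD]")                           # (1) padding symbol
        elif strategy == 1:
            out.append(rng.choice(VOCAB))                 # (2) random vocab token
        elif strategy == 2:
            out.append(rng.choice(FREQ_VOCAB))            # (3) token by frequency
        else:
            out.append(rng.choice(HOMOPHONES.get(ch, [ch])))  # (4) homophone
    return out
```

With \(\delta = 0.1\), roughly one character in ten is replaced; under dynamic masking the same function would simply be re-run at every epoch.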

4.2 Backtranslation Method

Backtranslation was originally proposed by Sennrich et al. [27] to generate pseudo data for machine translation. In the GEC setting, the input of the backtranslation model is a correct sentence, and the output is an erroneous sentence. Following Xie et al. [34], we first train a backtranslation model on the Chinese GEC training data and then apply it to a seed corpus to generate pseudo data. For inference, we adopt the random noising method of Xie et al. [34] to generate diverse noise: every hypothesis is penalized by adding \(r\beta_{\mathit{random}}\) to its score, where r is drawn uniformly from the interval [0, 1] and \(\beta_{\mathit{random}}\) is a hyper-parameter greater than or equal to 0. A sufficiently large \(\beta_{\mathit{random}}\) results in a random shuffling of the hypothesis ranks, whereas \(\beta_{\mathit{random}}=0\) is identical to standard backtranslation. We set \(\beta_{\mathit{random}}=6\) following Kiyono et al. [14].
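A minimal sketch of the noising step, assuming higher-is-better hypothesis scores (so that "adding the penalty" corresponds to subtracting \(r\beta_{\mathit{random}}\) from the score); the function name and toy scores are our own:

```python
import random

def noised_rerank(hypotheses, beta_random=6.0, seed=0):
    """Re-rank beam hypotheses after penalizing each score by r*beta,
    with r drawn uniformly from [0, 1]. beta_random=0 reduces to
    standard backtranslation; a large beta_random shuffles the ranks."""
    rng = random.Random(seed)
    noised = []
    for text, score in hypotheses:
        r = rng.uniform(0.0, 1.0)  # fresh noise for each hypothesis
        noised.append((text, score - r * beta_random))
    return sorted(noised, key=lambda h: h[1], reverse=True)  # best first
```

The top-ranked hypothesis after noising is then taken as the synthetic erroneous sentence.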


5 EXPERIMENTS

To investigate methodologies for utilizing pre-trained models and pseudo data in Chinese GEC, we design our experiments by using three large-scale pre-trained models (described in Section 3) and two pseudo data generation methods (described in Section 4).

5.1 Data

We train and evaluate our models on the data provided by the NLPCC 2018 GEC shared task. We first segment all sentences into characters because the Chinese pre-trained models we use are character-based. The training data consist of 1.2 million sentence pairs extracted from the language learning website Lang-8. Because the shared task did not provide development data, we randomly extracted 5,000 sentences from the training data as development data, following Ren et al. [26]. The test data consist of 2,000 sentences extracted from the PKU Chinese Learner Corpus. According to Zhao et al. [38], the annotation guidelines follow the minimum edit distance principle [19], which selects the edit operations that minimize the edit distance from the original sentence.

Following Zhao and Wang [39], we use the source side of the NLPCC 2018 GEC training data to generate pseudo data. Before generation, we use a tokenization script from the BERT project to tokenize the Chinese text into characters while keeping non-Chinese tokens unchanged. We set the substitution probability \(\delta =0.1\), which achieves the best perplexity on the development data (the perplexity for each \(\delta\) is depicted in Figure 2 of the supplementary materials). This differs from Zhao and Wang [39], who set \(\delta =0.3\), which achieved the best \(\mathrm{F_{0.5}}\) score on the test set. We name the generated pseudo data Lang-8 (MaskGEC) in the remainder of this article. Note that we do not perform backtranslation for Lang-8 because the Lang-8 corpus contains only a few unannotated Chinese learners' sentences.

We also utilize Chinese Wikipedia as a seed corpus to generate pseudo data. We download the preprocessed Chinese Wikipedia data from nlp_chinese_corpus and use tools from Chinese-wikipedia-corpus-creator to split the documents into sentences, which yields approximately nine million sentences. We then apply the rule-based and backtranslation methods to those Wikipedia sentences, treating the generated noisy sentences as erroneous sentences and the original Wikipedia sentences as correct sentences. We refer to the pseudo data generated by the rule-based method as Wiki (MaskGEC) and to the pseudo data generated by the backtranslation method as Wiki (Backtranslation).

5.2 Model

We used Transformer as our baseline model. Transformer offers excellent performance in sequence-to-sequence tasks, such as machine translation, and has been widely adopted in recent studies on English GEC [9, 14].

A BERT-based pre-trained model only uses the encoder of Transformer; therefore, it cannot be directly applied to sequence-to-sequence tasks that require both an encoder and a decoder, such as GEC. Hence, we initialized the encoder of Transformer with the parameters of Chinese BERT; the decoder is initialized randomly. Finally, we train the initialized model on Chinese GEC data.
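Abstractly, this initialization copies every encoder parameter whose name matches a BERT parameter and leaves the rest random. The dictionary-based sketch below is our own illustration; real code would operate on framework state dicts, e.g., PyTorch `state_dict()` objects:

```python
def init_from_bert(transformer_params, bert_params):
    """Warm-start the encoder of a full Transformer from BERT weights.
    Parameters are represented as flat name -> value mappings."""
    prefix = "encoder."
    initialized = {}
    for name, value in transformer_params.items():
        bert_name = name[len(prefix):] if name.startswith(prefix) else None
        if bert_name is not None and bert_name in bert_params:
            initialized[name] = bert_params[bert_name]  # copied from BERT
        else:
            initialized[name] = value  # decoder keeps its random init
    return initialized
```

After this warm start, the whole model (encoder and decoder) is trained end-to-end on the Chinese GEC data.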

As for Chinese T5 and Chinese BART, because they are both encoder–decoder architectures, we could fine-tune them on the Chinese GEC dataset.

Finally, we have the following models trained on different data:

  • Baseline: A plain Transformer model that is initialized randomly without using a pre-trained model. This model is trained on the original Lang-8 data.

  • BERT-encoder, T5, BART: The models fine-tuned on the original Lang-8 data.

  • Baseline, BERT-encoder, T5, BART + Lang-8 (MaskGEC): The models are fine-tuned on Lang-8 (MaskGEC) pseudo data.

  • Baseline, BERT-encoder, T5, BART + Wiki (MaskGEC): The models are first warmed up on Wiki (MaskGEC) pseudo data until convergence and then fine-tuned on Lang-8 (MaskGEC) pseudo data. We adopt this setting to avoid a mismatch in the appearance of the [MASK] token between the two fine-tuning steps.

  • Baseline, BERT-encoder, T5, BART + Wiki (Backtranslation): The models are first warmed up on Wiki (Backtranslation) pseudo data until convergence and then fine-tuned on the original Lang-8 data.

We implement the baseline, BERT-encoder, T5, and BART models based on the following projects, respectively: awesome-transformer,7 BERT-encoder-ChineseGEC,8 t5-pegasus-chinese9 and CPT.10 Readers can refer to these URLs and the supplementary materials for more implementation details.

5.3 Evaluation

As the evaluation is performed on word units, we strip all delimiters from the system output sentences and segment them using the pkunlp tool provided in the NLPCC 2018 GEC shared task.

Based on the setup of the NLPCC 2018 GEC shared task, the evaluation is conducted using MaxMatch (\(M^2\)). The MaxMatch algorithm computes phrase-level edits between the source sentence and the system output and then finds the overlaps between the system edits and the gold edits.
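The final scores reduce to precision, recall, and \(\mathrm{F_{0.5}}\) over matched edits. The sketch below is a simplification assuming a single gold annotation; the real \(M^2\) scorer additionally maximizes the edit overlap over alternative gold annotations:

```python
def f_beta(precision, recall, beta=0.5):
    """F_beta score; beta=0.5 weights precision twice as much as recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def prf_from_edits(system_edits, gold_edits):
    """Compare edits represented as (start, end, correction) tuples and
    return precision, recall, and F_0.5 over the exact-match overlap."""
    tp = len(system_edits & gold_edits)  # edits matched exactly
    p = tp / len(system_edits) if system_edits else 0.0
    r = tp / len(gold_edits) if gold_edits else 0.0
    return p, r, f_beta(p, r)
```

Because \(\beta =0.5\) favors precision, a system that proposes few but accurate edits can outscore one with higher recall, which matches the conservative behavior rewarded by the shared task.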

5.4 Evaluation Results

Table 2 summarizes the experimental results of our models. We run the single models three times and report the average score. For comparison, we also cite the results of recent works [8, 39] as well as those of the models developed by two teams [7, 26] in the NLPCC 2018 GEC shared task.

Table 2. Experimental Results on the NLPCC 2018 GEC Shared Task

Original Data              P      R      F0.5     Lang-8 (MaskGEC)      P      R      F0.5
Baseline                   37.78  16.99  30.23    Baseline              36.46  23.03  32.66
BERT-encoder               39.78  20.84  33.66    BERT-encoder          36.02  24.10  32.77
T5                         41.61  20.22  34.34    T5                    39.07  24.10  34.73
BART                       39.50  30.01  37.15    BART                  41.08  32.18  38.93
Hinson et al. [8]          36.79  27.82  34.56    Zhao and Wang [39]
Fu et al. [7]              35.24  18.64  29.91      dynamic masking     44.36  22.18  36.97
Ren et al. [26]            41.73  13.08  29.02      static masking      43.73  21.71  36.35
Ren et al. [26] (4-ens)    47.63  12.56  30.57

Wiki (Backtrans.)          P      R      F0.5     Wiki (MaskGEC)        P      R      F0.5
Baseline                   36.28  21.14  31.74    Baseline              34.03  21.18  30.33
BERT-encoder               37.66  22.76  33.29    BERT-encoder          35.23  24.50  32.39
T5                         42.61  20.56  35.08    T5                    39.35  24.84  35.23
BART                       40.32  30.68  37.94    BART                  41.41  31.79  39.05

The upper rows of the first table show the results of our models, and the lower rows show those of previous works, for training on the original data and on Lang-8 (MaskGEC) pseudo data. The second table presents the results of our models trained on Wiki (Backtranslation) and Wiki (MaskGEC) pseudo data.

For the models trained on the original data, all our models using pre-trained models outperform both the baseline model and the two teams from the NLPCC 2018 GEC shared task, indicating the effectiveness of adopting pre-trained models. Moreover, the BART model yields an \(\mathrm{F_{0.5}}\) score roughly seven points higher than the baseline and achieves the best result among all models. This indicates the effectiveness of BART, owing to its larger parameter count, larger pre-training data, and a pre-training task better suited to GEC.

Among the two recent related works, Zhao and Wang [39] balanced precision and recall well, achieving a comparatively high \(\mathrm{F_{0.5}}\) score, and Hinson et al. [8] achieved comparatively high recall. However, our BART model still exceeds their results. This agrees with Katsumata and Komachi [13], who showed that BART is a simple but strong baseline for English, German, and Czech.13

When the models are combined with pseudo data, almost all models (except BERT-encoder) benefit from the Lang-8 (MaskGEC) pseudo data. These results confirm the effectiveness of the method of Zhao and Wang [39], which uses only Lang-8 to generate pseudo data by a rule-based method. Comparing models trained on Lang-8 (MaskGEC) and Wiki (MaskGEC), we find no significant differences, although the latter uses 10\(\times\) more pseudo data. Comparing Wiki (MaskGEC) and Wiki (Backtranslation), the results are mixed: MaskGEC is better for T5 and BART, whereas backtranslation is better for the baseline and BERT-encoder. Considering the training cost and final performance, incorporating pre-trained models with the Lang-8 (MaskGEC) pseudo data is the ideal setting at the present stage.


6 ANALYSIS

To thoroughly understand the performance of the models in the Chinese GEC setting, we conduct qualitative and quantitative analyses.

6.1 System Output

Table 3 presents sample outputs of the models trained in our experiments.

Table 3. Source Sentence, Gold Edit, and Output from Baseline, BERT, T5, and BART Models

In the first example, the spelling error 持别 is accurately corrected to 特别 (which means especially) by all the pre-trained models, whereas it is not corrected by the baseline model. This example shows the effectiveness of pre-trained models for handling easy errors.

In the second example, the models were required to change the word order (没有说服 which means didn’t convince), delete the redundant word 对 (to) and insert the missing word 的 (of). T5 and BART perform well in this case; their outputs are nearly the same as the gold correction, except for inserting the missing word. This may be because T5 and BART have a pre-trained decoder that makes the output more fluent.

In the third example, the output of T5 differs from the others: it copies the original source sentence and appends its correction after it. This is because the Lang-8 training data contain noise in which some native speakers copy the original erroneous sentence and append their corrections or comments after it [18]; T5, being pre-trained on a summarization task, is sensitive to length changes and tends to pick up these patterns. Compared with T5, BART gives an ideal correction that is almost the same as the gold correction. Considering that BART is pre-trained with a DAE task, which is similar to the GEC task, we conclude that one should select pre-trained models whose pre-training task resembles the downstream task.

In the last example, we observe that BART rewrites the sentence, and its output is more fluent than the gold correction. The meaning of BART's output is Happiness is good medicine that can cure millions of diseases. Compared with mood can be effective for diseases in the gold correction, good medicine can cure diseases is more fluent. Moreover, 情绪 (mood) appears twice in the gold correction, making the whole sentence verbose. However, this type of fluent change may hurt precision because the gold corrections follow the principle of minimum edit distance [38]. This motivates us to propose and evaluate models on a new Chinese GEC dataset, such as that of Wang et al. [32], which can evaluate fluent changes appropriately.

6.2 Error Type

To understand the error distribution of Chinese GEC, we annotated 66 sentences of the development data and obtained 100 errors (one sentence may contain more than one error). We referred to the annotation scheme of the HSK learner corpus14 and adopted its eight most frequent error categories: B, CC, CD, CQ, CJwo, CJ+, CJ-, and CJetc. B denotes character-level errors, which are primarily spelling errors. CC, CD, and CQ are word-level errors: word selection, redundant word, and missing word errors, respectively. CJ denotes sentence-level errors, which cover several complex error types: CJwo is word order errors, CJ+ and CJ- are redundant and missing sentence constituent errors, and CJetc covers other sentence-level errors such as wrong usage of 有 (there is) and 是 (is). Examples are presented in Table 4. Based on the error counts, word-level errors (CC, CQ, and CD) are clearly the most frequent.

Table 4. Examples of Each Error Type

Error Type                      Freq.  Example
B (spell)                       8      关主{关注} 天气 预报 。 (Pay attention to the weather forecast.)
CC (word selection)             28     古书店{旧书店} , 买 了 十 本 书 。 (Bought ten books at a second-hand bookstore.)
CD (redundant word)             8      我 很 喜欢 {NONE}读 小说 。 (I like to read novels.)
CQ (missing word)               24     在 上海 我 总是 住 NONE{} 一家 特定 NONE{} 酒店 。 (I always stay in the same hotel in Shanghai.)
CJwo (word order)               10     我 决定 学习 努力{努力 学习} 。 (I decided to study hard.)
CJ+ (redundant constituents)    3      去年 我 到 克拉克夫 来了{NONE} 读书 。 (I went to study in Krakow last year.)
CJ- (missing constituents)      9      NONE{打算} 在 夏天 好好学 汉语 。 (I plan to study Chinese hard in the summer.)
CJetc (other sentence-level)    10     {今年} 23 岁 。 (I am 23 years old.)

The underlined tokens are detected errors that should be replaced with the tokens in braces.

Figure 1 presents the correction results of the three pre-trained models for each error type. For simplicity, we report the recall score, which reflects the proportion of gold edits reproduced by the systems. The results indicate that BART offers the best performance among the three pre-trained models on every error type, consistent with the evaluation results in Section 5.4 and again showing the effectiveness of BART. All systems achieve comparatively high scores on B (spelling errors) and CJetc (other sentence-level errors), showing that these two error types are relatively easy for the systems. By contrast, the systems perform poorly on CC (word selection errors), the most frequent error type among all, so we conclude that CC is the most crucial error type and must be addressed in future work. We also observe that T5 performs better than BERT on CQ (missing word errors) and CJwo (word order errors), possibly because T5 has a pre-trained decoder and can thus handle errors related to word insertion and word order more effectively.

Fig. 1. Recall score of three pre-trained models on each error type.


7 CONCLUSION

In this study, we developed Chinese grammatical error correction (GEC) models based on three pre-trained models: Chinese BERT, Chinese T5, and Chinese BART. Among these models, Chinese BART achieved state-of-the-art results. The experimental results demonstrated the usefulness of pre-trained models on the Chinese GEC task. We combined the pre-trained model with pseudo data and found that the BART+Lang-8 (MaskGEC) was the ideal setting in terms of accuracy and training efficiency. Additionally, the error type analysis showed that word selection errors remain to be addressed.

The majority of the methods proposed in the NLPCC 2018 GEC shared task are simply based on the methods of English GEC; however, Chinese GEC has its own characteristics. For example, spelling errors primarily arise from the similarity of the glyph and pronunciation, and sentence-level errors often depend on word position. Therefore, we plan to study and improve the Chinese GEC system while considering these characteristics, using methods such as incorporating the glyph and pronunciation information into the system [35], or adopting the neural model whose positional embeddings can capture word order more efficiently [31].


ACKNOWLEDGMENTS

We would like to thank all editors and reviewers for their constructive comments and kind help.

Footnotes

1. https://lang-8.com/.
2. https://lang-8.com/.
3. https://github.com/google-research/bert.
4. https://dumps.wikimedia.org/zhwiki/latest/.
5. https://github.com/brightmart/nlp_chinese_corpus.
6. https://github.com/howl-anderson/chinese-wikipedia-corpus-creator.
7. https://github.com/ictnlp/awesome-transformer.
8. https://github.com/wang136906578/BERT-encoder-ChineseGEC.
9. https://github.com/SunnyGJing/t5-pegasus-chinese.
10. https://github.com/fastnlp/CPT.
11. http://59.108.48.12/lcwm/pkunlp/downloads/libgrass-ui.tar.gz.
12. https://github.com/nusnlp/m2scorer.
13. Note that they did not conduct an experiment with Chinese, and they reported that the pre-trained model alone (without any pseudo data) did not yield satisfactory performance for Russian.
14. http://hsk.blcu.edu.cn/.

REFERENCES

[1] Cao Yongchang, He Liang, Ridley Robert, and Dai Xinyu. 2020. Integrating BERT and score-based feature gates for Chinese grammatical error diagnosis. In NLPTEA. 49–56.
[2] Chen Mengyun, Ge Tao, Zhang Xingxing, Wei Furu, and Zhou Ming. 2020. Improving the efficiency of grammatical error correction with erroneous span detection and correction. In EMNLP. 7162–7169.
[3] Chollampatt Shamil and Ng Hwee Tou. 2018. A multilayer convolutional encoder-decoder neural network for grammatical error correction. In AAAI. 5755–5762.
[4] Cui Yiming, Che Wanxiang, Liu Ting, Qin Bing, Wang Shijin, and Hu Guoping. 2020. Revisiting pre-trained models for Chinese natural language processing. In Findings of EMNLP.
[5] Devlin Jacob, Chang Ming-Wei, Lee Kenton, and Toutanova Kristina. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT. 4171–4186.
[6] Fang Meiyuan, Fu Kai, Wang Jiping, Liu Yang, Huang Jin, and Duan Yitao. 2020. A hybrid system for NLPTEA-2020 CGED shared task. In NLPTEA. 67–77.
[7] Fu Kai, Huang Jin, and Duan Yitao. 2018. Youdao's winning solution to the NLPCC-2018 task 2 challenge: A neural machine translation approach to Chinese grammatical error correction. In NLPCC. 341–350.
[8] Hinson Charles, Huang Hen-Hsen, and Chen Hsin-Hsi. 2020. Heterogeneous recycle generation for Chinese grammatical error correction. In COLING. 2191–2201.
[9] Junczys-Dowmunt Marcin, Grundkiewicz Roman, Guha Shubha, and Heafield Kenneth. 2018. Approaching neural grammatical error correction as a low-resource machine translation task. In NAACL-HLT. 595–606.
[10] Kaneko Masahiro, Hotate Kengo, Katsumata Satoru, and Komachi Mamoru. 2019. TMU transformer system using BERT for re-ranking at BEA 2019 grammatical error correction on restricted track. In BEA. 207–212.
[11] Kaneko Masahiro, Mita Masato, Kiyono Shun, Suzuki Jun, and Inui Kentaro. 2020. Encoder-decoder models can benefit from pre-trained masked language models in grammatical error correction. In ACL. 4248–4254.
[12] Kantor Yoav, Katz Yoav, Choshen Leshem, Cohen-Karlik Edo, Liberman Naftali, Toledo Assaf, Menczel Amir, and Slonim Noam. 2019. Learning to combine grammatical error corrections. In BEA. 139–148.
[13] Katsumata Satoru and Komachi Mamoru. 2020. Stronger baselines for grammatical error correction using a pretrained encoder-decoder model. In AACL-IJCNLP. 827–832.
[14] Kiyono Shun, Suzuki Jun, Mita Masato, Mizumoto Tomoya, and Inui Kentaro. 2019. An empirical study of incorporating pseudo data into grammatical error correction. In EMNLP-IJCNLP. 1236–1242.
[15] Lewis Mike, Liu Yinhan, Goyal Naman, Ghazvininejad Marjan, Mohamed Abdelrahman, Levy Omer, Stoyanov Veselin, and Zettlemoyer Luke. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In ACL. 7871–7880.
[16] Liang Deng, Zheng Chen, Guo Lei, Cui Xin, Xiong Xiuzhang, Rong Hengqiao, and Dong Jinpeng. 2020. BERT enhanced neural machine translation and sequence tagging model for Chinese grammatical error diagnosis. In NLPTEA. 57–66.
[17] Liu Y., Ott Myle, Goyal Naman, Du Jingfei, Joshi Mandar, Chen Danqi, Levy Omer, Lewis M., Zettlemoyer Luke, and Stoyanov Veselin. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv abs/1907.11692 (2019), 13 pages.
[18] Mizumoto Tomoya, Komachi Mamoru, Nagata Masaaki, and Matsumoto Yuji. 2011. Mining revision log of language learning SNS for automated Japanese error correction of second language learners. In IJCNLP. 147–155.
[19] Nagata Ryo and Sakaguchi Keisuke. 2016. Phrase structure annotation and parsing for learner English. In ACL. 1837–1847.
[20] Náplava Jakub and Straka Milan. 2019. Grammatical error correction in low-resource scenarios. In W-NUT. 346–356.
[21] Ng Hwee Tou, Wu Siew Mei, Briscoe Ted, Hadiwinoto Christian, Susanto Raymond Hendy, and Bryant Christopher. 2014. The CoNLL-2014 shared task on grammatical error correction. In CoNLL. 1–14.
[22] Ng Hwee Tou, Wu Siew Mei, Wu Yuanbin, Hadiwinoto Christian, and Tetreault Joel. 2013. The CoNLL-2013 shared task on grammatical error correction. In CoNLL. 1–12.
[23] Omelianchuk Kostiantyn, Atrasevych Vitaliy, Chernodub Artem, and Skurzhanskyi Oleksandr. 2020. GECToR – grammatical error correction: Tag, not rewrite. In BEA. 163–170.
[24] Qiu Zhaoquan and Qu Youli. 2019. A two-stage model for Chinese grammatical error correction. IEEE Access 7 (2019), 146772–146777.
[25] Raffel Colin, Shazeer Noam, Roberts Adam, Lee Katherine, Narang Sharan, Matena Michael, Zhou Yanqi, Li Wei, and Liu Peter J. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR 21, 140 (2020), 1–67.
[26] Ren Hongkai, Yang Liner, and Xun Endong. 2018. A sequence to sequence learning for Chinese grammatical error correction. In NLPCC. 401–410.
[27] Sennrich Rico, Haddow Barry, and Birch Alexandra. 2016. Improving neural machine translation models with monolingual data. In ACL. 86–96.
[28] Shao Yunfan, Geng Zhichao, Liu Yitao, Dai Junqi, Yang Fei, Zhe Li, Bao Hujun, and Qiu Xipeng. 2021. CPT: A pre-trained unbalanced transformer for both Chinese language understanding and generation. arXiv abs/2109.05729 (2021), 9 pages.
[29] Su Jianlin. 2021. T5 PEGASUS - ZhuiyiAI. Technical Report. https://github.com/ZhuiyiTechnology/t5-pegasus.
[30] Sun Xin, Ge Tao, Wei Furu, and Wang Houfeng. 2021. Instantaneous grammatical error correction with shallow aggressive decoding. In ACL-IJCNLP. 5937–5947.
[31] Wang Benyou, Zhao Donghao, Lioma Christina, Li Qiuchi, Zhang Peng, and Simonsen Jakob Grue. 2020. Encoding word order in complex embeddings. In ICLR. 15 pages.
[32] Wang Yingying, Kong Cunliang, Yang Liner, Wang Yijun, Lu Xiaorong, Hu Renfen, He Shan, Liu Zhenghao, Chen Yuxiang, Yang Erhong, and Sun Maosong. 2021. YACLC: A Chinese learner corpus with multidimensional annotation. arXiv abs/2112.15043 (2021).
[33] Wang Yi, Yuan Ruibin, Luo Yan'gen, Qin Yufang, Zhu NianYong, Cheng Peng, and Wang Lihuan. 2020. Chinese grammatical error correction based on hybrid models with data augmentation. In BEA. 78–86.
[34] Xie Ziang, Genthial Guillaume, Xie Stanley, Ng Andrew, and Jurafsky Dan. 2018. Noising and denoising natural language: Diverse backtranslation for grammar correction. In NAACL-HLT. 619–628.
[35] Xu Heng-Da, Li Zhongli, Zhou Qingyu, Li Chao, Wang Zizhen, Cao Yunbo, Huang Heyan, and Mao Xian-Ling. 2021. Read, listen, and see: Leveraging multimodal information helps Chinese spell checking. In Findings of ACL. 13 pages.
[36] Yuan Sha, Zhao Hanyu, Du Zhengxiao, Ding Ming, Liu Xiao, Cen Yukuo, Zou Xu, Yang Zhilin, and Tang Jie. 2021. WuDaoCorpora: A super large-scale Chinese corpora for pre-training language models. AI Open 2 (2021), 65–68.
[37] Zhang Jingqing, Zhao Yao, Saleh Mohammad, and Liu Peter J. 2020. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In ICML. 12 pages.
[38] Zhao Yuanyuan, Jiang Nan, Sun Weiwei, and Wan Xiaojun. 2018. Overview of the NLPCC 2018 shared task: Grammatical error correction. In NLPCC. 439–445.
[39] Zhao Zewei and Wang Houfeng. 2020. MaskGEC: Improving neural grammatical error correction via dynamic masking. In AAAI. 1226–1233.


Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 3 (March 2023), 570 pages. ISSN: 2375-4699; EISSN: 2375-4702; DOI: 10.1145/3579816. Publisher: Association for Computing Machinery, New York, NY, United States.

Publication History

• Received: 25 July 2021
• Revised: 29 July 2022
• Accepted: 11 October 2022
• Online AM: 2 November 2022
• Published: 10 March 2023
