Text Polishing with Chinese Idiom: Task, Datasets and Pre-trained Baselines

Abstract
This work presents the task of text polishing, which aims to generate a sentence that is more graceful than the input sentence while retaining its semantic meaning. Text polishing has great practical value and is an important component of modern writing assistance systems. However, the task is still not well studied in the literature. Further research in this important direction requires a more formal task definition, benchmark datasets, and powerful baseline models. In this work, we formulate the task as a context-dependent text generation problem and conduct a case study on text polishing with Chinese idioms. To circumvent the difficulties of task data annotation, we propose a semi-automatic data construction pipeline based on human-machine collaboration and build a large-scale text polishing dataset consisting of 1.5 million instances. We propose two types of task-specific pre-training objectives for the text polishing task and implement a series of Transformer-based models pre-trained on a massive Chinese corpus as baselines. We conduct extensive experiments with the baseline models on the constructed datasets and report several major findings. A human evaluation further demonstrates the polishing ability of the final system.
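The abstract mentions task-specific pre-training objectives without detailing them. As a rough illustration only, the following is a minimal sketch of what the data side of one plausible objective could look like: an idiom-infilling setup in which the idiom span is masked in the source and the model is trained to reconstruct the idiom-bearing sentence from its context. The idiom lexicon, mask token, and pairing scheme below are illustrative assumptions, not the authors' released method.

```python
# A minimal sketch (not the paper's code) of an idiom-infilling
# pre-training pair: mask the idiom span in a sentence and use the
# original sentence as the reconstruction target for a seq2seq model.

IDIOM_LEXICON = {"废寝忘食", "一丝不苟"}  # stand-in for a full chengyu lexicon
MASK = "[MASK]"

def make_infilling_pair(sentence: str):
    """Return a (source, target) pair for idiom infilling, or None
    if the sentence contains no idiom from the lexicon."""
    for idiom in IDIOM_LEXICON:
        if idiom in sentence:
            # Source: the context with the idiom masked out.
            # Target: the original idiom-bearing sentence.
            return sentence.replace(idiom, MASK, 1), sentence
    return None

pair = make_infilling_pair("他为了准备考试废寝忘食。")
if pair:
    source, target = pair
    print(source)  # 他为了准备考试[MASK]。
    print(target)  # 他为了准备考试废寝忘食。
```

A seq2seq model pre-trained on such pairs would learn to propose an idiom that fits the surrounding context, which is consistent with the context-dependent formulation described above.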