
Text Polishing with Chinese Idiom: Task, Datasets and Pre-trained Baselines

Published: 19 June 2023

Abstract

This work presents the task of text polishing, which generates a sentence that is more graceful than the input sentence while retaining its semantic meaning. Text polishing has great practical value and is an important component of modern writing assistance systems. However, the task is still not well studied in the literature. Further research in this important direction requires more formal task definitions, benchmark datasets, and powerful baseline models. In this work, we formulate the task as a context-dependent text generation problem and conduct a case study on text polishing with Chinese idioms. To circumvent the difficulties of task data annotation, we propose a semi-automatic data construction pipeline based on human-machine collaboration and establish a large-scale text polishing dataset consisting of 1.5 million instances. We propose two types of task-specific pre-training objectives for the text polishing task and implement a series of Transformer-based models pre-trained on a massive Chinese corpus as baselines. We conduct extensive experiments with the baseline models on the constructed text polishing datasets and report several major findings. A human evaluation further reveals the polishing ability of the final system.
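
The abstract describes framing text polishing as context-dependent text generation with Transformer-based pre-trained models. The paper's released pipeline and checkpoints are not reproduced here; the minimal Python sketch below only illustrates one way such a formulation could look. The checkpoint name (fnlp/bart-base-chinese), the [MASK]-based encoding of the span to be polished, and the decoding settings are illustrative assumptions, not the authors' actual setup.

    # Illustrative sketch only: frame text polishing as rewriting a sentence
    # whose to-be-polished span is marked, conditioned on the surrounding
    # context. The checkpoint and the span-marking convention are assumptions.
    from transformers import BertTokenizer, BartForConditionalGeneration

    MODEL_NAME = "fnlp/bart-base-chinese"  # assumed Chinese seq2seq checkpoint

    tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
    model = BartForConditionalGeneration.from_pretrained(MODEL_NAME)

    def polish(sentence_with_blank: str, max_new_tokens: int = 48) -> str:
        """Generate a rewrite of a sentence whose to-be-polished span is
        replaced by the tokenizer's mask token (one possible task encoding)."""
        input_ids = tokenizer(sentence_with_blank, return_tensors="pt").input_ids
        output_ids = model.generate(
            input_ids,
            num_beams=4,
            max_new_tokens=max_new_tokens,
        )
        return tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # Example: the masked span marks where a more graceful expression, such as
    # a four-character idiom, could be generated given the context.
    print(polish("他做事[MASK]，从不拖泥带水。"))

A task-specific pre-training objective of the kind the abstract mentions could, for instance, mask idiom spans in a large corpus and train the model to restore them in context, although the paper's two objectives may differ from this sketch.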

      • Published in

        ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 6
        June 2023
        635 pages
        ISSN: 2375-4699
        EISSN: 2375-4702
        DOI: 10.1145/3604597

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 19 June 2023
        • Online AM: 21 April 2023
        • Accepted: 11 April 2023
        • Revised: 9 January 2023
        • Received: 31 August 2022
        Published in TALLIP Volume 22, Issue 6


        Qualifiers

        • research-article
