short-paper

Automated Generation of Human-readable Natural Arabic Text from RDF Data

Published: 25 March 2023

Abstract

With the advances in Natural Language Processing (NLP), the industry has been moving towards human-directed artificial intelligence (AI) solutions. Recently, chatbots and automated news generation have captured a lot of attention. The goal is to automatically generate readable text from tabular data or web data, commonly represented in the Resource Description Framework (RDF) format. The problem can then be formulated as data-to-text (D2T) generation: converting structured, non-linguistic data into human-readable natural language. Despite the significant work done for English, little effort has been directed towards low-resource languages such as Arabic. This work promotes the development of the first RDF data-to-text (D2T) generation system for the Arabic language while addressing the low-resource limitation. We develop several models for the Arabic D2T task using transfer learning from large language models (LLMs) such as AraBERT, AraGPT2, and mT5. These models include a baseline Bi-LSTM sequence-to-sequence (Seq2Seq) model as well as encoder-decoder transformers such as BERT2BERT, BERT2GPT, and T5. We then provide a detailed comparative study highlighting the strengths and limitations of these methods, setting the stage for further advancement in the field. We also introduce a new Arabic dataset (AraWebNLG) that can be used to develop new models in the field. To ensure a comprehensive evaluation, we use general-purpose automated metrics (BLEU and perplexity) as well as task-specific human evaluation metrics covering the accuracy of content selection and the fluency of the generated text. The results highlight the importance of pre-training on a large corpus of Arabic data and show that transfer learning from AraBERT gives the best performance. Text-to-text pre-training using mT5 achieves the second-best results even with multilingual weights.
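
To make the task setup concrete, the sketch below shows how a set of RDF triples is typically linearized into a flat input string for a sequence-to-sequence model, in the style of WebNLG-derived corpora such as AraWebNLG. The `<S>`/`<P>`/`<O>` separator tokens and the example triples are illustrative assumptions, not necessarily this paper's exact preprocessing.

```python
# Minimal sketch of RDF-triple linearization for a Seq2Seq D2T model.
# The <S>/<P>/<O> separators are an assumed convention common in
# WebNLG-style pipelines, not necessarily the paper's exact scheme.

def linearize_triples(triples):
    """Flatten (subject, predicate, object) triples into one input string."""
    return " ".join(f"<S> {s} <P> {p} <O> {o}" for s, p, o in triples)

# Hypothetical example: two triples about the same entity become a
# single encoder input; the decoder is trained to produce the
# corresponding natural-language sentence (Arabic, in this paper).
triples = [
    ("Aarhus_Airport", "location", "Tirstrup"),
    ("Aarhus_Airport", "runwayLength", "2777.0"),
]
print(linearize_triples(triples))
# <S> Aarhus_Airport <P> location <O> Tirstrup <S> Aarhus_Airport <P> runwayLength <O> 2777.0
```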

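The best-performing models warm-start an encoder-decoder transformer from pre-trained checkpoints, following the recipe of Rothe et al. [24]. A minimal sketch of a BERT2BERT setup with the Hugging Face `transformers` library is shown below; the AraBERT checkpoint name, special-token wiring, and decoding settings are assumptions for illustration, not the authors' reported configuration.

```python
# Hedged sketch: warm-starting a BERT2BERT encoder-decoder from AraBERT
# weights (in the style of Rothe et al. [24]). The checkpoint name and
# settings are illustrative assumptions, not the paper's configuration.
from transformers import AutoTokenizer, EncoderDecoderModel

ckpt = "aubmindlab/bert-base-arabertv2"  # assumed AraBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)

# Encoder and decoder share the same pre-trained BERT weights; the
# decoder's cross-attention layers start randomly initialized and are
# learned while fine-tuning on (linearized triples -> text) pairs.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(ckpt, ckpt)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# After fine-tuning, generate text from a linearized triple set:
source = "<S> Aarhus_Airport <P> location <O> Tirstrup"
inputs = tokenizer(source, return_tensors="pt")
output_ids = model.generate(inputs.input_ids, max_length=128, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

An mT5 variant needs no warm-start surgery: since T5-style models are already encoder-decoders, one would load `MT5ForConditionalGeneration.from_pretrained("google/mt5-base")` and fine-tune it directly on the same linearized inputs.
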
REFERENCES

[1] Wissam Antoun, Fady Baly, and Hazem Hajj. 2020. AraBERT: Transformer-based model for Arabic language understanding. In LREC 2020 Workshop on Language Resources and Evaluation, 11–16 May 2020. 9.
[2] Wissam Antoun, Fady Baly, and Hazem Hajj. 2021. AraGPT2: Pre-trained transformer for Arabic language generation. In Proceedings of the 6th Arabic Natural Language Processing Workshop. Association for Computational Linguistics, Kyiv (Virtual), 196–207.
[3] Christian Bizer. 2009. The emerging web of linked data. IEEE Intelligent Systems 24, 5 (2009), 87–92.
[4] Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ - A large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, 5016–5026.
[5] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 8440–8451.
[6] Kareem Darwish and Hamdy Mubarak. 2016. Farasa: A new fast and accurate Arabic word segmenter. In Proceedings of the 10th International Conference on Language Resources and Evaluation. 1070–1074.
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, 4171–4186.
[8] Obeida ElJundi, Wissam Antoun, Nour El Droubi, Hazem Hajj, Wassim El-Hajj, and Khaled Shaban. 2019. hULMonA: The universal language model in Arabic. In Proceedings of the 4th Arabic Natural Language Processing Workshop. 68–77.
[9] Thiago Castro Ferreira, Claire Gardent, Nikolai Ilinykh, Chris van der Lee, Simon Mille, Diego Moussallem, and Anastasia Shimorina. 2020. The 2020 bilingual, bi-directional WebNLG+ shared task: Overview and evaluation results (WebNLG+ 2020). In Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web.
[10] Thiago Castro Ferreira, Chris van der Lee, Emiel van Miltenburg, and Emiel Krahmer. 2019. Neural data-to-text generation: A comparison between pipeline and end-to-end architectures. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, 552–562.
[11] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. The WebNLG challenge: Generating text from RDF data. In Proceedings of the 10th International Conference on Natural Language Generation. 124–133.
[12] Mihir Kale and Abhinav Rastogi. 2020. Text-to-text pre-training for data-to-text tasks. In Proceedings of the 13th International Conference on Natural Language Generation. Association for Computational Linguistics, Dublin, 97–102. https://aclanthology.org/2020.inlg-1.14.
[13] Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations. Association for Computational Linguistics, Vancouver, 67–72. https://aclanthology.org/P17-4012.
[14] Xintong Li, Aleksandre Maskharashvili, Symon Jory Stevens-Guille, and Michael White. 2020. Leveraging large pretrained models for WebNLG 2020. In Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web. 117–124.
[15] Amit Moryossef, Yoav Goldberg, and Ido Dagan. 2019. Step-by-step: Separating planning from realization in neural data-to-text generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, 2267–2277.
[16] Abelardo Vieira Mota, Ticiana Linhares Coelho da Silva, and José Antônio Fernandes de Macêdo. 2020. Template-based multi-solution approach for data-to-text generation. In Proceedings of the European Conference on Advances in Databases and Information Systems. Springer, 157–170.
[17] Ankur Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and Dipanjan Das. 2020. ToTTo: A controlled table-to-text generation dataset. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 1173–1186.
[18] Ratish Puduppully, Li Dong, and Mirella Lapata. 2019. Data-to-text generation with content selection and planning. In Proceedings of the AAAI Conference on Artificial Intelligence. 6908–6915.
[19] Yevgeniy Puzikov and Iryna Gurevych. 2018. E2E NLG challenge: Neural models vs. templates. In Proceedings of the 11th International Conference on Natural Language Generation. 463–471.
[20] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019), 9.
[21] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, 140 (2020), 1–67.
[22] Clément Rebuffel, Laure Soulier, Geoffrey Scoutheeten, and Patrick Gallinari. 2020. A hierarchical model for data-to-text generation. In Advances in Information Retrieval (LNCS 12035). Springer, 65–80.
[23] Leonardo F. R. Ribeiro, Martin Schmitt, Hinrich Schütze, and Iryna Gurevych. 2021. Investigating pretrained language models for graph-to-text generation. In Proceedings of the 3rd Workshop on Natural Language Processing for Conversational AI. Association for Computational Linguistics, Online, 211–227.
[24] Sascha Rothe, Shashi Narayan, and Aliaksei Severyn. 2020. Leveraging pre-trained checkpoints for sequence generation tasks. Transactions of the Association for Computational Linguistics 8 (2020), 264–280.
[25] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems. 5998–6008.
[26] Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: A free collaborative knowledgebase. Communications of the ACM 57, 10 (2014), 78–85.
[27] Lesly Miculicich Werlen, Marc Marone, and Hany Hassan. 2019. Selecting, planning, and rewriting: A modular approach for data-to-document generation and translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation. 289–296.
[28] Sam Wiseman, Stuart Shieber, and Alexander Rush. 2017. Challenges in data-to-document generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, 2253–2263.
[29] Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 483–498.
[30] Rong Ye, Wenxian Shi, Hao Zhou, Zhongyu Wei, and Lei Li. 2020. Variational template machine for data-to-text generation. In International Conference on Learning Representations. https://openreview.net/forum?id=HkejNgBtPB.

Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 4, April 2023, 682 pages.
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3588902

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 12 June 2022
• Accepted: 16 January 2023
• Online AM: 25 January 2023
• Published: 25 March 2023
