Abstract
With advances in Natural Language Processing (NLP), industry has been moving towards human-directed artificial intelligence (AI) solutions. Recently, chatbots and automated news generation have attracted considerable attention. The goal is to automatically generate readable text from tabular data or web data, commonly represented in the Resource Description Framework (RDF) format. The problem can then be formulated as data-to-text (D2T) generation: transforming structured non-linguistic data into human-readable natural language. Despite the significant work done for English, few efforts have been directed towards low-resource languages such as Arabic. This work promotes the development of the first RDF D2T generation system for Arabic while addressing the low-resource limitation. We develop several models for the Arabic D2T task using transfer learning from large language models (LLMs) such as AraBERT, AraGPT2, and mT5. These models include a baseline Bi-LSTM sequence-to-sequence (Seq2Seq) model, as well as encoder-decoder transformers such as BERT2BERT, BERT2GPT, and T5. We then provide a detailed comparative study highlighting the strengths and limitations of these methods, setting the stage for further advancement in the field. We also introduce a new Arabic dataset (AraWebNLG) that can be used to develop new models in the field. To ensure a comprehensive evaluation, general-purpose automated metrics (BLEU and perplexity) are used, as well as task-specific human evaluation metrics covering the accuracy of content selection and the fluency of the generated text. The results highlight the importance of pre-training on a large corpus of Arabic data and show that transfer learning from AraBERT gives the best performance. Text-to-text pre-training using mT5 achieves the second-best performance, even with multilingual weights.
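The automatic evaluation described above relies on BLEU. As a minimal, self-contained illustration only (not the paper's implementation, which would typically use a standard toolkit such as sacreBLEU), sentence-level BLEU against a single reference can be sketched as follows; the add-one smoothing is an assumption made here so that a missing n-gram order does not collapse the score to zero:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Smoothed sentence-level BLEU with one reference and a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # Clipped n-gram overlap: each candidate n-gram counts at most
        # as often as it appears in the reference.
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        # Add-one smoothing (assumption, for short sentences).
        precisions.append((overlap + 1) / (total + 1))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / max(len(candidate), 1))
    # Geometric mean of the modified n-gram precisions.
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

For example, `bleu("a x c".split(), "a b c".split())` scores lower than the identical-sentence case, since the mismatched token eliminates all bigram and trigram overlap.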
- [1] Wissam Antoun, Fady Baly, and Hazem Hajj. 2020. AraBERT: Transformer-based model for Arabic language understanding. In Proceedings of the LREC 2020 Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT4), 11–16 May 2020, 9–15.
- [2] Wissam Antoun, Fady Baly, and Hazem Hajj. 2021. AraGPT2: Pre-trained transformer for Arabic language generation. In Proceedings of the 6th Arabic Natural Language Processing Workshop. Association for Computational Linguistics, Kyiv (Virtual), 196–207.
- [3] Christian Bizer. 2009. The emerging web of linked data. IEEE Intelligent Systems 24, 5 (2009), 87–92.
- [4] Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ - A large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, 5016–5026.
- [5] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 8440–8451.
- [6] Kareem Darwish and Hamdy Mubarak. 2016. Farasa: A new fast and accurate Arabic word segmenter. In Proceedings of the 10th International Conference on Language Resources and Evaluation. 1070–1074.
- [7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, 4171–4186.
- [8] Obeida ElJundi, Wissam Antoun, Nour El Droubi, Hazem Hajj, Wassim El-Hajj, and Khaled Shaban. 2019. hULMonA: The universal language model in Arabic. In Proceedings of the 4th Arabic Natural Language Processing Workshop. 68–77.
- [9] Thiago Castro Ferreira, Claire Gardent, Nikolai Ilinykh, Chris van der Lee, Simon Mille, Diego Moussallem, and Anastasia Shimorina. 2020. The 2020 bilingual, bi-directional WebNLG+ shared task: Overview and evaluation results (WebNLG+ 2020). In Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web.
- [10] Thiago Castro Ferreira, Chris van der Lee, Emiel van Miltenburg, and Emiel Krahmer. 2019. Neural data-to-text generation: A comparison between pipeline and end-to-end architectures. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, 552–562.
- [11] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. The WebNLG challenge: Generating text from RDF data. In Proceedings of the 10th International Conference on Natural Language Generation. 124–133.
- [12] Mihir Kale and Abhinav Rastogi. 2020. Text-to-text pre-training for data-to-text tasks. In Proceedings of the 13th International Conference on Natural Language Generation, Association for Computational Linguistics, Dublin, 97–102. https://aclanthology.org/2020.inlg-1.14.
- [13] Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for Neural Machine Translation. In Proceedings of ACL 2017, System Demonstrations. Association for Computational Linguistics, Vancouver, 67–72. https://aclanthology.org/P17-4012.
- [14] . 2020. Leveraging large pretrained models for WebNLG 2020. In Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web. 117–124.
- [15] Amit Moryossef, Yoav Goldberg, and Ido Dagan. 2019. Step-by-step: Separating planning from realization in neural data-to-text generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, 2267–2277.
- [16] . 2020. Template-based multi-solution approach for data-to-text generation. In Proceedings of the European Conference on Advances in Databases and Information Systems. Springer, 157–170.
- [17] Ankur Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and Dipanjan Das. 2020. ToTTo: A controlled table-to-text generation dataset. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 1173–1186.
- [18] Ratish Puduppully, Li Dong, and Mirella Lapata. 2019. Data-to-text generation with content selection and planning. In Proceedings of the AAAI Conference on Artificial Intelligence. 6908–6915.
- [19] Yevgeniy Puzikov and Iryna Gurevych. 2018. E2E NLG challenge: Neural models vs. templates. In Proceedings of the 11th International Conference on Natural Language Generation. 463–471.
- [20] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
- [21] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 140 (2020), 1–67.
- [22] Clément Rebuffel, Laure Soulier, Geoffrey Scoutheeten, and Patrick Gallinari. 2020. A hierarchical model for data-to-text generation. In Advances in Information Retrieval (ECIR 2020), LNCS 12035. Springer, 65–80.
- [23] Leonardo F. R. Ribeiro, Martin Schmitt, Hinrich Schütze, and Iryna Gurevych. 2021. Investigating pretrained language models for graph-to-text generation. In Proceedings of the 3rd Workshop on Natural Language Processing for Conversational AI, Association for Computational Linguistics, Online, 211–227.
- [24] Sascha Rothe, Shashi Narayan, and Aliaksei Severyn. 2020. Leveraging pre-trained checkpoints for sequence generation tasks. Transactions of the Association for Computational Linguistics 8 (2020), 264–280.
- [25] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems. 5998–6008.
- [26] Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: A free collaborative knowledgebase. Communications of the ACM 57, 10 (2014), 78–85.
- [27] . 2019. Selecting, planning, and rewriting: A modular approach for data-to-document generation and translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation. 289–296.
- [28] Sam Wiseman, Stuart M. Shieber, and Alexander M. Rush. 2017. Challenges in data-to-document generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, 2253–2263.
- [29] Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 483–498.
- [30] Rong Ye, Wenxian Shi, Hao Zhou, Zhongyu Wei, and Lei Li. 2020. Variational template machine for data-to-text generation. In International Conference on Learning Representations. https://openreview.net/forum?id=HkejNgBtPB.
Automated Generation of Human-readable Natural Arabic Text from RDF Data