A Review of Greek NLP Technologies for Chatbot Development

The advent of Generative AI has boosted interest in developing innovative chatbot applications. Despite the vast amount of machine learning (ML) and natural language processing (NLP) research and English-language resources that greatly improve chatbot technology, the corresponding research and resources for the Greek language are limited. The contribution of this paper is twofold: (i) it reports on the state-of-the-art research in Greek NLP, as far as language resources, embeddings-based techniques, deep learning models, and existing chatbot applications are concerned; (ii) it offers a set of insights on current NLP models and chatbot implementation methodologies, and outlines a set of pending issues and future research directions.


INTRODUCTION
The recent advent of OpenAI's ChatGPT, an artificial intelligence (AI) chatbot that uses state-of-the-art deep learning (DL) and machine learning (ML) techniques, has heightened interest and advanced research in chatbot technology. Existing surveys mention many successful chatbot applications in both the private [1] and the public sector [2]. These applications range across many fields, including e-government [2], education, healthcare, and customer relationship management [3]. In any case, the majority of existing chatbot systems use a neural classification architecture to predict the user intent and respond based on a list of pre-determined answers [3]. However, this approach leads to the development of chatbots that fail to understand the nuances of human communication, which is a crucial shortcoming; as reported in [4], users prefer chatbots with 'human-like' conversation skills that offer an engaging experience through a familiar turn-based messaging interface.
We argue that the integration of recent DL, ML and Natural Language Processing (NLP) advancements will significantly benefit current chatbot technology, enabling an efficient analysis of textual data, while also taking into consideration important semantic and contextual information. Of particular importance is the utilization of the Transformer architecture [5], and particularly BERT [6], which brought a significant increase in accuracy across various NLP tasks. With respect to the Greek language, early research work focusing on Greek NLP has been thoroughly presented in [7]; admittedly, compared to widely spoken languages such as English, there exist far fewer NLP resources and limited research on the incorporation of DL techniques and models [8]. The reviewed Greek NLP techniques and approaches are in accordance with global developments, which use model architectures originally built for the English language and adapt them to a multilingual setting.
In line with the above, this paper reports on Greek NLP technologies for chatbot development, as far as language resources, embeddings-based techniques, and DL models for the Greek language are concerned. The overall contribution of our work is twofold: (i) we thoroughly present the state-of-the-art in NLP and chatbot approaches for the Greek language; (ii) we offer a set of insights on the utilization and required extension of current NLP methodologies and models. For the needs of the research reported here, we used the search term "greek" along with the terms "chatbot", "classification", "summarization", "deep learning", "machine learning", "survey" and "study". Our search was conducted in Google Scholar, Elsevier Scopus, and the ACM Digital Library. We narrowed down the findings of our search by only considering papers that have received more than 10 citations, or papers from a journal with an impact factor greater than 2.0, the only exception being some notable papers published in 2023, which introduce state-of-the-art DL models for the Greek language. It is noted that commercial chatbot applications for the Greek language (e.g., those developed by Crowdpolicy) fall out of the scope of the research reported in this paper.
The remainder of this paper is organized as follows: recent NLP resources for the Greek Language are presented in Section 2; Greek chatbot applications are reviewed in Section 3; finally, concluding remarks, issues requiring attention, and future research directions are outlined in Section 4.

GREEK NLP RESOURCES
Classical NLP approaches focus on lexical, grammatical, and syntactic methodologies. However, these methodologies do not encapsulate textual semantic context. This context is captured by word embedding models, such as Word2Vec [9] and FastText [10]. Classical NLP approaches also use simple neural ML architectures, which are less accurate than their DL counterparts. Considering the above, notable modern NLP research efforts for the Greek language are presented below. Specifically, this section is organized as follows: Section 2.1 presents research works pertaining to Greek embeddings; Section 2.2 showcases prominent Greek DL language models; Section 2.3 highlights important Greek DL applications.

Greek Embeddings
spaCy is an open-source Python library, which supports many NLP tasks and provides pre-trained NLP models for more than 73 languages. Specifically, support for the Greek language was introduced in 2018. This addition includes a pre-trained Greek NLP model (el_core_news_lg), which provides word embeddings built on the Common Crawl and Wikipedia datasets, using the FastText continuous bag-of-words (CBOW) model [10].
The work described in [11] focuses on the creation of word embedding representations by assembling, through a web-crawling process, a corpus of Greek web documents. This process lasted 45 days, resulting in a 10 TB unprocessed HTML corpus. The researchers applied pre-processing techniques to remove unwanted textual data (e.g., HTML tags, JavaScript code, non-Greek characters, etc.) from the initial corpus. In addition, they removed duplicate sentences from a total of 498 million sentences, resulting in 118 million unique sentences. The authors utilized the processed corpus and the FastText skip-gram model [10] to produce Greek word embeddings.

Sentence-Transformers is an open-source Python library, which provides state-of-the-art sentence embeddings from the Sentence-BERT model [12], building on BERT. Sentence-Transformers offers a list of pre-trained embedding models, which have been reported to achieve top performance in various NLP tasks, including similar-sentence extraction, paraphrase mining, semantic search, and text classification. Some of these models are multilingual and support more than 100 languages, including Greek.
Finally, the work described in [13] deals with the problem of training and evaluating word embeddings for the Greek language. The authors build five Greek word embedding models, which were trained on a corpus of Greek web pages [11] using the FastText and Word2Vec models. They evaluated and compared these models with a FastText skip-gram model trained on Wikipedia data and a FastText CBOW model with position weights [10] trained on Common Crawl and Wikipedia data. The authors also introduce two evaluation datasets for word embeddings, the former focusing on Greek word analogies and the latter on Greek word similarities.

Greek Deep Learning Models
GREEK-BERT [14] is a pre-trained Greek language model based on the BERT implementation. The model was trained for 5 days on a single Google Cloud TPU instance, using a 29 GB Greek dataset with a vocabulary of 35,000 tokens. This dataset was compiled from three sources: (i) articles from the Greek Wikipedia; (ii) Greek documents collected from the European Parliament Proceedings; and (iii) the Greek part of the OSCAR dataset. The dataset tokens were pre-processed to be lowercased and stripped of diacritics. During pre-training, the model employed two learning tasks: Masked Language Modelling and Next Sentence Prediction. GREEK-BERT was also fine-tuned on various downstream tasks, including Part-of-Speech (POS) tagging, Named Entity Recognition (NER), and Natural Language Inference. The experimental results showed that GREEK-BERT outperforms other multilingual transformer models supporting the Greek language, as well as other neural models utilizing pre-trained Greek word embeddings. The authors have published their model online.
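As a minimal usage sketch, GREEK-BERT can be queried through the Hugging Face transformers pipeline. The checkpoint identifier below is assumed to be the one published by the authors on the Hugging Face Hub; note that the example input is lowercased and accent-free, matching the pre-processing described above.

```python
# Minimal sketch: querying GREEK-BERT's Masked Language Modelling head.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="nlpaueb/bert-base-greek-uncased-v1")

# "Athens is the [MASK] of Greece." -- the model should rank
# "πρωτευουσα" ("capital") among its top predictions.
predictions = fill_mask("η αθηνα ειναι η [MASK] της ελλαδας.")
for p in predictions[:3]:
    print(p["token_str"], round(p["score"], 3))
```

The same checkpoint can be loaded with `AutoModelForSequenceClassification` and fine-tuned for the downstream tasks listed above.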
GREEK-LEGAL-BERT [15] is a DL language model created for the NER task on Greek legal texts. To pre-train this model, a 5 GB dataset containing all legal documents from Greek legislation was used. Similarly to GREEK-BERT, the model was trained by: (i) using the BERT implementation; (ii) building a vocabulary of 35,000 tokens; (iii) using similar training parameters; and (iv) training on a single Google Cloud TPU instance. GREEK-LEGAL-BERT is evaluated alongside GREEK-BERT using precision, recall, and F1 metrics for each class (NER entity type). Both models achieve similar results, namely a weighted-average F1 score of 75% over all predicted classes.

GREEK-BART [8] is the first pre-trained sequence-to-sequence monolingual (Greek) model based on BART [16]. Unlike GREEK-BERT, it can be used for generative textual tasks (e.g., text summarization), although it can also be used for discriminative tasks (e.g., classification). The model is pre-trained on a large 87.6 GB Greek corpus, with a vocabulary of 50,000 sub-words extracted from a 20 GB random corpus sample. This corpus was compiled from the same datasets as GREEK-BERT's, additionally including the Greek web corpus dataset [11]. The authors report that the corpus intentionally includes diverse Greek textual types (encyclopedic, political, journalistic, etc.), as well as both formal and informal text, so as to make the pre-training of GREEK-BART more robust. The authors also pre-processed the pre-training corpus by removing URLs, emojis, tags, and hashtags, and by manually removing sentences with no additional contextual meaning. They also removed short documents and duplicate information.

Greek Deep Learning Applications
During our research of the Greek NLP literature, we discovered notable recent works that build domain-specific Greek DL approaches and, in some cases, contribute pertinent language resources. These works are categorized into one of three distinct application areas (i.e., General Language Processing, Law, and Toxic / Offensive Language Detection).

General Language Processing. The Neural NLP toolkit for Greek [17] is an NLP library based on a deep neural network architecture. It currently incorporates many features, such as POS tagging, lemmatization, dependency parsing, and text classification. This library builds on linguistic resources such as web-mined texts, word embedding representations, and collections of texts manually annotated at different levels of linguistic analysis. The authors report that their library achieves high accuracy scores across its features.

The linguistic complexity of Greek textbooks, treated as a readability classification task, is elaborated in [18]. The authors analyze textbook data from five different corpora. Three of these corpora contain data from different school subjects (Greek Language, History, Science), divided into three educational levels: primary, lower secondary (gymnasium), and upper secondary (lyceum). The two remaining corpora include textbooks and coursebooks used for teaching Greek as a second language, with their data classified into three language proficiency levels: Basic (A), Independent (B), and Proficient (C). Regarding pre-processing, a Greek dependency parser is utilized to annotate the corpora. The parser produced files that the authors' code used to extract "complexity" features, return a vector representation for each document, and capture lexical, morphological, and syntactic features. In the classification experiments, several ML algorithms were evaluated using accuracy and F1 scores combined with 10-fold cross-validation.

PENELOPIE (Parallel EN-EL Open Information Extraction) [19] is an approach that bridges the gap between high- and low-resource languages in the context of Open Information Extraction (OIE). The objectives of this work are twofold. First, Neural Machine Translation (NMT) transformer-based models are used to translate between English and Greek. Second, the NMT models are exploited to generate English translations of Greek text, to which an NLP model applies a series of pre-processing and triple extraction tasks; finally, the extracted triples are translated back into Greek. PENELOPIE thus aims to enhance OIE for Greek text corpora. The proposed approach combines three discrete modules, namely a coreference resolution, a summarization, and a parallel triple extraction module.
Regarding Greek extractive text summarization (TS), two open-source libraries exist: pyTextRank [20] and sumy [21]. Both libraries implement various extractive TS approaches that rely on statistical or graph-based algorithms. These extractive approaches can also be used for the sibling task of Keyphrase Extraction (KE), an NLP task that aims to extract the top-n most salient keyphrases from one or multiple documents. However, newer KE approaches that build on transformer-based embedding models mostly support English, as their underlying models (e.g., BERT) were trained only on English corpora. In contrast, LMRank [22] is a transformer-based KE approach that currently supports 14 languages, including Greek. LMRank was evaluated for English, where it surpassed other state-of-the-art embeddings-based KE approaches in accuracy and computational performance.

Law. The Greek Legal Codes dataset [23] is a legal document classification dataset, which contains more than 47,000 categorized resources of Greek legislation. The authors retrieve these texts from the Official Government Gazette, where Greek legislation is published. They transform the original data into a JSON file comprising 47,563 legal documents that can be classified into 47 legislative volumes, 389 chapters, and 2,285 subject categories. In addition, the introduced dataset is used to evaluate several classification models for Greek legal texts, ranging from traditional ML and RNN-based models to Transformer-based ones (i.e., GREEK-BERT and GREEK-LEGAL-BERT). The results of the experimental evaluation show that the two BERT models outperform the other considered models.

GreekLegalSum [24] is a legal document summarization dataset for the Greek language. The dataset contains 8,395 court decisions from Areios Pagos, the Supreme Civil and Criminal Court of Greece, along with their summaries; 6,370 decisions are also classified with one or more case tags. To collect the dataset, the authors performed web scraping on the Areios Pagos website. The site contains information on both civil and criminal law cases, while providing additional metadata for parts of the cases (e.g., summaries, case category, case tags, etc.). The dataset contains long multi-page documents abounding with formal language and terminology that requires legal expertise. Overall, the authors focused their efforts on building a dataset that covers a wide variety of legal domains (e.g., civil, criminal, administrative law, etc.). The authors also use this corpus to fine-tune GREEK-BERT for extractive summarization.

Toxic / Offensive Language Detection. The Offensive Greek Tweet Dataset (OGTD) [25] is a manually created dataset containing 4,779 Greek tweets classified as offensive or non-offensive. Along with a detailed description of the dataset, the authors evaluate several computational models, which are trained and tested on it. To create OGTD, a total of 49,154 tweets were collected via the Twitter API from Greek hashtags popular at the time of writing. These tweets were then pre-processed to remove duplicate information, emoticons, double punctuation marks, and usernames. A random sample was then manually annotated by three volunteers.

A hate speech detection approach for Greek, combining Computer Vision and NLP models, is presented in [26]. The research focuses on xenophobic and racist Twitter posts directed towards immigrants and refugees. The proposed approach uses pre-trained embeddings from GREEK-BERT, and a dataset of ∼23 million Greek tweets generated over a 10-year period (2008-2018) by 5,000 users, to develop a new language model (published at https://huggingface.co/Konstantinos/BERTaTweetGR). In addition, the approach utilizes image classification networks. In the final stage, the embeddings derived from the language models and the image classification networks are fed into a neural network, which classifies tweets as toxic or non-toxic.

GREEK CHATBOT APPLICATIONS
As mentioned in recent chatbot surveys [1, 3], the most common techniques incorporated in the development of chatbots are: (i) rule-based, where a chatbot is designed around certain rules and constraints; (ii) NN-based, where a chatbot employs a neural network (NN) classification approach to infer the intention of the user and respond accordingly; (iii) knowledge-based, where a chatbot utilizes a (generic or domain-specific) knowledge base to infer certain facts by considering a line of questions asked by the user; (iv) semantic-based, where the chatbot uses an ontology that captures certain entities and the relations between them (e.g., the Core Public Service Vocabulary (CPSV) standard); (v) deep-learning-based, where the chatbot uses transformer embeddings to augment its semantic capabilities, or deep learning models that can handle complex tasks (such as Natural Language Inference and Natural Language Understanding) with human-like accuracy. Although the literature on chatbot development and associated applications is quite extensive, only a few works report on chatbots supporting the Greek language. These works are briefly presented in the rest of this section; a comparative view of their constituent technologies appears in Table 1.
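Technique (ii) above can be illustrated with a minimal intent classifier. The intents, training phrases, and canned answers below are invented for illustration, and a TF-IDF model with logistic regression stands in for the neural classifiers used in practice; character n-grams keep the sketch robust to the rich inflection of Greek.

```python
# Minimal sketch of an intent-classification chatbot core (all data invented).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

training_phrases = [
    ("γεια σου", "greeting"),
    ("καλημέρα σας", "greeting"),
    ("πώς βγάζω διαβατήριο", "passport_info"),       # "how do I get a passport"
    ("θέλω να εκδώσω διαβατήριο", "passport_info"),  # "I want to issue a passport"
    ("ποιο είναι το ωράριο λειτουργίας", "opening_hours"),
    ("τι ώρες είστε ανοιχτά", "opening_hours"),
]
texts, intents = zip(*training_phrases)

classifier = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
classifier.fit(texts, intents)

# A pre-determined answer is returned for the predicted intent.
answers = {
    "greeting": "Γεια σας! Πώς μπορώ να βοηθήσω;",
    "passport_info": "Για έκδοση διαβατηρίου χρειάζεστε ...",
    "opening_hours": "Είμαστε ανοιχτά 08:00-16:00.",
}
intent = classifier.predict(["πού εκδίδεται το διαβατήριο"])[0]
print(intent, "->", answers[intent])
```

The fixed answer table is precisely the limitation discussed in the Introduction: such a chatbot cannot handle nuances outside its pre-determined intents.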
The approach described in [2] focuses on improving digital communication channels between citizens and the government by leveraging chatbot technology in combination with NLP, ML, and data science techniques. As argued, digital channels are of lower cost compared to traditional ones (e.g., personal visits to government offices, phone calls, etc.); however, they are often prone to miscommunication and lack of expressiveness (e.g., citizens can only enter certain keywords to search for the information they need, or can only fill in certain fields in online forms). To remedy such issues, the proposed solution concerns a smart digital communication channel based on chatbots with advanced capabilities (compared to previous rule-based chatbots). Overall, the main contribution of this solution is the effective integration of a set of well-tried tools and services to meet the variety of requirements related to citizen-government communication. The overall approach was elaborated and validated in close cooperation with three Greek governmental bodies (i.e., the Ministry of Finance, a social security agency, and a large local government organization).
A chatbot aiming to help citizens find public sector services, which appears on a Greek website, is described in [27]. After comparing existing chatbots focusing on public sector services, several missing features were identified, such as real-time data retrieval, multilingual support, the use of the CPSV standard to represent semantic data, and appropriate assistance to help users find the desired services. This work proposes a modern chatbot system that incorporates all of the above features to satisfy user needs. In addition, the advantages of using the CPSV standard for annotating public services are reported; according to the author, this standard had not been used before in chatbot implementations.
Another pilot application is available on the ERMIS Greek e-government website [28]. The application follows an architecture comprising four layers: (i) the graphical user interface; (ii) the chatbot engine; (iii) the application programming interface; and (iv) linked data repositories. The chatbot engine integrates the method of Life Events (LEs), a method of personalizing public services. LEs are described by the authors as sets of government services required at specific life stages. The authors argue that the integration of LEs leads to personalized and user-friendly interfaces. This work also demonstrates two usage scenarios for the pilot application, along with its evaluation.
The importance of chatbot technologies as a means of providing citizens with accurate, accessible and personalized information about public services is discussed in [29]. In this context, the authors examined several chatbot platforms to develop a chatbot application called "PassBot", which provides personalized information regarding the public service of "obtaining a Greek Passport". The description of this public service was developed using the CPSV-AP standard.
Finally, the application of chatbots for providing personalized information in public services has been elaborated in [30]. The focus of this work is on the integration of chatbot and knowledge graph (KG) technologies as a means to overcome the limitations of previous works regarding flexibility and reusability. The proposed approach utilizes the public service of "Getting a Passport" as a basis to develop a proof-of-concept chatbot-KG integration. The approach was evaluated with respect to its ease of operation, usefulness, and usability, yielding promising results.
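The chatbot-KG integration idea can be sketched with a toy in-memory triple store. The triples, predicates, and values below are invented for illustration; a production system would query an actual knowledge graph (e.g., an RDF store via SPARQL).

```python
# Toy sketch of a knowledge-graph-backed chatbot lookup (all data invented).

# (subject, predicate, object) triples describing a public service.
triples = [
    ("Getting a Passport", "requires", "identity card"),
    ("Getting a Passport", "requires", "application form"),
    ("Getting a Passport", "provided_by", "Hellenic Police"),
]

def answer(service, predicate):
    """Return all objects linked to `service` via `predicate`."""
    return [o for s, p, o in triples if s == service and p == predicate]

# The chatbot maps a recognized user intent to a predicate lookup,
# e.g., a "what documents do I need?" intent resolves to:
print(answer("Getting a Passport", "requires"))
```

Because the answers live in the graph rather than in the chatbot's code, extending the chatbot to a new public service only requires adding triples, which is exactly the flexibility and reusability argument made in [30].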

DISCUSSION
This paper has reviewed a set of prominent recent works on Greek NLP technologies and resources (including embeddings, DL language models, and language data sources), as well as Greek chatbot applications (mainly in e-government and public services, e.g., issuing a passport), aiming to capture the state-of-the-art in NLP and chatbot research for the Greek language. Admittedly, language resources are essential for training robust Greek models, which in turn can be integrated into chatbots and advance their communication skills towards the automation of tedious tasks. It became clear that the technologies presented in this paper can further aid the development of chatbots for the Greek language.
Our review leads to the following set of observations:
• Since the introduction of the transformer architecture, and particularly BERT, there is a paradigm shift towards DL in recently published Greek NLP works.
• The use of pre-trained Greek embeddings increases accuracy due to the introduced semantic context, which is not available in traditional approaches.
• More Greek language resources are required, especially in the case of domain-specific applications (law, offensive language detection, etc.), to fine-tune the corresponding domain-specific language models.
• Most Greek NLP chatbots have been developed to assist citizens in accessing services offered by the Greek public sector.
• Current Greek NLP chatbots support the semantic-based technique of the CPSV standard, which improves the quality of the public services they offer; this comes in contrast with previous chatbot applications, which utilized only rule-based and neural-based techniques.
• The advancements of NLP and chatbot technologies go hand in hand.

Our review also revealed the following list of issues requiring further attention:
• Greek language models and language resources for various domains (healthcare, law, business, etc.) and tasks (fake news detection, sentiment analysis, etc.) need to be introduced.
• When a new DL model architecture is introduced that provides significant improvement over previous architectures, NLP researchers should either pre-train this architecture for the Greek language or fine-tune it for various domains (healthcare, law, finance, etc.) and tasks (fake news detection, sentiment analysis, etc.).
• Current chatbot applications have yet to integrate state-of-the-art DL Greek NLP techniques.
• There is a limited number of Greek NLP works that successfully utilize a knowledge base (e.g., knowledge graphs), which could semantically enrich the overall NLP approach.

Based on the above remarks, we propose the following list of future research directions:
• Introduce more language resources for the Greek language, including (i) domain-specific datasets with specialized terminology (healthcare, law, etc.); (ii) datasets of diverse writing styles (e.g., Greek literature, news, forums, finance); and (iii) datasets that cover various tasks (fake news detection, sentiment analysis, etc.).
• Fine-tune existing Greek DL models for other domains with specialized terminology (e.g., fine-tune models for healthcare, law, or offensive language detection using publicly accessible data).
• Future Greek NLP chatbots could benefit from recent advancements in DL techniques to improve their conversational abilities (e.g., transformer embeddings to augment their semantic capabilities, or deep learning models able to handle complex tasks in the thematic areas of Natural Language Inference and Understanding with human-like accuracy). Some example conversational tasks include: (i) question answering (e.g., "how is item A related to item B?", "how can I access it?", "can you explain action C to me?"); (ii) summarization [31] (e.g., "give me a summary of the discussion items C, D, E and F").

Table 1: Overview of Greek Chatbot applications and their technologies.