Abstract
This article tackles the problem of sentiment analysis in the Arabic language by putting forward a new deep learning model. The proposed model uses a hybrid bidirectional gated recurrent unit (BiGRU) and bidirectional long short-term memory (BiLSTM) additive-attention model, where the bidirectional GRU/LSTM reads the individual sentence input from left to right and vice versa, enabling the capture of contextual information. In addition, the model is trained on two types of embeddings: FastText and local learnable embeddings. The BiLSTM and BiGRU architectures are put into competition to identify the best hyperparameter set for the model. The developed model has been tested on three large-scale, commonly employed Arabic sentiment datasets: the Large-scale Arabic Book Reviews dataset (LABR), the Hotel Arabic-Reviews Dataset (HARD), and the Book Reviews in Arabic Dataset (BRAD). The testing results demonstrate that our model outperforms both the baseline models and the state-of-the-art models reported in the original references of these datasets, achieving accuracy scores of 98.6%, 96.19%, and 95.65% for LABR, HARD, and BRAD, respectively. Furthermore, to demonstrate the generalization capabilities of our model, its performance has been evaluated on three other natural language processing tasks: news categorization, offensive speech detection, and Russian sentiment analysis. The results demonstrate that the developed model is language- and task-independent, which offers new perspectives for its application to several other natural language processing challenges.
1 INTRODUCTION
In the era of social media, sentiment analysis (SA) has emerged as an active research area in natural language processing. SA aims to identify and monitor the sentiment polarity/strength of user-generated text that encapsulates its author's opinion, emotion, and/or attitude towards entities such as services, organizations, products, and events, among others. The interest in SA grew with the exponential increase in the amount of recorded opinionated data in blogs and news. This renders SA analytics central in many disciplines and organizations. For instance, nowadays, many customers consult users' comments and sentiment polarity before making a purchasing decision on a newly introduced product or service. Organizations use SA either as a substitute or a supplement to standard surveys and opinion polls performed using questionnaire and field-observation methods. Consequently, SA provides organizations and policy-makers with valuable insights for reshaping their policy, business, marketing, and/or communication strategies in a way that accommodates citizens' concerns, increases organizational efficiency, and creates a societal impact. This makes SA almost a necessity in various disciplines where capturing citizens' views is deemed important [Liu 2016], ranging from traditional computer science fields to the humanities and medical fields. Indeed, the study of opinion trends becomes central to almost all human activities whenever there is a need to make a decision, whether through a semi-automated or a fully manual process. The variety of electronic documents used as input for SA, in terms of their structure, access policy, semantics, and content organization, among others, creates continuous challenges for automatic sentiment analysis tasks. Among these challenges, one distinguishes language-related patterns. Indeed, the efficiency of existing natural language processing (NLP) parsers differs from one language to another.
This is due to the quality of the training dataset and the methodological approaches employed in the development of the underlying parser, as well as the complexity of the social norms and stylistic cues embedded in the language. A simple statistical count of publication numbers per language indicates clearly that the Arabic language is underrepresented (see Figure 1). This also holds when comparing the maturity of the technology and the level of performance obtained. For instance, the best state-of-the-art sentiment polarity accuracy achieved in the SemEval competition [Rosenthal et al. 2017] associated with Arabic SA is only 58.1%, as compared to more than 96% in Al-Dabet and Tedmori [2019].
Fig. 1. Number of NLP publications each year and per language [Farghaly and Shaalan 2009].
This motivates the current research, which aims to investigate a new deep learning approach for Arabic sentiment analysis tasks. Loosely speaking, Arabic language groups cover nearly 500 million speakers worldwide [Boudad et al. 2018], making it the fourth common spoken language [Guellil et al. 2021] and the largest member of the Semitic Language Family [Zitouni 2014]. Besides, the Arab world has recently witnessed a series of events (e.g., Arab Spring) and fast-growing e-commerce trading activities in the area as well as the proliferation of social media users, which raise the prospects and the importance of Arabic sentiment analysis.
Technology for unfolding Arabic sentiment analysis is ultimately linked with that of sentiment analysis and natural language processing regardless of the language context. In this context, one distinguishes at least three streams of approaches: machine learning (both supervised and unsupervised), lexicon-based, and hybrid methods [Zhang et al. 2011]. Supervised machine learning approaches combine appropriate feature engineering methods (e.g., tf-idf features, N-grams, selected adjectives/adverbs, specialized dictionaries, etc.), a given classifier (e.g., Naive Bayes (NB), Support Vector Machine (SVM), Logistic Regression (LR), etc.), and an appropriate training/testing dataset. Lexicon-based methods use a collection of sentiment terms that are precompiled into a sentiment lexicon. These are further divided into dictionary- and corpus-based approaches that use either semantic or statistical methods to gauge the extent of a given sentiment polarity by accounting for various grammatical constructs and syntactic patterns. Hybrid approaches involve a combination of machine-learning and lexicon-based approaches.
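As a minimal, hedged illustration of such a supervised pipeline (tf-idf features feeding a linear classifier), the following scikit-learn sketch trains on a toy set of Arabic reviews; the documents, labels, and hyperparameters are illustrative placeholders, not any setup evaluated later in this article:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy labeled reviews (1 = positive, 0 = negative); a real system would use
# a large annotated corpus such as LABR or HARD.
docs = ["خدمة ممتازة", "منتج سيء جدا", "تجربة رائعة", "لا انصح به"]
labels = [1, 0, 1, 0]

clf = Pipeline([
    # unigram + bigram tf-idf features, as in classical supervised SA systems
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(docs, labels)
print(clf.predict(["خدمة رائعة"]))  # expected: [1] (positive)
```

Lexicon-based and hybrid streams would replace or complement the tf-idf features with precompiled sentiment-lexicon scores.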
A special case of machine learning that has shown state-of-the-art results in many SA-related competitions is deep learning. For instance, all the top winners in recent SemEval Sentiment Analysis competitions used deep learning models (e.g., CNN, RNN, LSTM) [Rebiai et al. 2019]. Recent studies on deep-learning Arabic sentiment analysis have focused on using recurrent neural networks (RNNs) [Heikal Maha 2018; Al-Azani and El-Alfy 2018], which are specialized in processing sequential data. Nevertheless, a major limitation of RNNs is their incapacity to memorize longer sequences and the vanishing and/or exploding gradient estimates [Bengio et al. 1994], which restrict their capability to account for the discourse effect. RNNs with memory gates, such as LSTM (Long Short-Term Memory) [Hochreiter and Schmidhuber 1997], GRU (Gated Recurrent Units) [Chung et al. 2014], and Bidirectional RNN [Schuster and Paliwal 1997], are often put forward as alternative strategies to tackle this issue. Still, their capability of memorizing long sequences is also questioned, because the gated networks' sentence representation depends only on past and current data states, which is not sufficient for most sentiment analysis problems, especially in the Arabic SA context. As BiRNNs are found to suffer from sequential bias as well as a lack of interpretability, the attention mechanism [Bahdanau et al. 2015] was introduced to allow RNNs to focus on the input sequence's relevant segments. Its performance is found to surpass that of the recurrent network model in memorizing longer sequences. Nowadays, the attention mechanism plays a dominant role in most NLP tasks, especially in many state-of-the-art models such as the Transformer [Vaswani et al. 2017], BERT [Devlin et al. 2019], GPT-2 [Radford et al. 2019], and XLNet [Yang et al. 2019]. Beyond improving the neural nets' performance, the attention mechanism brings an interpretable aspect to neural network-based models.
We, therefore, hypothesize that the contribution of the attention mechanism to Arabic SA tasks can be positive as well. In this article, we thereby advocate a new deep learning model that uses the attention mechanism as an additive layer to the BiGRU model. The proposed model is next tested using three Arabic labeled datasets: LABR (Large-scale Arabic Book Review) [Aly and Atiya 2013], HARD (Hotel Arabic Review Dataset) [Elnagar et al. 2018], and BRAD (Book Review in Arabic Dataset) [Elnagar and Einea 2016]. Furthermore, the generalization of the model beyond the SA tasks is investigated by testing the model on other NLP tasks; mainly, a hate speech dataset and a Russian SA dataset.
Research Objectives. Due to the relatively low accuracy and the inherent limitations of the tools employed in Arabic language analysis, such as morphological analyzers, PoS taggers, and stemmers, there is a potential for systems that can automatically extract and classify opinions present in user-generated documents. This article aims to contribute to this overall goal. More specifically, the proposed research objectives are to:
• Comprehensively review the challenges associated with Arabic SA in both MSA and DA,
• Review deep learning approaches for Arabic SA,
• Propose and implement a new additive-attention sequence-level model for Dialectal Arabic sentiment polarity detection,
• Propose an approach for tuning the parameters of the developed model,
• Demonstrate the feasibility of the proposal through comparison with other state-of-the-art models,
• Demonstrate the extension of the model to other applications.
Contributions. The main contributions of this article are summarized below:
• We trained both BiLSTM and BiGRU models on four datasets, performing a grid search to select the best hyperparameter set, and selected the winning architecture to train the proposed model, using two sets of embeddings: learnable embeddings and FastText embeddings.
• We proposed a BiGRU additive-attention sequence-level model to detect and analyze sentiments in Arabic reviews, and tested its performance on three supervised Arabic datasets labeled as positive and negative.
• We experimentally verified that our proposed model outperforms the baseline and some state-of-the-art models, and demonstrated that our model is language-independent by testing it on English and Russian datasets.
• We empirically demonstrated that our model can be extended to other NLP challenges by testing it on hate speech detection and news categorization tasks.
The article is organized as follows: Section 2 outlines the challenges associated with natural language processing of the Arabic language. In Section 3, we introduce the most recent related works targeting the Arabic sentiment analysis problem. Section 4 highlights the general methodology, including the motivating grounds, the overall architecture, and its different components. Section 5 covers the experimental setting and the associated results, highlighting the datasets employed, the tuning of the model's hyperparameters, the baseline models, the performance metrics, and the obtained results. Finally, Section 6 discusses the implications of the results and the inherent limitations. Section 7 summarizes the major findings and lists future works.
2 NLP CHALLENGES OF THE ARABIC LANGUAGE
In contrast to the other big-four EU languages (English, Spanish, French, German), Arabic SA faces at least three key challenges. First, the Arabic language has a standard version that is well-understood across the Arab world, known as Modern Standard Arabic (MSA). However, most social media content is rather associated with Dialectal Arabic (DA), which often substantially differs from MSA, while DA suffers from a lack of standards and a scarcity of tools employed in the processing pipeline [Abdul-Mageed et al. 2020; Meftouh et al. 2015]. This negatively impacts the contextual understanding of the content and, thereby, the performance of SA tools. Second, figurative language is very rich in Arabic, in both MSA and DA, as manifested by its various linguistic devices, such as metaphors, analogy, irony, sarcasm, euphemism, hyperbole, context shift, false assertions, oxymorons/paradox, and rhetorical questions, used to communicate more complicated meanings [Farha and Magdy 2021]. This issue is widely unexplored in the current state of the art of Arabic sentiment analysis tools. Third, as a result of the preceding, conveying a fine-tuned evaluation of Arabic sentiments often requires a high-level understanding of the content, which may go beyond the boundary of the given short text message, requiring, for instance, knowledge of prior texts and sometimes even of the subsequent textual messages as well. This especially holds for polysemous words, where word sense disambiguation requires discourse analysis to reveal the correct sense of the target word. The above makes the already available Arabic parsers highly complex even for simple NLP tasks. For instance, Khalifa et al. [2020] pointed out that a complete part-of-speech (POS) tagset in MSA has over 300k tags and would require 12.2 morphological analyses per word, compared to just 50 tags and 1.25 analyses per word for the English language.
This high ambiguity is primarily due to Arabic orthography, which almost always omits the diacritics that are used to specify short vowels and consonantal doubling. Furthermore, the Arabic language has complex morpho-syntactic agreement rules and many irregular forms. Moreover, the lack of large-scale relevant benchmark datasets and ground truth restricts the development of efficient machine learning methods. This motivated some scholars to consider Arabic among low-resource languages on the semantic side because of the limitations of current parsers. For instance, the Arabic WordNet project [Jha et al. 2016] contains less than 30% of the synsets of the English WordNet project. Similarly, given the rich morphology of the Arabic language and the inherent challenges caused by Arabic dialects, one expects difficulties with the choice of an appropriate pre-processing pipeline, including text segmentation, normalization, choice of stopword list, and stemming, among others. Table 1 illustrates some examples of these challenges.
Building on work carried out by Oueslati et al. [2020] and the above discussion, Figure 2 provides a graphical illustration of the main challenges impacting the development of Arabic sentiment analysis.
Fig. 2. Arabic Sentiment Analysis Challenges (extended from Oueslati et al. [2020]).
3 RELATED WORKS
3.1 Models for Arabic Sentiment Analysis
Traditionally, research studies targeting Arabic sentiment analysis used simple supervised models with conventional feature extraction techniques. The latter include Bag-of-Words (BOW), Term Frequency-Inverse Document Frequency (TF-IDF), and N-grams [Gamal et al. 2019; Altowayan and Tao 2016]. The main drawback of these models is that their performance decreases when the data become large and complex. Besides, the features do not encode or represent the semantic relationships among the tokens. A comparative study on Arabic Sentiment Analysis (ASA) was conducted by Farha and Magdy [2021], where the authors replicated some recent state-of-the-art methods for Arabic sentiment analysis and examined the effectiveness of using Transformer-based models, especially BERT models, on Arabic SA tasks. The fine-tuned models reported a better classification accuracy over three Arabic SA datasets: ASTD [Nabil et al. 2015], SemEVAL17 [Rosenthal et al. 2017], and ArSAS [Elmadany et al. 2018]. Heikal Maha [2018] trained different neural network models, such as LSTM, CNN, and RCNN, on a collected dataset of 40k (positive/negative) labeled sentences. Their LSTM model achieved 81.3% accuracy. The authors then applied data augmentation to increase the size of the vocabulary, which enhanced the accuracy by +8.3%. Rosenthal et al. [2017] in SemEval 2017 hosted a shared Arabic SA task where El-Beltagy et al. [2017] ranked first. Their system used a set of hand-engineered and lexicon-based features and a Naive Bayes classifier for training. Elmadany et al. [2018] presented the first Arabic sentiment analysis online system that accommodates both MSA and Arabic dialects. Their model was composed of a CNN layer followed by a pooling and LSTM block and was trained on the SemEval2017 dataset [Rosenthal et al. 2017], achieving a 62% accuracy score. Tested on other datasets, their model achieved accuracy scores of 66% on the ASTD dataset [Nabil et al.
2015] and 92% on the ArSAS dataset [Elmadany et al. 2018], respectively. It is worth pointing out that word embeddings were used in most of the recent Arabic SA studies. For instance, Dahou et al. [2019] built neural word embeddings using Word2vec [Mikolov et al. 2013b], based on the CBOW and Skip-Gram architectures, on a vocabulary of 3.4 billion tokens from a 10-billion-token crawled corpus. The authors then trained a CNN model on top of these embeddings and evaluated their quality on five Arabic sentiment analysis datasets, where they were found to outperform four out of five previous works. Ombabi et al. [2020] used FastText skip-gram embeddings [Mikolov et al. 2018] as input to a CNN-LSTM-based model. The extracted features are then passed to an SVM classifier to generate the final classification. The CNN layer, inspired by the work presented by Kim [2014], performs local feature extraction and is followed by two LSTM layers that represent long-term dependencies. The model was validated on two datasets, Nabil et al. [2015] and LABR [Aly and Atiya 2013], achieving 89.72%, 90.20%, and 88.52% in terms of precision, recall, and F1-measure, respectively. Owing to the emergence of several annotated datasets using dialectal Arabic, sentiment analysis of DA has seen a renewal of interest [Hossain et al. 2019; Moudjari and Akli-Astouati 2020; Nabil et al. 2015; Elnagar et al. 2018]. For instance, Abu Kwaik et al. [2019] trained a BiLSTM-CNN model to analyze sentiments in different Arabic dialects. The model reported better results compared to two baselines that use LSTM and CNN models. In another study, Soumeur et al. [2018] studied sentiment analysis of the Algerian dialect. They worked on a collected dataset containing 100,000 comments, of which 25k reviews were labeled as negative, neutral, or positive. In their study, a CNN-based model achieved 89.5% accuracy.
Moudjari and Akli-Astouati [2020] collected two Algerian-dialect corpora from Facebook and Twitter, along with a combined Facebook/Twitter dataset. The authors then trained several ternary classification models, where the CNN-based model reported the best performance across all datasets. The attention mechanism was first applied in NLP by Bahdanau et al. [2015] in machine translation tasks to overcome the Seq2seq model's drawback in memorizing longer sequences. In text classification and, more specifically, sentiment analysis, Yang et al. [2016] proposed a hierarchical attention network (HAN) with two levels of attention mechanisms. At the word level, the model processes input words and aligns them to a sentence of interest, while at the sentence level, it aligns these sentences with the final class of positive-negative sentiment polarity. The authors demonstrated the effectiveness of their approach and its application to document classification, showing that it outperformed several previous methods. In essence, the attention mechanism relies on creating a context vector by taking a weighted average of feature vectors, which could be the hidden units of the neural nets, whose weights concentrate around a specific feature vector, enabling the model to focus on particular parts of the input sequence at each timestep [Goodfellow et al. 2016].
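This weighted-average construction can be sketched in a few lines of NumPy. The sketch below assumes an additive (Bahdanau-style) scorer; the shapes and the randomly initialized parameters `W` and `v` stand in for quantities that are learned during training:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, d_a = 5, 8, 4               # timesteps, hidden size, attention size
H = rng.normal(size=(T, d))       # hidden states from the recurrent layer

W = rng.normal(size=(d, d_a))     # attention parameters (learned in practice)
v = rng.normal(size=(d_a,))

scores = np.tanh(H @ W) @ v                      # one score per timestep
weights = np.exp(scores) / np.exp(scores).sum()  # softmax over timesteps
context = weights @ H                            # weighted average of states

print(weights.round(3), context.shape)           # weights sum to 1; context is (8,)
```

Large weights concentrate the context vector on the most influential timesteps, which is also what makes the weights usable for interpretation.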
Al-Dabet and Tedmori [2019] used an SRU (Simple Recurrent Unit) layer followed by an attention layer to weight the important terms. Their model achieved an accuracy of 94.53% on an Arabic sentiment analysis task. We shall also mention the KAUST-sponsored competition on Arabic Twitter sentiment analysis1 [Alamro et al. 2021]. The top-ranked team used the AraBERT model [Antoun et al. 2020] and achieved an accuracy score of 84.5%. Besides, all three of the best-performing participating teams utilized the AraBERT model to generate the tweet embedding representation. The second team utilized ACWE, which combines static character- and word-level models as suggested in Abdullah et al. [2021]. The training and testing used the ASAD dataset [Basma et al. 2021], a large collection of 100k Arabic tweets annotated for sentiment analysis tasks. Recently, the sixth Workshop on Arabic Natural Language Processing (WANLP'2021) organized a task on sentiment detection [Ibrahim Abu Farha and Walid 2021] on the DAICT dataset [Abbes et al. 2020], which contains 5,358 tweets. The top-performing team utilized the MARBERT model (accuracy score of 71.1%), while the second and third winning teams used AraBERT in combination with other deep learning models, achieving 70.4% and 69.5% accuracy scores, respectively. Loosely speaking, the recent advances in pretraining large Arabic language models have boosted the performance of many models in Arabic sentiment analysis. Fine-tuning the latest Arabic BERT-based model, MARBERT, which was pretrained on 6B Arabic tweets [Abdul-Mageed et al. 2021], has yielded better sentiment analysis performance on several Arabic SA datasets than the previous AraBERT model [Baly et al. 2020]. Also, the recent dialect-level DziriBERT model [Abdaoui et al. 2021], which is pretrained on a large corpus of Algerian tweets, outperformed previous neural net models on various dialectal datasets.
Table 2 summarizes some of the key research in this field, highlighting the employed architecture, the dataset, the number of classes, and the accuracy score achieved by each model.
| Reference | Architecture | Dataset | Acc(%) | # class |
|---|---|---|---|---|
| Alayba et al. [2018] | LSTM, CNN, RCNN | 40k Arabic tweets from [Alayba et al. 2018] | 65.05 | 3 |
| El-Beltagy et al. [2017] | NB | SemEVAL17 [Rosenthal et al. 2017] 9,655 (Task A) | 58.1 | 3 |
| Abu Farha and Magdy [2019] | CNN-LSTM | SemEval2017 [Rosenthal et al. 2017] (9,655) 3,315 Samples from ASTD [Nabil et al. 2015] ArSAS [Elmadany et al. 2018] (21k) | 62 66 92 | 3 3 4 |
| Altowayan and Tao [2016] | CNN (Word2vec+LR) | LABR Balanced [Aly and Atiya 2013] 16,024 ASTD Balanced [Nabil et al. 2015] 1,330 | 88 80.21 | 2 |
| Ombabi et al. [2020] | CNN-LSTM FastText (Skip-gram) | ASTD, LABR, TwitterASA [Abdulla et al. 2013]: total of 15,100 reviews | 90.75 | 2 |
| Abu Kwaik et al. [2019] | BiLSTM-CNN | LABR Trinary [Aly and Atiya 2013] 19,738 LABR Binary Balanced 13,160 LABR Binary UnBalanced 51,054 ASTD [Nabil et al. 2015] 2,899 Shami-Senti [Abu Kwaik et al. 2019] 2,242 (binary) | 66.42 81.14 80.2 85.58 93.5 | 3 2 2 3 3 |
| Soumeur et al. [2018] | CNN | A collected dataset contains 25,475 sentences. | 92.03 | 3 |
| Moudjari and Akli-Astouati [2020] | CNN | Corpus 1: Mataoui et al. [2016] 5,039 tweets Corpus 2: Mataoui et al. [2016] 22,761 tweets Combined corpus 1 and 2: 27,800 tweets | 79 79 80 | 2 3 3 |
| Al-Dabet and Tedmori [2019] | Attention + Recurrent Units | Binary version of LABR dataset (51,054) | 95.1 | 2 |
| Alamro et al. [2021] | MARBERT + CNN | 100k Arabic tweets ASAD dataset [Basma et al. 2021] | 84.9 | 3 |
| Ibrahim Abu Farha and Walid [2021] | MARBERT | 5k Arabic tweets from Abbes et al. [2020] | 71.1 | 3 |
| Abdul-Mageed et al. [2021] | MARBERT & ARBERT | LABR: Aly and Atiya [2013] HARD: Elnagar et al. [2018] SemEVAL: Rosenthal et al. [2017] ASTD-B: Nabil et al. [2015] | 92.51 96.17 71 96.24 | 2 2 3 2 |
Table 2. Summary of Related Works on Arabic Sentiment Analysis
As can be observed from the above table, deep learning models tend to yield better classification accuracy across many Arabic sentiment analysis datasets than standard machine-learning classifiers such as Logistic Regression and Linear SVC. This empirically supports the claim that deep-learning approaches have become the state of the art in the SA field. Therefore, our direction in this research article is to employ and fine-tune several deep learning models for better classification performance in Arabic sentiment analysis.
3.2 Benchmarking and Open Source Resources
The emergence of deep learning approaches has been boosted by the existence of open-source neural network libraries such as Keras, which attracted more researchers to work in this area. Nevertheless, it should also be emphasized that most deep learning approaches for Arabic sentiment analysis suffer from the limited availability of large-scale Arabic sentiment-annotated corpora for learning accurate models. Likewise, we also acknowledge the limited size of the Arabic sentiment treebank compared to English, as well as the increased algorithmic complexity due to the substantial increase in modalities embedded in the parser. This raises the importance of building on existing benchmark datasets to ease comparison and identify room for improvement.
In parallel to deep-learning approaches, one shall also mention the progress in metaheuristic approaches, including genetic algorithms, as in Abualigah [2018], who suggested an efficient metaheuristic algorithm for feature selection employing multi-objective hybrid Krill-Herd algorithm for document clustering.
In the area of data benchmarking, one notices the emergence of some useful, relatively large resources that can improve distributional models. For instance, Alayba et al. [2018] put forward a 1.5-billion-word corpus, using 10 newspapers from different Arab countries with different Arabic dialects, to generate a distributed representation that has been used for sentiment classification purposes.
4 PROPOSED SYSTEM
4.1 Motivation
The proposed system builds on the merits of deep learning methods as the currently acknowledged state-of-the-art approach for sentiment analysis, as pointed out in the previous section. More specifically, our approach advocates the use of the attention mechanism, which has become one of the core technologies in deep learning after its huge success in neural machine translation [Yang et al. 2016]. It follows the intuition of human visual attention, where a viewer focuses on a certain region of an image in "high resolution" while perceiving the surrounding image in "low resolution," adjusting the focal point over time. The main motivations for the choice of the attention mechanism in our proposal are fourfold. First, the inherent property of the attention mechanism to focus on the most salient parts of the input space, instead of encoding the full sentence length as in typical RNN architectures, can substantially enhance the optimization performance of the deep learning model.
Second, given the sparse structure and the complexity of the Arabic language morphology and its semantics, as pointed out in the introduction section of this article, any incremental gain in optimization will be very welcomed and crucial for the overall system performance.
Third, the attention weights can be used as a tool to interpret the behavior of the associated neural network architecture, which is notoriously difficult to comprehend, thereby adding an interpretability dimension to the underlying neural architecture. Fourth, the success of attention-mechanism architectures in sentiment analysis in various languages [Yang et al. 2016], including Arabic [Al-Dabet and Tedmori 2019], provides a good indicator of the potential merits and promises of this research direction.
Reviewing the existing architectures applied to Arabic sentiment analysis using the attention mechanism-based approach, one distinguishes the deep attention-based review level sentiment analysis put forward by Almani and Tang [2020]. Their model uses a multi-layer architecture in the following way: First, the embedding layer passes the distributed word representation of the input textual review to the GRU-based layer to produce a hidden review representation. Second, a soft attention layer is embedded on the top of the GRU layer to perform the sum of GRU hidden representations according to the generated weights of each word and output a distributed vector representation of the input review according to the identified salient words. Third, the review vector representation outputted by the attention layer is fed to a fully connected sigmoid logistic regression layer to generate the final polarity classification. In parallel, Al-Dabet and Tedmori [2019] used a three-layer attention mechanism architecture where the first layer corresponds to the word-embedding created using a Wikipedia dataset whose vector representation is fed to a simple recurrent model, which utilizes LSTM and Gated Recurrent Units (GRU). The output of the recurrent model is then fed to the attention layer, which produces, for each input sentence, a weighted sum of the attention weights and the words’ hidden vectors, which are then passed to a sigmoid layer that performs the binary classification (positive versus negative polarity) task.
In comparison to the above architectures, our model is rather close to Yang et al. [2016], where a Bidirectional GRU reads the individual sentence input from left to right and vice versa to capture the contextual information, and the results of the two passes are then concatenated. Besides, for learning long and short dependencies, we used two kinds of recurrent gated networks, namely LSTM and GRU, where an empirical approach was employed to choose between the LSTM and GRU layers according to their training performance on some experimental data. The motivation for using such a training approach is rooted in the well-known vanishing and exploding gradient phenomenon when using backpropagation with RNN structures. At the input embedding layer, we used two types of embeddings. The first consists of the FastText embeddings, which enable us to overcome the out-of-vocabulary difficulty observed with the commonly employed word2vec distributional representation [Mikolov et al. 2013a, 2013b]. The second uses a learnable embedding approach that is inferred from the training samples and varies at each time increment. The motivation for doing so is to capture the increasing variations of Arabic language structures when dealing with sentiment, as well as the potential limitations of the pre-trained models employed in generating FastText. Similar to Al-Dabet and Tedmori [2019], we also employed a sigmoid layer to perform the binary classification (positive/negative polarity). The gradients of our model are backpropagated through both the Bidirectional recurrent net and the attention blocks, so that the different parameters are updated at each iteration. Furthermore, the model takes special care of the preprocessing stage, where noisy terms can play a central role in guiding the sentiment polarity. For instance, the removal of negation characters can turn a negative (respectively, positive) statement into a positive (respectively, negative) one.
The next subsection details the different components of the architecture of our model.
4.2 Overall Architecture
Figure 3 summarizes the main components of our architecture. After a preprocessing stage and appropriate data representation, the architecture includes an encoding section, which encapsulates the various embeddings whose outcome is fed to a Bidirectional RNN (BiLSTM or BiGRU) layer. The latter is then fed to the attention layer, which is, in turn, fed to the classification module that assigns the contextual representations of the input sequences to their corresponding classes using a sigmoid activation function.
Fig. 3. Generic flow graph of the proposed system.
More formally, the input sequences are tokenized and padded according to the value of MaxLen (maximum sentence length) to ensure coherence, then passed to the embedding layer. This generates an embedding for each received token, which is then fed to the Bidirectional RNN, whose forward and backward passes effectively capture and represent the contextual information. The encoded vectors are then passed to the attention layer, which assigns a weight to each token according to its contribution to a given sentiment polarity class. The attention layer's output is then passed to the feedforward layer, which, in turn, passes it to the sigmoid layer that assigns each sequence to its corresponding class.
4.3 Detailed Architecture
4.3.1 Preprocessing and Normalization.
Data collected from social media and online blogs are often accompanied by unwanted characters due to the presence of misspellings, URLs, and hashtags, among others, which can negatively impact the output of natural language processing modules. This holds mainly because the datasets used originate from both MSA and Arabic dialects, which contain many unstructured and noisy constructs. Therefore, the preprocessing stage helps reduce the impact of such implicit noise. After conducting several trials, we found that the following preprocessing pipeline enhanced both the quality of the training of the employed deep learning models and the accuracy results, while reducing the vocabulary size:
• Remove unstructured and unmatched Arabic diacritics,
• Remove URLs, hashtags, and special characters,
• Remove both Indian and Arabic digits,
• Remove characters repeated more than twice in succession within the same sequence, keeping only two occurrences, e.g., <جمييييل> becomes <جمييل>,
• Normalize unstructured Arabic letters often employed in DA to their classical MSA form (e.g., the Alif variants <إأٱآ> are replaced by the unique form <ا>).
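The pipeline above can be sketched with a few regular expressions. This is a minimal illustration in Python, not the authors' actual code; the function and pattern names are our own, and the special-character handling is simplified:

```python
import re

# Hypothetical sketch of the preprocessing steps listed above.
DIACRITICS = re.compile(r'[\u064B-\u0652\u0670]')          # harakat, tanween, shadda, sukun
URL_OR_TAG = re.compile(r'https?://\S+|www\.\S+|#\w+')     # URLs and hashtags
DIGITS = re.compile(r'[0-9\u0660-\u0669]')                 # Arabic (Western) and Indian digits
REPEATS = re.compile(r'(.)\1{2,}')                         # 3+ successive repetitions
ALIF_VARIANTS = re.compile(r'[\u0625\u0623\u0671\u0622]')  # إ أ ٱ آ

def preprocess(text: str) -> str:
    text = URL_OR_TAG.sub(' ', text)
    text = DIACRITICS.sub('', text)
    text = DIGITS.sub('', text)
    text = REPEATS.sub(r'\1\1', text)          # keep at most two successive occurrences
    text = ALIF_VARIANTS.sub('\u0627', text)   # normalize to the bare Alif <ا>
    return ' '.join(text.split())
```

For instance, `preprocess('جمييييل')` yields `'جمييل'`, matching the elongation-reduction example above.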
4.3.2 Encoder Module.
As a first step, we obtain the embeddings of the different tokens in each corpus. In our study, we used two sets of embeddings. The first one, referred to as learnable embeddings, consists of word embeddings learned from the training data, while the second one consists of FastText pretrained embeddings [Mikolov et al. 2018] trained using CBOW model on a large Arabic corpus from Wikipedia, where the surrounding contexts are used to predict the target word.
More specifically, for the learnable embeddings, we used the Keras Embedding layer,2 which is initialized with random weights that are then updated via gradient descent during training for all words in the training dataset. Our implementation assumes a maximum of 150 tokens per post; a padding strategy was employed to accommodate the varying lengths of individual posts. We also set the size of the generated embedding vector to 150 per individual token.
From an implementation perspective, in the case of FastText embeddings, we created the vocabulary index of each dataset (detailed in the next section), and each token index \(X_i\) was mapped to the associated FastText embedding vector. At the same time, we prevented the model from updating the embedding matrix during training.
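The mapping from vocabulary index to pretrained vector can be sketched as follows. This is an illustrative sketch, not the authors' exact code: `pretrained` stands in for actual FastText lookups (here a toy dictionary), and the 150-dimensional size follows the paper's setting:

```python
import numpy as np

EMB_DIM = 150  # embedding size used in the paper

# Toy stand-in for FastText vectors (assumption for illustration only).
pretrained = {'جميل': np.ones(EMB_DIM), 'كتاب': np.full(EMB_DIM, 0.5)}

def build_embedding_matrix(word_index, pretrained, dim=EMB_DIM):
    # Row 0 is reserved for padding; out-of-index words keep a zero row.
    matrix = np.zeros((len(word_index) + 1, dim), dtype=np.float32)
    for word, idx in word_index.items():
        vec = pretrained.get(word)
        if vec is not None:
            matrix[idx] = vec
    return matrix

word_index = {'جميل': 1, 'كتاب': 2, 'مجهول': 3}
emb = build_embedding_matrix(word_index, pretrained)
```

In Keras, such a matrix would typically be supplied to the Embedding layer with `trainable=False`, matching the frozen-embeddings choice described above.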
The embeddings (both FastText and trainable) are then passed to the Bidirectional recurrent network block to obtain the annotations of the embedded tokens.
In our study, we used two sets of gated recurrent networks, Bidirectional Long Short-Term Memory (BiLSTM) and Bidirectional Gated Recurrent Units (BiGRU), to learn long and short dependencies. Specifically, a simple heuristic voting strategy was employed to choose between BiGRU and BiLSTM for each input sequence. This voting scheme relies on the training performance obtained by BiLSTM and BiGRU alone, which are also employed as baseline classifiers, for the same input sequence. Such a voting scheme, although simple, traces back to the theory of dynamic classifier selection; see Alceu S. Britto [2014] for an overview.
Loosely speaking, this follows the spirit of the majority-voting classification scheme of multi-classifier systems [Kuncheva 2004], where BiLSTM and BiGRU act as individual classifiers and the information about accuracy and epoch training time serves to determine the weight of each classifier. Therefore, if the training accuracy of BiLSTM (respectively, BiGRU) is higher than that of BiGRU (respectively, BiLSTM), then the weight of BiLSTM (respectively, BiGRU) is deemed more important. Otherwise, if the training accuracies of the two classifiers are close to each other, then the classifier with the shortest epoch training time is favored.
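The selection heuristic just described can be sketched as a small function. This is our own minimal reading of the scheme, with an assumed tie-breaking tolerance `tol` that the paper does not specify:

```python
# Hedged sketch of the BiLSTM-vs-BiGRU voting heuristic described above.
# Prefer the model with higher training accuracy; when accuracies are close
# (within `tol`, an assumed threshold), break the tie by epoch training time.
def select_model(acc_lstm, time_lstm, acc_gru, time_gru, tol=0.005):
    """Return 'BiLSTM' or 'BiGRU' given training accuracies and epoch times."""
    if abs(acc_lstm - acc_gru) <= tol:           # accuracies close: favor speed
        return 'BiLSTM' if time_lstm < time_gru else 'BiGRU'
    return 'BiLSTM' if acc_lstm > acc_gru else 'BiGRU'
```

For example, `select_model(0.95, 300, 0.90, 100)` returns `'BiLSTM'` (clear accuracy edge), while near-equal accuracies fall back to the faster model.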
4.3.3 Modelling.
More formally, at a given time stamp \(t\), given an input text sequence \(X^t = (X_1^t, X_2^t,\ldots ,X_n^t)\), the Bidirectional recurrent network block uses two parallel RNN layers that process the output of the embeddings (context vector) from left to right, \(\overrightarrow{H^t} = (\overrightarrow{H}_{1},\overrightarrow{H}_{2},\overrightarrow{H}_{3},\ldots ,\overrightarrow{H}_{n})\), and from right to left, \(\overleftarrow{H^t} = (\overleftarrow{H}_{1},\overleftarrow{H}_{2},\overleftarrow{H}_{3},\ldots ,\overleftarrow{H}_{n})\), making the model able to use context from previous and later timesteps, such that \(H^t=(H_1,H_2,\ldots ,H_n)\) where \(H_i = [\overrightarrow{H_i},\overleftarrow{H_i}]\).
In short, \(H^t\) summarizes the neighbor posts (sentences) around the post encapsulated by the input sequence \(X^t\). Linking the current state to the previous state, the BiLSTM processes the encoded word vectors as follows, assuming an input sequence of fixed size \(n\): (1) \(\begin{equation} \overrightarrow{H}_{i,LSTM}= \overrightarrow{LSTM}(X_i^t,\overrightarrow{H}_{i-1}),\ i \in [1,n], \end{equation}\) (2) \(\begin{equation} \overleftarrow{H}_{i,LSTM}= \overleftarrow{LSTM}(X_i^t, \overleftarrow{H}_{i-1}),\ i \in [n,1]. \end{equation}\)
The same reasoning applies to the BiGRU layer, when BiGRU encoding is employed instead of BiLSTM: (3) \(\begin{equation} \overrightarrow{H}_{i,GRU}= \overrightarrow{GRU}(X_i^t,\overrightarrow{H}_{i-1}),\ i \in [1,n], \end{equation}\) (4) \(\begin{equation} \overleftarrow{H}_{i,GRU}= \overleftarrow{GRU}(X_i^t, \overleftarrow{H}_{i-1}),\ i \in [n,1]. \end{equation}\)
The output then consists of the concatenation of the forward and backward layers: (5) \(\begin{equation} {H}_{LSTM} = [\overrightarrow{H}_{LSTM}, \overleftarrow{H}_{LSTM}] \end{equation}\) (6) \(\begin{equation} {H}_{GRU} = [\overrightarrow{H}_{GRU}, \overleftarrow{H}_{GRU}] \end{equation}\)
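The bidirectional encoding of Equations (3)–(6) can be illustrated with a toy NumPy GRU cell run in both directions and concatenated. This is only a data-flow sketch under our own assumptions (random weights, tiny dimensions), not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 3                                   # input dim, hidden dim (toy sizes)
Wz, Wr, Wh = (rng.normal(size=(H, D)) for _ in range(3))
Uz, Ur, Uh = (rng.normal(size=(H, H)) for _ in range(3))
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h):
    z = sigmoid(Wz @ x + Uz @ h)              # update gate
    r = sigmoid(Wr @ x + Ur @ h)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
    return (1 - z) * h + z * h_tilde

def bigru(X):                                 # X: (n, D) token embeddings
    fwd, bwd, h = [], [], np.zeros(H)
    for x in X:                               # left-to-right pass, Eq. (3)
        h = gru_step(x, h); fwd.append(h)
    h = np.zeros(H)
    for x in X[::-1]:                         # right-to-left pass, Eq. (4)
        h = gru_step(x, h); bwd.append(h)
    # Concatenate per-token forward and backward states, Eq. (6).
    return np.concatenate([np.array(fwd), np.array(bwd)[::-1]], axis=1)

H_out = bigru(rng.normal(size=(5, D)))        # 5 tokens -> annotations of size 2*H
```

Each row of `H_out` is the annotation \(H_i = [\overrightarrow{H_i},\overleftarrow{H_i}]\) for one token.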
4.3.4 Attention Block.
The attention mechanism used in our study is mainly inspired by the hierarchical attention architecture proposed in Yang et al. [2016], which has a two-level hierarchical structure: word-level attention and sentence-level attention. In our approach, we only considered the sentence-level attention, which follows the same spirit as in Yang et al. [2016]. In essence, the attention mechanism assigns a hidden representation \(u_i\) to the annotation \(h_i\) (generated from either the BiLSTM or BiGRU encoder) of the \(i\)th sentence, and the associated weight \(\alpha _i\) is constructed as the normalized version (using the softmax function) of the similarity between the \(i\)th hidden representation \(u_i\) and a sentence-level context vector \(U_s\), as follows: (7) \(\begin{equation} \mathit {u_i} = \tanh (W_s h_i+b_s), \end{equation}\) (8) \(\begin{equation} \alpha _i=\frac{\exp ({u_{i}}^T {U_s})}{\sum _t\exp ({u_{t}}^T {U_s})} , \end{equation}\) (9) \(\begin{equation} v = \sum \limits _{i} \alpha _i h_i, \end{equation}\) where \(\mathit {v}\) is the document vector that summarizes all the information of the sentences in a document, and \(W_s\) and \(b_s\) are the weight and bias of the attention layer, respectively. The sentence-level context vector \(U_s\) can be seen as a high-level representation of a fixed query, “what is the informative sentence,” over the whole set of sentences. The context vector \(U_s\) is randomly initialized and jointly learned during the training process.
In Equation (9), a weighted sum of the sentence annotations based on the learned weights is computed as a representative of all the sentences of the document. See Figure 4 for a detailed graphical representation of the BiGRU additive-attention model (a similar representation applies to the BiLSTM attention model).
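Equations (7)–(9) can be traced step by step in NumPy. In this sketch the parameters \(W_s\), \(b_s\), and \(U_s\) are random (in the model they are learned jointly with the network); dimensions are toy values of our choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
n, H = 5, 6                                    # number of sentences, annotation size
h = rng.normal(size=(n, H))                    # encoder annotations h_i
W_s, b_s = rng.normal(size=(H, H)), rng.normal(size=H)
U_s = rng.normal(size=H)                       # sentence-level context vector

u = np.tanh(h @ W_s.T + b_s)                   # Eq. (7): hidden representations u_i
scores = u @ U_s                               # similarities u_i^T U_s
alpha = np.exp(scores) / np.exp(scores).sum()  # Eq. (8): softmax attention weights
v = (alpha[:, None] * h).sum(axis=0)           # Eq. (9): document vector
```

By construction the weights `alpha` are positive and sum to 1, so `v` is a convex combination of the annotations, weighted by each sentence's estimated informativeness.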
Fig. 4. The proposed BiGRU additive-attention model.
4.3.5 Classification Module.
We pass the output vector to a non-linear activation function to assign each vector to its corresponding class. This is performed using a simple sigmoid layer with two neurons (one for the positive sentiment and one for the negative sentiment). The output of this layer is a probability distribution that sums to 1, and the neuron with the higher probability value determines the label of the sentence.
5 EXPERIMENTATION
5.1 Datasets
Deep learning models usually perform well on large-scale datasets [Goodfellow et al. 2016], but collecting and annotating such data can be challenging and time-consuming. Therefore, we used some of the existing Arabic datasets for sentiment analysis. The data selection criteria are based on data size (large enough to make the reasoning sound) and popularity within the Arabic NLP research community. For this purpose, we used three large-scale Arabic sentiment analysis datasets whose reviews were written in MSA (Modern Standard Arabic) and Arabic dialects.
5.1.1 LABR Dataset.
The Large-scale Arabic Book Reviews dataset (LABR) [Aly and Atiya 2013] compiles a list of Arabic book reviews and comprises 63,257 reviews taken from the Goodreads website. In this work, we used the binary annotated version (positive or negative labels) of the dataset, which contains 51,056 reviews. The annotations were made based on the reviewers’ ratings, such that 4–5 star reviews were labeled as positive, 1–2 star reviews were labeled as negative, and reviews of 3 stars were dropped.
5.1.2 HARD Dataset.
The Hotel Arabic-Reviews Dataset [Elnagar et al. 2018] is a large-scale Arabic dataset that is widely used in the Arabic sentiment analysis research community. The dataset contains 105,698 Arabic reviews of hotels, written in both MSA and Arabic dialects and collected from Booking.com. In this study, we used the balanced binary version of the Elnagar et al. [2018] dataset, which contains 52,849 reviews for each of the positive and negative classes.
5.1.3 BRAD Dataset.
BRAD [Elnagar and Einea 2016] is a large-scale annotated dataset of almost 510,600 book records in the Arabic language, where each record corresponds to a single review and the reviewer’s rating on a scale of 1 to 5 stars. We used the balanced version of BRAD [Elnagar and Einea 2016], which contains 156k reviews evenly split between positive and negative.
Table 3 summarizes the main statistics of the aforementioned three datasets.
5.2 Baselines
To validate the performance of our model, we first selected a set of commonly employed architectures to form our baselines, as alternatives to the attention-mechanism-based sentiment analysis architecture. The choice of these models is motivated by the desire to isolate the contribution of the attention-mechanism module alone. Therefore, we selected deep learning architectures with an encoding layer and a fully connected layer that performs the classification task, but without an attention-mechanism module. This consists of the following:
• Bidirectional RNNs. More specifically, we chose two models (without the attention layer): BiGRU and BiLSTM. Both models were trained with the same hyperparameters to ensure a fair comparison.
• We trained these models on different embedding configurations using the learnable embeddings extracted from each dataset, and we compared the results to the performance (with regard to accuracy and training time) obtained with the FastText pretrained embeddings [Mikolov et al. 2018] for the Arabic language.
• Across all experiments, we used five performance metrics: accuracy, micro-Precision, micro-Recall, micro-F1, and epoch training time: \(\begin{equation*} Accuracy =\frac{TP + TN}{TP+FP+TN+FN} \end{equation*}\) \(\begin{equation*} Precision =\frac{TP}{TP + FP} \end{equation*}\) \(\begin{equation*} Recall=\frac{TP}{TP + FN} \end{equation*}\) \(\begin{equation*} F_{1}=\frac{2 \times Precision \times Recall}{Precision + Recall} \end{equation*}\) \(\begin{equation*} Micro\text{-}Precision=\frac{TP_{1}+ TP_{2} + \ldots +TP_{n}}{TP_{1}+ \ldots +TP_{n} + FP_{1}+ \ldots +FP_{n}} \end{equation*}\) \(\begin{equation*} Micro\text{-}Recall=\frac{TP_{1}+ TP_{2} + \ldots +TP_{n}}{TP_{1}+ \ldots +TP_{n} + FN_{1}+ \ldots +FN_{n}} \end{equation*}\) \(\begin{equation*} Micro\text{-}F_{1}=\frac{2 \times Micro\text{-}Precision \times Micro\text{-}Recall}{Micro\text{-}Precision + Micro\text{-}Recall}, \end{equation*}\) where TP is the number of positive sequences (input sequences whose sentiment class is positive) that were also predicted as positive by the model, TN is the number of negative sequences that were classified as negative, FP is the number of negative sequences wrongly classified as positive, and FN is the number of positive sequences wrongly classified as negative. The higher the precision value, the more accurate the prediction of the positive class. Similarly, a high recall value indicates that a high number of sentences are assigned to their exact class. The F1-measure is the harmonic mean of Precision and Recall, summarizing the ratio of correctly classified sentences regardless of their class. Epoch training time corresponds to the time the model took to train and validate on a single epoch (a single pass over the dataset).
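The binary metrics above follow directly from the confusion counts. A minimal illustrative helper (our own, not the authors' evaluation code):

```python
# Compute Accuracy, Precision, Recall, and F1 from raw confusion counts
# (binary case), matching the formulas given above.
def metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Example with made-up counts: 40 true positives, 10 false positives,
# 45 true negatives, 5 false negatives.
acc, prec, rec, f1 = metrics(tp=40, fp=10, tn=45, fn=5)
```

The micro-averaged variants simply pool the per-class counts before applying the same formulas.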
5.3 Hyperparameter Tuning
Deep learning algorithms, including BiLSTM, BiGRU, and the attention mechanism, have various hyperparameters that control and affect the training behavior, memory allocation, execution time, and even model performance. A common practice is to perform a grid search on a small finite set of parameters to select optimal values [Goodfellow et al. 2016]; such a search ensures that the selection is not opportunistic. Therefore, we performed a grid search that trains the associated model(s) for every joint specification of hyperparameter values on the four model configurations (BiLSTM and BiGRU (baselines), BiLSTM-Attention, and BiGRU-Attention) with each of the three datasets. The hyperparameter configuration that achieved the minimal validation error was then chosen as the best hyperparameter set (see Table 4).
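The grid search can be sketched as an exhaustive loop over the Cartesian product of candidate values, keeping the configuration with minimal validation error. Here `train_and_validate` is a hypothetical stand-in for the real training loop (a toy scoring function of our own), and the candidate values mirror a subset of Table 4:

```python
from itertools import product

# Candidate hyperparameter values (subset of Table 4).
grid = {
    'dropout': [0.3, 0.2, 0.4],
    'lr': [0.0001, 0.01, 0.001],
    'batch_size': [128, 64, 250],
}

def train_and_validate(cfg):
    # Placeholder validation error: pretend dropout=0.3 and lr=0.001 are best.
    # In reality this would train the model and return its validation loss.
    return abs(cfg['dropout'] - 0.3) + abs(cfg['lr'] - 0.001)

# Exhaustively evaluate every joint specification and keep the best one.
best_cfg = min(
    (dict(zip(grid, values)) for values in product(*grid.values())),
    key=train_and_validate,
)
```

In the paper, this loop is run for each of the four model configurations on each of the three datasets.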
| Hyperparameters | Values |
|---|---|
| Dropout rate | 0.3, 0.2, 0.4 |
| Learning rate | 0.0001, 0.01, 0.001 |
| Optimization algorithm | RMSprop, Adam, Adagrad |
| EarlyStopping (monitoring validation loss) | 10 epochs |
| Batch size | 128, 64, 250 |
| Number of recurrent cells | 250, 128 |
| Number of epochs | 25 |
Table 4. Summary of Hyperparameter (HP) Selection. The bold item is the chosen HP.
5.4 Results
5.4.1 Training and Accuracy on Sentiment Dataset.
After training our baselines with the best hyperparameter set highlighted in Table 4, we compared the outcomes of each baseline model using learned embeddings and FastText embeddings on each dataset. The results in terms of Accuracy, Precision, and F1-score on the three sentiment datasets are summarized in Table 5 for the LABR and HARD datasets and in Table 6 for the BRAD dataset. A graphical illustration of the training time of the various models when using FastText and learned embeddings is provided in Figure 5, while the accuracy is highlighted in Figure 6.
Fig. 5. Training time per epoch for each model on different datasets.
Fig. 6. Classification report of our baselines compared to our model.
| Model | Embeddings | LABR Acc(%) | LABR MA-F1(%) | LABR MA-Prec(%) | LABR MA-Rec(%) | LABR Time(s) | HARD Acc(%) | HARD MA-F1(%) | HARD MA-Prec(%) | HARD MA-Rec(%) | HARD Time(s) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BiGRU | Learned Embed. | 83.22 | 90.62 | 83.30 | 96.16 | 261 | 94.80 | 95.10 | 96.30 | 96.31 | 449 |
| BiGRU | FastText Embed. | 83.04 | 89.1 | 86.03 | 96.38 | 25 | 94.84 | 95.59 | 96.21 | 96.38 | 175 |
| BiLSTM | Learned Embed. | 84.2 | 89.10 | 89.90 | 95.96 | 26 | 94.11 | 94.92 | 96.31 | 96.38 | 175 |
| BiLSTM | FastText Embed. | 85.83 | 90.10 | 90.03 | 96.10 | 265 | 94.77 | 95.32 | 96.38 | 96.38 | 456 |
| Our model | Learned Embed. | 95.71 | 96.3 | 96.81 | 96.88 | 23 | 95.79 | 95.91 | 96.88 | 96.1 | 169 |
| Our model | FastText Embed. | 95.73 | 97.13 | 97.41 | 97.10 | 258 | 96.29 | 96.28 | 97.03 | 96.14 | 436 |
Table 5. Sentiment Analysis Results for the LABR and HARD Datasets
We first looked at the training performance of the two baseline models, BiGRU and BiLSTM, and found that BiGRU tends to outperform BiLSTM, possibly because of its simpler architecture, as illustrated in Tables 5 and 6. Besides, the use of pretrained embeddings made the training of the gated networks smoother and much faster. Therefore, we trained our additive-attention model on the BiGRU configuration and regularly monitored the model’s performance.
| Model | Embeddings | Acc(%) | MA-F1(%) | MA-Prec(%) | MA-Rec(%) | Train. Time(s) |
|---|---|---|---|---|---|---|
| BiGRU | Learnable embed. | 93.23 | 91.12 | 94.12 | 94.32 | 372 |
| BiGRU | FastText | 93.23 | 91.10 | 94.19 | 94.25 | 163 |
| BiLSTM | FastText | 92.88 | 92.17 | 94.10 | 93.87 | 374 |
| BiLSTM | Learnable embed. | 93.05 | 92.08 | 97.13 | 93.69 | 165 |
| Our model | FastText | 95.10 | 95.94 | 96.73 | 96.96 | 165 |
| Our model | Learnable embed. | 95.65 | 96.15 | 97.21 | 97.10 | 375 |
Table 6. Sentiment Analysis Results for the BRAD Dataset
One notices, for instance, that our approach achieves 14.9%, 1.4%, and 2.5% improvement over the best baseline in the case of the LABR, HARD, and BRAD datasets, respectively. The results in terms of Micro-Precision, Micro-Recall, and Micro-F1 scores reported in Tables 5 and 6 also indicate that the same trend of superiority of our model holds.
Besides, to situate our results within the state of the art on the three datasets, Table 7 compares the accuracy of our model with that of previous works using the same datasets.
| Dataset | Previous works | Result (Acc) | Our Model (Acc) |
|---|---|---|---|
| HARD [Elnagar et al. 2018] | SVM [Elnagar et al. 2018] | 92.7% | 96.29% |
| | LR [Elnagar et al. 2018] | 93.5% | |
| | ULMFiT [Eljundi et al. 2019] | 95.7% | |
| | AraBERT [Baly et al. 2020] | 96.1% | |
| | mBERT [Devlin et al. 2019] | 95.7% | |
| LABR [Aly and Atiya 2013] | SVM [ElSahar and El-Beltagy 2015] | 78.3% | 95.62% |
| | AraBERT [Baly et al. 2020] | 89.6% | |
| | Multi-chan CNN [Dahou et al. 2019] | 87.5% | |
| | SRU-Attention [Al-Dabet and Tedmori 2019] | 95.1% | |
| BRAD [Elnagar and Einea 2016] | SVM [Altowayan and Tao 2016] | 81.27% | 95.65% |
| | SVM [Elnagar and Einea 2016] | 85% | |
| | LR [Elnagar and Einea 2016] | 84.4% | |
Table 7. Comparison of Sentiment Analysis Results on Different Datasets
The results highlight that the developed model outperforms the state of the art, sometimes by a large margin (e.g., 12.5% in the case of the BRAD dataset). In the same table, we also report the results of a non-deep-learning model, the support vector machine (SVM) classifier. The choice of this classifier is motivated by the fact that it outperforms all the machine learning (non-deep-learning) classifiers implemented in the Orange Data Mining library.3 As can be noticed, our model also outperforms SVM by a large margin.
5.5 Extension to Language and Task-independent Tasks
To test the generalization capability of our developed model, we experimentally show that it can be extended beyond the traditional Arabic sentiment analysis tasks conducted in the previous sections. Therefore, we applied and validated our model on other publicly available datasets associated with hate speech detection, news classification, Russian sentiment analysis, and English sentiment analysis. From the implementation perspective, it is worth mentioning that we changed both the preprocessing (including the parser type, to accommodate Russian and English language processing) and the output layer, so as to have the same number of classes as in the used datasets and to accommodate the inherent properties of each dataset type. The same was done at the embedding layer, where we used embedding weights for the language of the dataset (English/Russian) and trained our model architectures using the embeddings of that specific language. For the Arabic news categorization dataset, we used the same FastText embeddings as in the SA experiments. Unlike the work of Barhoumi et al. [2020], in our experiments we did not pretrain word embeddings on the datasets, nor did we transfer the embeddings across different experiments or languages. In each case, we also compared the performance of our model with that of BiGRU (baseline) and some state-of-the-art models. The results demonstrate the generalization capability of the developed model and the possibility of going beyond the language barrier. Next, we detail each of the aforementioned tasks.
5.5.1 Hate Speech Dataset in English.
We used the Toxic Comment Classification dataset available in a Kaggle competition.4 The dataset contains 24,802 tweets labeled as normal or hate speech. The hate-speech class distinguishes toxic, severe_toxic, obscene, threat, insult, and identity_hate. We compared our model with a BiGRU model. Besides, following Fan et al. [2021], we also compared against a set of state-of-the-art transformer architectures: multilingual BERT, which is trained on 104 languages using Wikipedia text and the MLM technique [Devlin et al. 2019]; RoBERTa, an optimized version of BERT [Liu et al. 2019]; and DistilBERT, which is trained by distilling Google’s BERT and reducing the number of its parameters by 40%.
Prior to using the dataset, we applied the following preprocessing steps:
• Lowercase all tokens.
• Reduce text elongation.
• Word filtering: the dataset contains offensive words described using a set of abbreviations; we mapped these abbreviations to their complete word forms.
5.5.2 News Classification Dataset.
We used the Khaleej subset of the SANAD dataset [Einea et al. 2019], which contains 47k news articles corresponding to seven classes. The task consists of assigning a long Arabic news article to one of the seven labels. No preprocessing stage was applied to this dataset, in line with the method of Einea et al. [2019]. Besides, in addition to the baseline BiGRU model, we report results for other state-of-the-art methods, namely the CNN and HANGRU architectures, as pointed out in Elnagar et al. [2019].
5.5.3 Sentiment Analysis in Russian Language.
We used a dataset provided in a Kaggle competition.5 The dataset contains around 10k news items corresponding to positive, negative, and neutral classes. As in the previous case, no preprocessing stage was applied to the dataset. For illustration, we compared our results with two state-of-the-art transformer architectures: multilingual BERT and RoBERTa.
5.5.4 IMDB English Sentiment Dataset.
The IMDB dataset is a binary sentiment classification dataset consisting of (sometimes long) movie reviews retrieved from the large-scale IMDB movie database. The dataset contains 25,000 training documents, 25,000 test documents, and 50,000 unlabeled documents [Maas et al. 2011]. There is a 1:1 ratio between negative and positive documents among the labeled documents. No preprocessing stage was applied to the dataset. For illustration purposes, we compared our results with several state-of-the-art methods: W-Neural-BON Ensemble [Li et al. 2016], TGNR Ensemble [Li et al. 2017], TopicRNN [Dieng et al. 2017], BERT large finetune UDA [Xie et al. 2020], and NB-weighted-BON+Cosine Similarity [Thongtan and Phienthrakul 2019].
Table 8 summarizes the results of the application of our model on the above four distinct tasks.
| Task | Language | Model | Accuracy (%) |
|---|---|---|---|
| Sentiment Analysis | Russian | BiGRU | 69 |
| Sentiment Analysis (IMDB dataset) | English | BiGRU | 87.30 |
| Hate speech detection | English | BiGRU | 96 |
| News categorization [Einea et al. 2019] | Arabic | BiGRU [Einea et al. 2019] | 75.93 |
Table 8. Comparison of Our Model on Different Tasks in Various Languages
The results highlighted in the table clearly show that our developed model outperforms our BiGRU baseline, sometimes by a large margin, while it competes closely with state-of-the-art models in most of the tasks. The only exception is perhaps the Arabic news categorization task, where both CNN and HANGRU show a clear superiority. Nevertheless, it should be mentioned that the implementations of CNN and HANGRU are not available and no details were provided about the structure of these models.
5.6 Statistical Evaluation
We would like to determine whether the superiority of our model on the three datasets (LABR, HARD, and BRAD) is statistically significant. For this purpose, we ran a statistical test (t-test) to check whether the population of results corresponding to our model differs from the population of results corresponding to the second-ranked model. The null hypothesis of the test assumes that the means of the two populations are equal; therefore, rejecting the null hypothesis indicates that the two populations of results are different, which means that the ordering of the two results is statistically significant. Besides, since our claim of superiority is directional, a one-tailed t-test was employed. To create the populations, we used various randomizations of the training samples to create slightly distinct realizations of our model and the alternative model. Table 9 summarizes the results of the t-test. At the 95% confidence level, the null hypothesis is rejected in all cases \((p\lt 0.0001)\), which indicates that the initial statement of the superiority of our model is statistically significant.
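The significance check can be illustrated with a two-sample (Welch) t statistic over accuracy scores from repeated runs. The numbers below are synthetic placeholders of our own, not the paper's actual populations; the implementation is a minimal NumPy sketch:

```python
import numpy as np

# Synthetic accuracy scores over 5 randomized runs (illustrative values only).
ours  = np.array([0.962, 0.960, 0.963, 0.961, 0.964])   # our model
other = np.array([0.948, 0.947, 0.950, 0.946, 0.949])   # second-ranked model

def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances)."""
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(va + vb)

t = welch_t(ours, other)
# A |t| far above the critical value for these sample sizes at the 95% level
# leads to rejecting the null hypothesis of equal means.
```

In practice, the corresponding p-value would be read from the t distribution (e.g., via `scipy.stats`); well-separated populations like these yield a very small p-value.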
6 DISCUSSIONS AND IMPLICATIONS
6.1 Significance of the Results
• Unlike traditional methods presented by Elnagar and Einea [2016] and ElSahar and El-Beltagy [2015], which investigated the Arabic sentiment analysis problem with standard TF-IDF and N-gram features, our approach proposes a new deep learning model that uses a combination of FastText and learned embeddings with an attention-layer architecture to handle the ASA problem. The results show a clear improvement over the state-of-the-art results presented in the reference papers of the employed datasets, which testifies to the feasibility and attractiveness of the proposed method in terms of accuracy.
• The comparison between BiGRU and BiLSTM showed that the former is less parametrized, resulting in a smaller training time than that of the BiLSTM model. This result agrees with other findings in the literature (see, e.g., Yue Han and Jing [2020] and the related IEEE Access special issue).
• The experimental results revealed that the attention-based model gives significantly better results than recurrent-net-based models without an attention mechanism.
• The implemented attention mechanism offers the possibility to zoom in on specific wordings or constructs according to their contribution to the sentiment polarity. This provides the foundation for an enhanced visualization toolkit that offers some explanation of the findings.
• The use of pretrained embeddings decreased the training time, because no learning takes place in the embedding layer. This is highlighted by the comparison between FastText and learned embeddings, where the gain in training time can be on the order of 2 to 10 times (as in the case of the LABR dataset). Nevertheless, this does not mean that the faster-trained model yields better accuracy.
• As pointed out in Table 7, our developed model outperformed all previous works on the three datasets, including state-of-the-art models such as AraBERT [Baly et al. 2020]. Without questioning the sound theoretical and empirical foundations of AraBERT, we believe that the fine-tuning of the weights brought by the attention mechanism, together with the initial setup strategy, played a central role in enhancing the results of our approach.
• The conducted experiments strongly suggest that incorporating the attention mechanism with BiGRU can improve the training efficiency of neural network models. The mechanism, by allowing the model to attend to only parts of sequences and extract meaningful information, can effectively detect a text’s sentiment.
• When incorporated with BiGRU/BiLSTM networks, the attention mechanism highlights the essential parts of the entire sequence for the sentiment analysis task.
• It would have been interesting to compare the performance of our model with other attention-mechanism-based models mentioned in the Related Work section. Nevertheless, the absence of open-source implementations and of a clear Arabic-language focus renders the reproduction of such implementations quite difficult and possibly unfaithful.
• The testing of our model on different languages and tasks showed remarkable performance, which indicates that the model is language- and task-independent, testifying to its generalization capabilities.
6.2 Research Implications
• | The developed attention mechanism architecture that can be accommodated with minimum changes to distinct NLP tasks offers a nice opportunity for the research community to handle several tasks simultaneously. This corresponds to a substantial shift from a single task model, which substantially dominated the current practice in natural language processing and machine learning tasks. | ||||
• | The finding that learnable embedding provides better results than large-scale FastText embeddings may challenge the global trend and thought that such large-scale embeddings (e.g., FastText, Word2vec, GloVe) are the preferred choice in the design of deep learning models. | ||||
• | The use of a simple voting scheme for choosing between BiLSTM and BiGRU layers in the developed model, although widely under-investigated in deep learning literature, should be taken with caution, where more elaborated schemes can also be implemented. Indeed, this can be cast under the general framework of a combination of independent classifiers or dynamic classifier selection. In this respect, several elaborated metrics have been put forward to guide such selection or hybridization scheme, which includes individual classifier’s accuracy, ranking, probabilistic-based-measures, behavior-based measures, and diversity-related measures, among others [P.R. Cavalin 2013]. Such measures can be defined on either the whole feature space or according to some predefined partition. The input space can also be partitioned according to other rational criteria. So far, our model uses the whole input post to dynamically choose between BiLSTM and BiGRU. Nevertheless, other researchers have considered a partition that involves linguistic constructs, attempting to look at whether, e.g., BiLSTM is better when the post contains negation, and/or, adjective/advert, location-named entity, and question mark, among others. | ||||
• As an alternative to dynamic classification systems, the voting strategy can also be cast into the framework of meta-learning theory [Vilalta and Drissi 2002], where the individual classifiers are considered separate learners and the outcome of the voting scheme the meta-learning result. Closely linked to this viewpoint are the meta-features that can be set up initially to guide the performance of the individual classifiers. In particular, it is worth identifying whether one can distinguish individual features that yield better performance for BiGRU and those that do so for BiLSTM. Other potential extensions consist of creating several versions of BiLSTM and BiGRU by applying several learning algorithms to the same dataset and using a stacking framework for a weighted combination, yielding a more elaborate voting scheme.
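As a minimal sketch of the weighted-combination idea mentioned above (the probabilities and accuracies are made-up values, not results from the paper), the class probabilities of two base learners can be blended with weights proportional to their validation accuracies:

```python
# Sketch of a weighted stacking combination: the class-probability vectors
# of two base learners are averaged with weights proportional to their
# validation accuracies. All numbers are illustrative assumptions.

def stacked_probs(p_a, p_b, acc_a, acc_b):
    """Accuracy-weighted average of two probability vectors."""
    w_a = acc_a / (acc_a + acc_b)
    w_b = 1.0 - w_a
    return [w_a * a + w_b * b for a, b in zip(p_a, p_b)]

# Hypothetical outputs of BiLSTM-like and BiGRU-like learners on one input:
probs = stacked_probs([0.7, 0.3], [0.4, 0.6], acc_a=0.91, acc_b=0.94)
label = max(range(len(probs)), key=probs.__getitem__)
print(probs, label)
```

A full stacking framework would instead fit a meta-learner on the base learners' held-out predictions rather than using fixed accuracy-derived weights.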
• The use of the attention mechanism, through its weighting of the individual input tokens, offers a natural setting for contributing to explainable AI, a field that is expected to grow substantially in the near future, as per the new EU Artificial Intelligence policy, for instance.6
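The explainability idea above rests on reading attention weights as token-level importances. A minimal sketch (the tokens and raw scores are invented; in the actual model the scores come from the additive-attention layer):

```python
import math

# Turn raw attention scores into normalized token-importance weights via
# softmax, then read off the most influential token. Scores are made up
# for illustration.

def softmax(scores):
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

tokens = ["the", "hotel", "was", "wonderful"]
scores = [0.1, 0.8, 0.2, 2.5]            # hypothetical attention scores
weights = softmax(scores)
top = max(range(len(tokens)), key=weights.__getitem__)
print(tokens[top])                        # token the model attended to most
```

Highlighting such high-weight tokens in a user interface is one simple way the attention weights can support explanation of a predicted polarity.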
7 CONCLUSION
Recurrent neural networks have brought a breakthrough to the NLP field, where most state-of-the-art and recent studies on sentiment analysis rely on RNN and deep learning architectures. Attention layers have also attracted growing interest as a means to enhance the explainability and interpretability of the results. This study contributes to this field with a focus on Arabic sentiment analysis, where a new attention-mechanism-based deep learning approach has been put forward and tested on publicly available datasets.
The proposed model employs two types of embeddings, FastText and learnable embeddings. Besides, a bidirectional GRU/LSTM, which reads each input sentence from left to right and vice versa to capture contextual information, was employed. For learning long- and short-term dependencies, we used two recurrent gated networks, BiLSTM and BiGRU, and devised an ad hoc hybridization strategy through a simple voting scheme that exploits the training quality and performance of the individual BiLSTM and BiGRU models on each input.
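The additive (Bahdanau-style) attention applied on top of the bidirectional encoder can be sketched numerically as follows. Dimensions, the random states standing in for BiLSTM/BiGRU outputs, and variable names are illustrative assumptions, not the paper's exact layer:

```python
import numpy as np

# Additive attention over a sequence of bidirectional hidden states H (T x d):
#   score_t = v^T tanh(W h_t),  weights = softmax(scores),
#   context = sum_t weights_t * h_t.
# In the model, H would come from the BiLSTM/BiGRU layer; here it is random.

rng = np.random.default_rng(0)
T, d = 5, 8                      # sequence length, hidden size (illustrative)
H = rng.normal(size=(T, d))      # stand-in for bidirectional hidden states
W = rng.normal(size=(d, d))      # learned projection (random here)
v = rng.normal(size=(d,))        # learned scoring vector (random here)

scores = np.tanh(H @ W) @ v      # (T,) unnormalized attention scores
weights = np.exp(scores - scores.max())
weights /= weights.sum()         # softmax over time steps
context = weights @ H            # (d,) attention-weighted summary vector
```

The context vector is what a downstream dense layer would consume to produce the sentiment prediction, while the weights themselves are the per-token importances discussed under research implications.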
The study also showcases the performance of the various embedding strategies.
A comparison of the baseline models revealed that the simpler GRU cell yields better training performance than the LSTM cell.
Our model has been tested on three widely employed large-scale Arabic sentiment datasets: LABR, HARD, and BRAD. The testing demonstrated that our model outperforms both the baseline models and the state-of-the-art models reported in the original references of these datasets. Furthermore, to demonstrate the generalization capabilities of our model, its performance on news categorization, offensive speech detection, and Russian sentiment analysis tasks was also evaluated, with remarkable results.
Future work
In terms of future work, we believe that there is room for further fine-tuning the parameters of the model to enhance its performance and for providing a more theoretical foundation for the hybridization mechanism. For instance, recent advances in the application of meta-heuristics to parameter optimization, e.g., Laith et al. [2021b], Laith and Ali [2021], Laith et al. [2021a], can provide insights for better fine-tuning of the deep-learning parameters of our attention-mechanism-based architecture, or for optimizing post-processing stages as in Abualigah [2018] and Abualigah et al. [2016]. Besides, the attention weights can be further explored to guide interpretability and explainability mechanisms, which can in turn support the development of appropriate user interfaces.
The results on the News Classification dataset revealed that our model is outperformed by several other state-of-the-art deep-learning architectures. This probably highlights the importance of the preprocessing stage, which has not been detailed in the related studies and was therefore kept minimal in our case; there is thus room for improvement in this regard. It may also point to a potential vulnerability of the architecture when extending from binary-class to multi-class settings.
Footnotes
1 https://www.kaggle.com/c/arabic-sentiment-analysis-2021-kaust.
2 https://keras.io/layers/embeddings/.
3 https://orange3.readthedocs.io/projects/orange-data-mining-library/en/latest/index.html.
4 https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge.
5 https://www.kaggle.com/c/sentiment-analysis-in-Russian.
6 https://digital-strategy.ec.europa.eu/en/policies/strategy-artificial-intelligence.
References
- 2020. DAICT: A dialectal Arabic irony corpus extracted from Twitter. In Proceedings of the 12th Language Resources and Evaluation Conference. European Language Resources Association, 6265–6271. https://aclanthology.org/2020.lrec-1.768.
- 2021. DziriBERT: A pre-trained language model for the Algerian dialect. arXiv preprint arXiv:2109.12346 (2021).
- 2021. ARBERT & MARBERT: Deep bidirectional transformers for Arabic. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, 7088–7105.
- 2020. NADI 2020: The first nuanced Arabic dialect identification shared task. In Proceedings of the 5th Arabic Natural Language Processing Workshop. Association for Computational Linguistics, 97–110. https://www.aclweb.org/anthology/2020.wanlp-1.9.
- 2013. Arabic sentiment analysis: Lexicon-based and corpus-based. In Proceedings of the IEEE Jordan Conference on Applied Electrical Engineering and Computing Technologies (AEECT). 1–6.
- 2021. Enhancing contextualised language models with static character and word embeddings for emotional intensity and sentiment strength detection in Arabic tweets. Procedia Comput. Sci. 189 (2021), 258–265.
- 2019. Mazajak: An online Arabic sentiment analyser. In Proceedings of the 4th Arabic Natural Language Processing Workshop. Association for Computational Linguistics, 192–198.
- 2019. LSTM-CNN deep learning model for sentiment analysis of dialectal Arabic. In Proceedings of the International Conference on Arabic Language Processing. Springer Science and Business Media LLC, 108–121.
- 2018. Feature selection and enhanced Krill Herd algorithm for text document clustering.
- 2016. A Krill Herd algorithm for efficient text documents clustering.
- 2018. Emojis-based sentiment classification of Arabic microblogs using deep recurrent neural networks. In Proceedings of the International Conference on Computing Sciences and Engineering (ICCSE). IEEE, 1–6.
- 2019. Sentiment analysis for Arabic language using attention-based simple recurrent unit. In Proceedings of the 2nd International Conference on New Trends in Computing Sciences (ICTCS). IEEE, 1–6.
- 2021. Overview of the Arabic sentiment analysis 2021 competition at KAUST. CoRR abs/2109.14456 (2021).
- 2018. A combined CNN and LSTM model for Arabic sentiment analysis. In Proceedings of the International Cross-domain Conference for Machine Learning and Knowledge Extraction. Springer, 179–191.
- 2014. Dynamic selection of classifiers: A comprehensive review. Pattern Recog. 47 (2014), 3665–3680.
- 2020. Deep attention-based review level sentiment analysis for Arabic reviews. In Proceedings of the 6th Conference on Data Science and Machine Learning Applications (CDMA). IEEE, 47–53.
- 2016. Word embeddings for Arabic sentiment analysis. In Proceedings of the IEEE International Conference on Big Data (Big Data). IEEE, 3820–3825.
- 2013. LABR: A large scale Arabic book reviews dataset. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. 494–498.
- 2020. AraBERT: Transformer-based model for Arabic language understanding. In Proceedings of the 4th Workshop on Open-source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection. European Language Resources Association, 9–15. https://aclanthology.org/2020.osact-1.2.
- 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR). 1–15.
- 2020. Toward qualitative evaluation of embeddings for Arabic sentiment analysis. In Proceedings of the 12th Language Resources and Evaluation Conference. European Language Resources Association, 4955–4963. https://aclanthology.org/2020.lrec-1.610.
- 2021. ASAD: A Twitter-based benchmark Arabic sentiment analysis dataset. arXiv:2011.00578 [cs.CL].
- 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5, 2 (1994), 157–166.
- 2018. Sentiment analysis in Arabic: A review of the literature. Ain Shams Eng. J. 9, 4 (2018), 2479–2490.
- 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Proceedings of the NIPS Workshop on Deep Learning.
- 2019. Multi-channel embedding convolutional neural network model for Arabic sentiment classification. ACM Trans. Asian Low Resour. Lang. Inf. Process. 18 (2019), 1–41.
- 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 4171–4186.
- 2017. TopicRNN: A recurrent neural network with long-range semantic dependency. In Proceedings of the 5th International Conference on Learning Representations.
- 2019. SANAD: Single-label Arabic news articles dataset for automatic text categorization. Data Brief 25 (2019), 104076.
- 2017. NileTMRG at SemEval-2017 task 4: Arabic sentiment analysis. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval'17). Association for Computational Linguistics, 790–795.
- 2019. hULMonA: The universal language model in Arabic. In Proceedings of the 4th Arabic Natural Language Processing Workshop. 68–77.
- 2018. ArSAS: An Arabic speech-act and sentiment corpus of tweets. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC). 1–6.
- 2016. BRAD 1.0: Book reviews in Arabic dataset. In Proceedings of the IEEE/ACS 13th International Conference of Computer Systems and Applications (AICCSA). 1–8.
- 2019. Automatic text tagging of Arabic news articles using ensemble deep learning models. In Proceedings of the 2nd Workshop on Multilingualism at the Intersection of Knowledge Bases and Machine Translation. https://aclanthology.org/W19-7409.pdf.
- 2018. Hotel Arabic-reviews dataset construction for sentiment analysis applications. 35–52.
- 2015. Building large Arabic multi-domain resources for sentiment analysis. In Proceedings of the International Conference on Intelligent Text Processing and Computational Linguistics (CICLing). 23–34.
- 2021. Social media toxicity classification using deep learning: Real-world application UK Brexit. Electronics 11, 10 (2021), 1332.
- 2009. Arabic natural language processing: Challenges and solutions. ACM Trans. Asian Lang. Inf. Process. 8, 4 (2009), 1–22.
- 2021. A comparative study of effective approaches for Arabic sentiment analysis. Inf. Process. Manag. 58, 2 (2021), 102438.
- 2019. Implementation of machine learning algorithms in Arabic sentiment analysis using N-gram features. Procedia Comput. Sci. 154 (2019), 332–340.
- 2016. Deep Learning. Vol. 1. MIT Press, Cambridge, MA.
- 2021. Arabic natural language processing: An overview. J. King Saud Univ. Comput. Inf. Sci. 33, 2 (2021), 497–505.
- 2018. Sentiment analysis of Arabic tweets using deep learning. Procedia Comput. Sci. 142 (2018), 114–122.
- 1997. Long short-term memory. Neural Computat. 9, 8 (1997), 1735–1780.
- 2019. A comprehensive survey of deep learning for image captioning. Comput. Surv. 51, 6 (2019), 1–36.
- 2021. Overview of the WANLP 2021 shared task on sarcasm and sentiment detection in Arabic. In Proceedings of the 6th Arabic Natural Language Processing Workshop. 296–305.
- 2016. A review paper on deep web data extraction using WordNet. Int. Res. J. Eng. Technol. 3, 3 (2016), 1003–1006.
- 2020. Morphological analysis and disambiguation for Gulf Arabic: The interplay between resources and methods. In Proceedings of the 12th Language Resources and Evaluation Conference. 3895–3904.
- 2014. Convolutional neural networks for sentence classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 1746–1751.
- 2004. Combining Pattern Classifiers: Methods and Algorithms. Wiley, NY.
- 2021. Advances in sine cosine algorithm: A comprehensive survey. Artif. Intell. Rev. 54 (2021), 2567–2608.
- 2021a. The arithmetic optimization algorithm. Comput. Meth. Appl. Mech. Eng. 376, 113609 (2021).
- 2021b. Aquila optimizer: A novel meta-heuristic optimization algorithm. Comput. Industr. Eng. 157, 107250 (2021).
- 2017. Neural bag-of-ngrams. In Proceedings of the 31st AAAI Conference on Artificial Intelligence. 3067–3074.
- 2016. Weighted neural bag-of-N-grams model: New baselines for text classification. In Proceedings of the 26th International Conference on Computational Linguistics. 1591–1600.
- 2016. Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Cambridge University Press.
- 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692 [cs.CL].
- 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. 142–150.
- 2016. A proposed lexicon-based sentiment analysis approach for the vernacular Algerian Arabic. Res. Comput. Sci. 110 (2016), 55–70.
- 2015. Machine translation experiments on PADIC: A parallel Arabic dialect corpus. In Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation. 26–34.
- 2013a. Efficient estimation of word representations in vector space. http://arxiv.org/abs/1301.3781.
- 2018. Advances in pre-training distributed word representations. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC).
- 2013b. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 26 (2013).
- 2020. An experimental study on sentiment classification of Algerian dialect texts. Procedia Comput. Sci. 176 (2020), 1151–1159.
- 2015. ASTD: Arabic sentiment tweets dataset. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2515–2519.
- 2020. Deep learning CNN-LSTM framework for Arabic sentiment analysis using textual information shared in social networks. Soc. Netw. Anal. Mining 10, 1 (2020), 1–13.
- 2020. A review of sentiment analysis research in Arabic language. Fut. Gen. Comput. Syst. 112 (2020), 408–430.
- 2013. Dynamic selection approaches for multiple classifier systems. Neural Comput. Applic. 22 (2013), 673–688.
- 2019. Language models are unsupervised multitask learners. Tech. Rep. OpenAI 1 (2019), 1–24.
- 2019. SCIA at SemEval-2019 task 3: Sentiment analysis in textual conversations using deep learning. In Proceedings of the 13th International Workshop on Semantic Evaluation. 297–301.
- 2017. SemEval-2017 task 4: Sentiment analysis in Twitter. In Proceedings of the 11th International Workshop on Semantic Evaluation. Association for Computational Linguistics, 502–518.
- 1997. Bidirectional recurrent neural networks. IEEE Trans. Sig. Process. 45, 11 (1997), 2673–2681.
- 2018. Sentiment analysis of users on social networks: Overcoming the challenge of the loose usages of the Algerian dialect. Procedia Comput. Sci. 142 (2018), 26–37.
- 2019. Sentiment classification using document embeddings trained with cosine similarity. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop. 407–414.
- 2017. Attention is all you need. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NeurIPS). https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
- 2002. A perspective view and survey of meta-learning. Artif. Intell. Rev. 18, 2 (2002), 77–95.
- 2020. Unsupervised data augmentation. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS'20).
- 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Proceedings of the Conference on Advances in Neural Information Processing Systems. 5753–5763.
- 2016. Hierarchical attention networks for document classification. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 1480–1489.
- 2020. Aspect-level drug reviews sentiment analysis based on double BiGRU and knowledge transfer. IEEE Access 20 (2020), 21314–21325.
- 2011. Combining lexicon-based and learning-based methods for Twitter sentiment analysis. HP Laboratories, Tech. Rep. HPL-2011-89 (2011), 1–8.
- 2014. Natural Language Processing of Semitic Languages. Springer, Berlin.