Arabic Sentiment Analysis with Noisy Deep Explainable Model

Sentiment Analysis (SA) is an indispensable task for many real-world applications. Compared to low-resourced languages (e.g., Arabic, Bengali), most SA research is conducted for high-resourced languages (e.g., English, Chinese). Moreover, the reasons behind the predictions of Arabic sentiment analysis methods that exploit advanced artificial intelligence (AI)-based approaches are black-box-like and quite difficult to understand. This paper proposes an explainable sentiment classification framework for the Arabic language by introducing a noise layer into Bidirectional Long Short-Term Memory (BiLSTM) and Convolutional Neural Network (CNN)-BiLSTM models to overcome the overfitting problem. The proposed framework can explain specific predictions by training a local surrogate explainable model to understand why a particular sentiment (positive or negative) is predicted. We carried out experiments on public benchmark Arabic SA datasets. The results show that adding noise layers improves sentiment analysis performance for the Arabic language by reducing overfitting, and that our method outperforms some well-known state-of-the-art methods. In addition, the introduced explainability together with the noise layer makes the model more transparent and accountable, and hence can help in adopting AI-enabled systems in practice.


Introduction
Online social media platforms have become increasingly popular, leading to the emergence of various fields dedicated to analyzing these platforms and their content in order to extract useful information for individuals [1]. Sentiment analysis (SA) is one of them. It is a branch of Natural Language Processing (NLP) concerned with identifying the feelings expressed in texts.
SA began to thrive with the emergence of the Web, which made it possible to include interactive content. This implies that people are free to upload any kind of content, including their own ideas and beliefs. SA can be used to investigate this enormous volume of raw text data in order to provide a concise summary of what the public believes about a specific topic, a product, or any other matter of opinion [2,3].
Prior research on sentiment analysis has mostly focused on high-resourced languages. Since Arabic is not a high-resourced language, it still receives less attention than high-resourced languages. Earlier methods for Arabic Sentiment Analysis (ASA) depended on sentiment lexicons such as ArSenL [4], a large-scale MSA word lexicon. Various options for analyzing Arabic-specific data were examined using recurrent and recursive neural networks [5,6,7]. Natural language processing tasks gained a new dimension in terms of accuracy after the introduction of word-embedding techniques. Textual representations with pre-trained word embeddings trained on multiple large corpora were used for sentiment analysis tasks. Dahou et al. [8] trained a CNN with semantic representations from word embeddings for analyzing sentiment on Arabic text. Farha et al. [9] proposed a hybrid model for ASA, employing LSTMs for sequence and context interpretation and CNNs for feature extraction. Then a BERT-based model for Arabic language representation, AraBERT, was proposed for many Arabic language-specific tasks including ASA [10].
However, these models work like a black box: even developers and AI practitioners do not fully understand the causes of a specific prediction (positive or negative). This lack of transparency is a drawback for adopting an ASA system in real-world applications. Although some work exists that explains SA predictions using XAI tool-kits for rich languages such as English [11,12], to the best of our knowledge there is no work on Arabic sentiment where XAI is utilized to explain the reasons behind the predictions of complex models. In addition, deep learning (DL)-based models often overfit due to an insufficient amount of training data [13], and the Arabic SA task is no exception. This reduces the models' ability to determine the sentiment of low-resource languages like Arabic, Bengali, Hindi, etc.
To tackle these concerns, we propose a new interpretable Arabic sentiment classification framework by adding a Gaussian noise layer to the DL-based models. We develop and train two DL models for sentiment classification: a Bidirectional LSTM (BiLSTM) and a CNN-BiLSTM (a CNN layer followed by a BiLSTM layer). The experimental results indicate that adding a noise layer helps to resolve the overfitting problem of these models in SA. To explain a particular prediction of our sentiment classification framework, we adopted LIME (Local Interpretable Model-agnostic Explanations), a prominent XAI method that can explain the predictions of any sentiment classifier in an interpretable and transparent manner by learning an interpretable surrogate model locally around the prediction [11]. For experimental purposes, we employed publicly available datasets, including the Large Arabic Book Review (LABR) [14] and an Arabic Hotel Review (HTL) dataset [15]. The experimental results verify our claims and mitigate the above-mentioned concerns, reducing the potential overfitting problem for DL-based ASA models. The contributions of this paper can be summarized as follows:
1. We propose two different DL-based methods with an added noise layer for Arabic SA to reduce overfitting with improved performance.
2. To the best of our knowledge, this is the initial endeavor towards enhancing the explainability of Arabic sentiment classification models.
3. Our method consistently achieves competitive performance compared to state-of-the-art approaches in Arabic SA, and it can be applied to other regional Arabic languages.
The rest of the paper is organized as follows: We survey related literature on ASA in Section 2.
Then we present our method with explainability in Section 3. In Section 4, we discuss the findings from the experiments. Finally, Section 5 concludes with some future plans.

Literature Review
Recently, like other languages, Arabic SA has gained the attention of the research community [1]. Farra et al. [16] worked on SA utilizing Arabic sentence structure with grammatical and lexicon-based approaches. Then Abdul-Mageed et al. [17] proposed methods to identify the subjectivity and sentiment of standard Arabic. In the following years they proposed a corpus for sentiment analysis and a system to detect the sentiment of social media posts. In these works, the authors utilized a large set of features for experiments with machine learning algorithms [18]. Shoukry et al. [19] classified the sentiment of Egyptian Arabic tweets using SVM and NB classifiers. Later, they measured the performance of different machine learning models on preprocessed (i.e., stemmed, stop-word-removed, and normalized) tweets [20]. Duwairi et al. [21] also employed classical ML models, including NB, K-Nearest Neighbors (KNN), and SVM classifiers, to perform SA on Jordanian Arabic tweets. Nayel et al. [22] employed a classical machine learning algorithm, the Support Vector Machine (SVM), for sentiment and sarcasm detection.
However, the use of DL methods is less common in Arabic SA compared to English. An LSTM-CNN model is utilized by Sarah [23] for Arabic text to classify the two imbalanced classes (of four) in the ASTD dataset. Similarly, a CNN model is used by [24] with the Stanford segmenter for tweet tokenization and normalization. They applied the CNN model to the ASTD dataset with a word-embedding model. Heikal et al. [25] proposed a method combining CNN and LSTM models with a pre-trained word-embedding model to predict the sentiment of tweets. Some more prominent works on ASA also came out of workshops and shared tasks [26]. Hengle et al. [27] combined context-free and contextualized representations for Arabic sarcasm detection and sentiment analysis. Word-embedding models were also applied in multiple works [28,29,30,31,32], including sentiment analysis, textual similarity estimation, and intent mining.
As BERT (Bidirectional Encoder Representations from Transformers) based models show very promising performance in English SA, Oueslati et al. [33] presented an Arabic language-specific universal language model (ULM), hULMonA by fine-tuning multi-lingual BERT (mBERT).
To evaluate the ULM, they collected a benchmark dataset for SA. Safaya et al. [34] proposed an ArabicBERT model by utilizing a pre-trained BERT model (bert-base-arabic) with CNNs.
Another BERT-based Arabic language representation model, AraBERT, was developed by [10] to improve the state of the art in many Arabic natural language understanding tasks. Wadhawan et al. [35] tried to classify sentiment using a segmentation-based method and two transformer-based methods. Husain et al. [36] hypothesized that tweets are more likely to contain offensive content when their sentiment is positive or negative. Therefore, they fine-tuned AraBERT using offensive language for Arabic sarcasm detection and sentiment analysis.
However, XAI has been applied to SA for some languages like English and Chinese [11,37]. Though many works have been done on ASA, they did not employ XAI to explain the reasons for specific predictions of the ML or DL models used. Adding a noise layer to DL-based models also helps reduce overfitting and eventually enhances the performance of the models [38,39]. To date, there is no work on ASA where a noise layer is added to the DL models to reduce overfitting and improve performance together with XAI.

Proposed Method
This section presents our proposed method for Arabic sentiment classification, which utilizes DL and provides explanations that attempt to highlight the reasons for each prediction. Initially, the Arabic reviews serve as input for the DL methods. Subsequently, the reviews undergo a preprocessing phase, where special characters are removed. Following this, a word tokenizer is employed to extract the list of words from the reviews. Arabic words are then assigned indexes, resulting in a sequential representation that mirrors the order of words in the reviews. In situations where some reviews have fewer words, a padding technique is applied to ensure uniform sequence lengths. Consequently, the padded sequences of the reviews are obtained from the preprocessing phase. BiLSTM and CNN-BiLSTM models are trained using the padded sequences of the reviews. The performance of these trained models is evaluated using test data. Finally, to explain specific predictions, a locally trained surrogate model is employed, utilizing LIME. An overview of the overall framework is depicted in Fig. 1.
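The preprocessing steps described above (word tokenization, index assignment, and padding) can be sketched in plain Python. This is an illustrative stand-in for the Keras Tokenizer and pad_sequences utilities, not the exact pipeline used in our experiments:

```python
# Minimal sketch of the preprocessing phase: word tokenization,
# index assignment, and pre-padding to a fixed length.

def build_index(reviews):
    """Assign an integer index to each unique word (0 is reserved for padding)."""
    index = {}
    for review in reviews:
        for word in review.split():
            if word not in index:
                index[word] = len(index) + 1
    return index

def to_padded_sequences(reviews, index, maxlen):
    """Map words to indices and left-pad shorter reviews with zeros."""
    sequences = []
    for review in reviews:
        seq = [index[w] for w in review.split() if w in index]
        seq = seq[:maxlen]
        sequences.append([0] * (maxlen - len(seq)) + seq)
    return sequences

reviews = ["good hotel very good", "bad service"]   # toy placeholder reviews
index = build_index(reviews)
padded = to_padded_sequences(reviews, index, maxlen=5)
# Every sequence now has the same length, ready for the embedding layer.
```

In practice, the index would be built from the training reviews only and capped at the vocabulary size (10,000 words in our setup).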

Classification Models
In the BiLSTM model, an Embedding layer is followed by a bidirectional LSTM layer and a global max pooling layer. In the CNN-BiLSTM model, an additional convolutional layer is introduced before the BiLSTM layer. Subsequently, two dense layers, each paired with a dropout layer, are appended after the max pooling layer in both models. To mitigate overfitting, a noise layer is incorporated just before the final output layer in both cases. The inclusion of the noise layer is motivated by the need to address overfitting in small neural networks trained with limited training data [38]. Inspired by this approach, we integrate a Gaussian noise layer into both the BiLSTM and CNN-BiLSTM models [38,39]. The details of each layer are given in Section 4.2. The computation of the surrogate model can be defined as follows:
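To make the role of the noise layer concrete, the following NumPy sketch mimics the behaviour of a Gaussian noise layer (as in Keras' GaussianNoise): zero-mean noise is injected into the activations during training, and the layer is an identity at inference time. The toy activation batch and the stddev value below are illustrative assumptions:

```python
import numpy as np

def gaussian_noise_layer(x, stddev=0.75, training=True, rng=None):
    """Add zero-mean Gaussian noise during training only.

    Mirrors the behaviour of a Keras GaussianNoise layer: the noise acts
    as a regularizer while training and is a no-op at inference time.
    """
    if not training:
        return x
    rng = np.random.default_rng() if rng is None else rng
    return x + rng.normal(loc=0.0, scale=stddev, size=x.shape)

features = np.ones((4, 8))                       # toy activation batch
noisy = gaussian_noise_layer(features, training=True)
clean = gaussian_noise_layer(features, training=False)
```

Because the corrupted activations change on every training pass, the network cannot latch onto exact feature values, which is what discourages overfitting.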

ξ(x) = argmin_{g ∈ G} L(f, g, π_x) + Ω(g)

where g represents an explanation model for an instance x, G represents the family of candidate explanation models, f is the original model (i.e., our BiLSTM model), L is the loss function measuring how closely g approximates f in the locality defined by the proximity measure π_x, and Ω(g) is the complexity of the explanation model g. LIME explains local predictions of the model.
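This objective can be illustrated with a hypothetical one-dimensional sketch (the black-box function, kernel width, and sample count below are toy assumptions, not our actual models): perturb the instance x, weight the perturbed samples by a proximity kernel π_x, and fit a weighted linear surrogate g.

```python
import numpy as np

def black_box(x):
    """Stand-in for a trained scoring model f (not the paper's model)."""
    return 1.0 / (1.0 + np.exp(-3.0 * x))        # a smooth nonlinear score

def local_surrogate(f, x0, num_samples=500, width=0.5, seed=0):
    """Fit a weighted linear model g around x0, LIME-style."""
    rng = np.random.default_rng(seed)
    xs = x0 + rng.normal(0.0, 1.0, num_samples)   # perturbed neighbours of x0
    ys = f(xs)
    w = np.exp(-((xs - x0) ** 2) / width ** 2)    # proximity kernel pi_x
    X = np.stack([np.ones_like(xs), xs], axis=1)  # intercept + slope features
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ ys)  # weighted least squares
    return beta                                   # [intercept, local slope]

beta = local_surrogate(black_box, x0=0.0)
# beta[1] recovers a positive local slope for this increasing black box.
```

The fitted coefficients play the role of the word weights LIME reports for a text classifier: they describe f only in the neighbourhood of the explained instance.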

Dataset
For experiment and verification purposes, we apply our methods to two Arabic benchmark datasets for sentiment analysis: LABR [14] and the Hotel Review dataset (HTL) [15].
LABR Dataset: The Large-scale Arabic Book Review (LABR) dataset contains 63257 reviews, each with a rating of 1 (one) to 5 (five), covering 2131 books by 16486 users. Reviews with ratings 4 and 5 are considered positive (1) and reviews with ratings 1 and 2 are considered negative (0) [14]. We eliminate the reviews with rating 3 as they indicate neutral reviews [14].
This leaves 51056 reviews for our experiments, among which 42832 are positive (1) and 8224 are negative (0). The unequal number of positive and negative reviews makes it an imbalanced dataset. Finally, 40844 reviews are used for training and the remaining 10212 for testing.
Hotel Review dataset (HTL): This dataset contains 15572 reviews written in the Arabic language, covering 8100 hotels reviewed by 13K users [15]. Among them, 10766 reviews are positive, 2645 are negative, and the rest are neutral.
The number of positive reviews is almost four times the number of negative reviews, which makes it an imbalanced dataset. To obtain a balanced (and smaller) dataset, we randomly select only 2645 positive reviews. Finally, 3967 reviews are used for training and the remaining 1323 reviews for testing.
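The balancing step amounts to random undersampling of the majority (positive) class. A minimal sketch with placeholder reviews and labels (not the actual HTL data):

```python
import random

def balance_by_undersampling(reviews, labels, seed=42):
    """Keep all negatives; randomly sample positives down to the same count."""
    positives = [r for r, y in zip(reviews, labels) if y == 1]
    negatives = [r for r, y in zip(reviews, labels) if y == 0]
    rng = random.Random(seed)
    positives = rng.sample(positives, k=len(negatives))
    balanced = [(r, 1) for r in positives] + [(r, 0) for r in negatives]
    rng.shuffle(balanced)
    return balanced

reviews = [f"review {i}" for i in range(100)]   # placeholder reviews
labels = [1] * 80 + [0] * 20                    # 4:1 imbalance, as in HTL
balanced = balance_by_undersampling(reviews, labels)
```

Fixing the seed makes the selection reproducible across runs, which matters when reporting train/test counts as we do above.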

Experimental setup
We conducted experiments using different settings to evaluate the performance of our models.
For each proposed model, we considered three variations, each with specific characteristics:
• Model_ND: includes both the noise and dropout layers before the output (final) layer.
• Model_N: contains only the noise layer before the output layer, with no dropout layer immediately before the noise layer.
• Model_D: does not include the noise layer but incorporates the dropout layer.
For the BiLSTM model, the architecture follows the overview depicted in Figure 1. In this model, we used a vocabulary of 10,000 unique words, representing each word with a 100-dimensional vector using the embedding layer of the Keras library. Subsequently, four fully connected (dense) layers were implemented, containing 128, 64, 32, and 1 neurons, respectively. All dense layers used the ReLU activation function, except for the last layer, which employed the Sigmoid activation function.
For the CNN-BiLSTM model, a one-dimensional convolutional layer with a ReLU activation function and a kernel size of 3 was included. A dropout layer with a rate of 0.5 was used to randomly drop 50% of the features during training. Additionally, a Gaussian noise layer with a standard deviation of 0.75 was employed. The models were compiled using the Adam optimizer with the binary cross-entropy loss function, and accuracy was used as the evaluation metric. For training and testing, 80% of the reviews were used for training, while the remaining 20% were used for testing, with a batch size of 64. Each variant of the DL-based models was trained for a total of 10 epochs. Finally, a local interpretable surrogate model was trained to mimic the original proposed models and provide explanations for specific predictions.

Experimental results
Tables 1 and 2 present the performance of our BiLSTM and CNN-BiLSTM models in determining Arabic sentiment, using different evaluation metrics on the LABR and HTL datasets, respectively. As LABR is an imbalanced dataset, evaluation metrics beyond accuracy are used to assess the performance of our proposed models. Conversely, as HTL is a balanced dataset, the performance of the introduced BiLSTM and CNN-BiLSTM models is evaluated solely based on accuracy.
We have two DL-based models, each with three different setups. When the noise layer is added along with the dropout layer before the last layer of the models, they exhibit improved performance and reduced overfitting. Table 1 supports this claim by showing that when the noise layer is absent before the output layer (Model_D), the overfitting (O.fit) is 12%. However, after adding the noise layer before the output layer (Model_ND), the overfitting decreases by 2% and 1% for the BiLSTM and CNN-BiLSTM models, respectively. The precision for negative reviews in the BiLSTM model's Model_ND setup in Table 1 is 0.62, while for Model_D it is 0.56. This indicates that the noise layer helps to identify more negative reviews than the setup without the noise layer. The same trend is observed for the CNN-BiLSTM model. Table 2 illustrates that the overfitting (O.fit) of the BiLSTM model is 6.08% when the noise and dropout layers are included (Model_ND). Similarly, when the noise layer and dropout layer are integrated into the CNN-BiLSTM model, the overfitting is 5.07%, with training and testing accuracies of 99.85% and 94.78%, respectively. On the other hand, without the noise layer, the degree of overfitting increases. For example, in the CNN-BiLSTM model with the Model_D setup, the overfitting is 7.95%, which is approximately 2.88% higher than Model_ND (5.07%). These findings demonstrate the effectiveness of adding a noise layer to the models. When the dropout layer is removed before the output layer and the noise layer is added (Model_N), the gap between training and testing accuracy remains moderate, at 6.35% for BiLSTM and 6.27% for CNN-BiLSTM. These results further emphasize that adding a noise layer, either with or without the dropout layer, improves the performance of the models.
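The overfitting measure (O.fit) quoted above is simply the gap between training and testing accuracy. As a quick check with the reported CNN-BiLSTM Model_ND accuracies on HTL:

```python
def overfit_gap(train_acc, test_acc):
    """O.fit as used in the tables: train/test accuracy gap, in percentage points."""
    return round(train_acc - test_acc, 2)

# Reported CNN-BiLSTM (Model_ND) accuracies on the HTL dataset:
gap = overfit_gap(99.85, 94.78)
# gap == 5.07, matching the O.fit value quoted in the text
```

A smaller gap means the model generalizes better from the training reviews to unseen test reviews.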

Comparison with related works
Several prominent research works have been conducted on ASA using different datasets. Here, we compare the proposed method with some existing methods evaluated on the LABR dataset. Attention-BiGRU [40], SRU-Attention [41], and AraBERT [10] achieved higher accuracy than our proposed method, as these are attention-based DL models. Attention-BiGRU [40] employed a hybrid bidirectional gated recurrent unit (BiGRU) and bidirectional long short-term memory (BiLSTM) additive attention model with two types of embedding and achieved state-of-the-art results on the LABR dataset. SRU-Attention [41] used a simple recurrent unit with an attention mechanism and obtained an accuracy score of 95.1%. AraBERT [10], built especially for Arabic NLP, was introduced for different tasks including ASA; it achieved 89.6% accuracy, which is 1.6% more than our CNN-BiLSTM. Multilingual BERT (mBERT) was also employed for ASA [10]; its accuracy is 83%, which is 5% lower than our method. Moreover, LSTM, BiLSTM and CNN, and multi-channel CNN models were also utilized without a noise layer; their performance is quite promising but still lower than ours. These findings illustrate the significance of adding a noise layer to DL models. Though our proposed CNN-BiLSTM could not outperform the attention-based methods, it performed consistently across both datasets.

XAI in Arabic Sentiment Analysis
We trained a local explainable surrogate model to obtain an understandable representation of the predicted sentiments of the Arabic reviews. Let us consider a review and see how our proposed method performs on it. The review is as follows: " " (Translation: Good, the hotel is very good in all respects, and also the staff are very helpful, and the room service is appropriate for the quality of the room and the facilities were very wonderful. I admire this hotel as it is very quiet also. I may recommend this hotel to my friend to enjoy his stay in Bahrain.) Figs. 4 and 5 illustrate the human-interpretable representation of the predicted positive sentiment of the same review using Model_ND and Model_D, respectively. Fig. 4, which is obtained using Model_ND, highlights the words (e.g., (good), (very), (calm/quiet), (helpful)) that genuinely contribute to the positive sentiment of the review. On the other hand, LIME and Model_D interpret the same review in Fig. 5, which shows some major differences: for example, (good) and (helpful) are not highlighted as positive sentiment words, even though these words have a real impact on making the review positive. This phenomenon illustrates the effectiveness of adding a noise layer to the DL model to make it more explainable and acceptable.

Conclusion and Future Work
This paper proposed an explainable Arabic sentiment classification framework that introduces a noise layer into deep learning models, including BiLSTM and CNN-BiLSTM. Generally, DL-based models show overfitting characteristics when a small amount of data is used for training, which makes their generalization capability poor. That is why a Gaussian noise layer is added to the proposed models to reduce overfitting and enhance performance. The experimental results indicate that the noise layer helps to reduce the overfitting issue of the DL-based models and improves their performance. Moreover, these models predict sentiment in a black-box manner that is not understandable to humans. Therefore, to interpret the reasons for particular sentiment predictions, a locally explainable surrogate model, LIME, is employed for the first time in this paper for ASA. LIME provides easy-to-understand explanations that help users make sense of a prediction.
In the future, we plan to enhance the performance of current explainable AI algorithms for a better understanding of ASA and other Arabic NLP tasks. We would also like to employ federated learning in ASA.

Figure 1 :
Figure 1: Overview diagram of our proposed Arabic Sentiment Analysis method

Figure 2 :
Figure 2: Explanation of a particular class (Positive) using LIME.

Figure 3 :
Figure 3: Explanation of a particular class (Negative) using LIME.

Figure 4 :
Figure 4: Explanation of a review with Model_ND using LIME.

Figure 5 :
Figure 5: Explanation of a review with Model_D using LIME.
Explainable Surrogate Model with LIME

For explainability purposes, we trained a local surrogate model that mimics the classification performance of the original model. In this regard, we apply LIME (Local Interpretable Model-agnostic Explanations), introduced by [11]. LIME is a local surrogate model, i.e., a trained model used to explain the causes of the predictions of the underlying black-box complex model. It generates different versions of the data, tests what happens to the model's predictions, and utilizes this perturbed data as a training set instead of the initial training data. In other words, LIME creates a new dataset of permuted samples and trains an interpretable model on it, weighting the samples by their proximity to the instance being explained.

Table 1 :
Performance of BiLSTM and CNN-BiLSTM models on the imbalanced LABR dataset

Table 2 :
Performance of BiLSTM and CNN-BiLSTM models with different settings on the balanced HTL dataset