Explainable Machine Learning Models for Swahili News Classification

Although Swahili is one of the most widely spoken languages in Africa, challenges persist in utilizing it for Natural Language Processing (NLP) tasks, primarily due to the limited data available for building such systems. For instance, obtaining sufficient Swahili news data for classification remains a significant obstacle. This paper addresses the problem of accurate Swahili news classification by leveraging classical machine learning (ML) models and deep neural networks (DNN). Our proposed method involves data acquisition, Exploratory Data Analysis (EDA), and modelling using classical ML models, such as Support Vector Machine (SVM), Logistic Regression, Multinomial Naive Bayes, Random Forest, Gradient Boosting, Hard Voting, and Bagging, as well as DNN models, including Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Bidirectional LSTM (Bi-LSTM), and CNN-Bi-LSTM + Attention. The models were evaluated using accuracy and Area Under the Curve (AUC) metrics. Our results demonstrate commendable performance for both the classical ML classifiers and the DNN models, with accuracies above 75%. Notably, the CNN-Bi-LSTM + Attention model achieved an impressive AUC score of 97%. Additionally, explainability using LIME (Local Interpretable Model-agnostic Explanations) provided valuable insights into model decisions. This research contributes to Swahili natural language processing and lays the foundation for further exploration of transformer-based models for improved classification.


INTRODUCTION
Language is crucial in allowing access to information in today's digitally connected world, especially for diverse linguistic populations [31][11]. However, many news organizations in East Africa, particularly in Kenya and Uganda, have relied heavily on English, a foreign language, for online news distribution [2]. As a result, native languages such as Swahili, one of Africa's most widely spoken native languages, are underrepresented [21]. Given its widespread usage and integration into various sectors, including the military and education, Swahili is essential for Uganda's digitization efforts [28]. Unfortunately, Swahili and other indigenous African languages have not received enough attention from the natural language processing (NLP) community, which limits their use in current applications and technological services.
Text classification, a fundamental NLP task, is crucial for managing extensive text data effectively [15][25]. This study delved into text classification by exploring classical Machine Learning (ML) approaches and cutting-edge Deep Learning (DL) techniques. By harnessing the power of these methods, we aimed to develop adequate language models that can automate and streamline news publication procedures on various online platforms. NLP technology has advanced for English, but native languages have not seen the same level of automation [5]. This gap places Swahili news at risk of disappearing from online venues: manual content-organization practices lead to time delays and limited usage, highlighting the need for interpretable and explainable models explicitly designed for Swahili news classification. Automating the classification of news can improve reader accessibility and help Swahili be more fully represented in upcoming applications and internet products. Such models will save the community time and reduce journalists' workloads and news organizations' expenses by eliminating manual content-arrangement procedures. This research focuses on news in the Swahili language and some of its categories (Kitaifa: National, Kimataifa: International, Burudani: Entertainment, Michezo: Sports, Biashara: Business, and Afya: Health). Considering the major differences in spoken and written Swahili throughout the countries where it is used, news articles were gathered from websites in Uganda, Kenya, Tanzania, and from international organizations. The main goal of this study was to develop explainable multi-class classification models using Local Interpretable Model-agnostic Explanations (LIME) as an explainable artificial intelligence (XAI) technique.
The Swahili language has a unique system of approximately fourteen noun classes, making it challenging to determine the class of a noun based solely on its meaning [17][6]. Swahili has a rich history of international trade, which has drawn the attention of scholars due to its impressive structures, maritime knowledge, and global trading partnerships [27]. The Swahili civilization evolved over two thousand years, focusing on material culture, oral traditions, and the Swahili language [9][4]. The Swahili people's origins are uncertain, with different theories suggesting indigenous and external influences [24]. Given the language's unique linguistic complexities, our study aimed to address the challenges of working with Swahili in text classification and interpretable models. Swahili, being the most widely spoken Bantu language, features a complex noun class system, making text analysis more intricate.
The rest of the paper is organized as follows: Section II discusses the literature related to Swahili news classification. Section III introduces and discusses the paper's contributions. Section IV explains the methodologies used to achieve explainable models for the research. Section V discusses the results. Finally, Section VI concludes the paper and suggests future work.

RELATED WORK
The literature review examined vital studies and research papers on classifying texts in native languages, focusing on Swahili. We begin with a study by Little et al. [13]. They discovered that an SVM model with tokenization and stop-word removal achieved the best maximum accuracy of 85.13% for Swahili news article categorization. This study provided a benchmark for Swahili news article categorization and contributed to Swahili text classification research. G. Martin et al. [18] introduced "SwahBERT," a monolingual Swahili model created using data from online forums and news outlets. They compared SwahBERT with multilingual BERT and found that SwahBERT outperformed it in practically all downstream tasks, including emotion classification. Wanjawa et al. [30] addressed the lack of resources for languages like Swahili by creating a Swahili Question Answering Dataset (KenSwQuAD) for machine cognition tasks. They gathered and annotated a sizable dataset suitable for training and evaluating machine learning applications in Swahili. Jiang et al. [10] proposed a parameter-free text classification approach based on gzip compression and k-nearest-neighbour classifiers. Their method produced competitive results on various datasets, including four low-resource languages, demonstrating the approach's efficacy even with insufficient labelled data. The KINNEWS and KIRNEWS datasets were presented by Niyongabo et al. [23] for the multi-class classification of news articles in Kinyarwanda and Kirundi. Shikali and Mokhosi [27] addressed the lack of data for Swahili language modelling. They derived unannotated Swahili datasets, a syllabic alphabet, and a Swahili word analogy dataset to enhance language processing resources for native languages. Ma et al.
[14] significantly contributed to overcoming the neglect of low-resource languages by developing Taxi1500, a multilingual dataset for classifying texts in over 1500 languages. The dataset was based on parallel translations of the Bible and offered a benchmark for evaluating multilingual language models. Fesseha et al. [7] tackled the challenge of text classification for Tigrigna, a low-resource language. Ghasemi et al. [8] proposed a cross-lingual deep-learning framework for sentiment analysis in Persian. By leveraging cross-lingual embeddings, they demonstrated the model's effectiveness in improving sentiment analysis for low-resource languages. Kuriyozov et al. [12] contributed to text classification in Uzbek by creating a dataset and evaluating various models, including RNN, CNN, and BERT-based models. Their experiments provided a baseline for Uzbek text classification research. Chhatwal et al. [1] explored the importance of explainable AI in legal document review. They suggested a technique for explainable predictive coding that allows attorneys to examine the model's document classification decisions, enabling greater confidence and speed in the review process.
Additionally, a connection to similar studies relevant to text classification was made. Vinh and Kha [29] introduced a dataset of Vietnamese online news articles, providing a valuable resource for Vietnamese text classification. Murty and Rughani [20] focused on using Support Vector Machines (SVM) to classify dark web data, making encouraging progress towards understanding and categorizing illicit activities on the dark web. Additionally, Schonle et al. [26] introduced the Weighted Unimportant Part-of-Speech Model (WUP-Model) for removing tokens during text corpora preprocessing. The model demonstrated significant advantages over traditional stop-word removal. Finally, Nguyen et al. [22] presented a crime prediction method. They forecast crime types using location and time data, highlighting the significance of data preprocessing and machine learning algorithms in predicting criminal activities.
In conclusion, these studies demonstrate techniques to improve text classification for languages such as Swahili and other complex data challenges.

PAPER CONTRIBUTION
NLP has advanced the literature for high-resource languages, but native languages lag behind. As seen in the literature review, researchers are addressing the need for more resources and datasets for these languages to bridge the gap. These can provide valuable benchmarks and models to improve natural language processing in diverse linguistic settings. The following are the contributions made in this research: (1) Data Enrichment: In addition to the work in [3], where David created a Swahili News Classification dataset, we scraped additional data from online resources. Combining both datasets, we achieved a more robust dataset, well suited for classifying Swahili news. (2) Model Diversity and Performance Evaluation: Another significant contribution was developing and testing various models for classifying Swahili news. These models included classical machine learning, ensemble, and deep learning models. The primary aim was to compare their performances and identify the best model for the task. (3) Enhanced Explainability and Interpretability: The third contribution, the most crucial and one lacking in all the literature reviewed, was ensuring our models' explainability and interpretability. By doing so, we eliminated the "black box" nature often associated with machine learning models. This transparency allowed us to understand how our models work and gain insights into their decision-making process.

METHODOLOGIES
We used a step-by-step approach to create the best explainable model for classifying Swahili news. Figure 1 illustrates the procedures and steps followed in this research. These steps are detailed in this section.
Figure 1: Step-by-step method used to achieve explainable models for Swahili news classification

Data
The primary dataset used was created by David [3] and made available to the scientific community for Swahili news classification from various sources. This dataset has been referenced twice in the literature review. It contains 22,207 entry sentences, each labelled with one of six news categories: kitaifa ("national"), michezo ("sports"), burudani ("entertainment"), uchumi ("economy"), kimataifa ("international"), and afya ("health").
One significant challenge of this data was the imbalance in news distribution across different categories, as shown in figure 2. This imbalance could have introduced bias during model training and reduced prediction accuracy for categories with relatively low counts. We decided to augment David's dataset with new data to address this issue. We scraped additional data for all news categories to establish more patterns in the dataset, as language modelling requires sufficient data to ensure high-quality word representation.
We used Beautiful Soup and Apache Nutch to acquire the new data, enabling us to scrape 10,211 news entries from various sources, as shown in figure 3.
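While our pipeline used Beautiful Soup and Apache Nutch, the core of extracting article text from a news page can be sketched with Python's standard-library `HTMLParser`. This is a minimal illustration, not our production scraper; the choice of `<p>` tags and the sample HTML are assumptions for the example.

```python
from html.parser import HTMLParser

class NewsTextExtractor(HTMLParser):
    """Collects text from <p> tags of a news page (tag choice is illustrative)."""
    def __init__(self):
        super().__init__()
        self._in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p and data.strip():
            self.paragraphs.append(data.strip())

# Example on a local HTML snippet; a real crawl would fetch pages first
html = ("<html><body><h1>Habari</h1>"
        "<p>Timu ya taifa imeshinda.</p>"
        "<p>Mechi ilikuwa kali.</p></body></html>")
parser = NewsTextExtractor()
parser.feed(html)
article_text = " ".join(parser.paragraphs)
print(article_text)  # Timu ya taifa imeshinda. Mechi ilikuwa kali.
```

In practice, each extracted article would be paired with the category of the page it was scraped from to form a labelled entry.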

Exploratory Data Analysis (EDA)
Various strategies were used for the EDA process. First, data preparation included data transformation and cleaning, transforming the data into a form appropriate for analysis. Among these transformations, irrelevant or noisy data points were removed, and missing values were handled appropriately to ensure data integrity. Next, data visualization played a crucial role in gaining insights into the dataset. For example, the top word frequencies for all classes were visualized using bar charts of class-wise word frequencies, allowing us to identify the most frequently occurring words across the entire dataset, as shown in Tables 1-6. The results provided a clear representation of the most common words, aiding in understanding the prevalent topics in the text data. Furthermore, word frequency analysis was performed using N-grams, specifically unigrams and bigrams. The unigrams were visualized using a word cloud, as shown in Figure 6.
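The unigram and bigram counting behind this analysis can be sketched in a few lines with `collections.Counter`; the tiny corpus here is an illustrative stand-in for the full dataset.

```python
from collections import Counter

def ngram_counts(texts, n=1):
    """Count n-grams across a list of whitespace-tokenized Swahili sentences."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

corpus = [
    "timu ya taifa imeshinda mechi",
    "bei ya mafuta imepanda nchini",
    "timu ya simba imeshinda tena",
]
unigrams = ngram_counts(corpus, n=1)
bigrams = ngram_counts(corpus, n=2)
print(unigrams.most_common(2))  # the two most frequent single words
print(bigrams.most_common(1))   # the most frequent word pair
```

The `most_common` rankings feed directly into the per-class frequency tables and the word cloud.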
The EDA process enabled the identification of significant patterns and word distributions in the dataset, enhancing our understanding of the text data's characteristics.
Figure 6: The word cloud of the entire dataset used. It shows that within the dataset, several common words stand out, such as "dar es salaam," the famous commercial city in Tanzania, "anasema," meaning "he said," "nchi," meaning "the country," and many others.

Modelling
Three distinct sets of techniques were employed in our study. Firstly, we explored classical machine learning models from the sklearn library, including the SVM Classifier, Logistic Regression Classifier, Multinomial Naive Bayes Classifier, Random Forest Classifier, and Gradient Boosting Classifier. Next, we delved into ensemble models; specifically, we implemented the bagging classifier and hard voting to further enhance classification accuracy by leveraging the power of multiple base models. Lastly, we ventured into the realm of deep neural network models, which are known for their ability to capture complex patterns in data. In this phase, we developed several models, including a Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Bidirectional LSTM (Bi-LSTM), and a combination of CNN and Bi-LSTM with an Attention mechanism. These deep learning models aimed to perform better by learning intricate representations from text data. Utilizing these diverse techniques allowed us to thoroughly explore the classification task and determine the most effective model for Swahili news classification.
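The classical and hard-voting setup can be sketched as a single sklearn pipeline. The toy texts and labels below are placeholders for the real corpus, and the hyperparameters are illustrative rather than the ones used in our experiments.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the Swahili news corpus
texts = [
    "timu imeshinda mechi ya ligi",      # michezo (sports)
    "golikipa alizuia penalti mbili",    # michezo
    "bei ya mafuta imepanda sokoni",     # biashara (business)
    "benki kuu imetangaza riba mpya",    # biashara
    "wagonjwa wa malaria wameongezeka",  # afya (health)
    "hospitali imepokea dawa mpya",      # afya
]
labels = ["michezo", "michezo", "biashara", "biashara", "afya", "afya"]

# Hard voting: each base classifier casts one vote per article
voter = VotingClassifier(
    estimators=[
        ("svm", LinearSVC()),
        ("lr", LogisticRegression(max_iter=1000)),
        ("nb", MultinomialNB()),
    ],
    voting="hard",
)
model = make_pipeline(TfidfVectorizer(), voter)
model.fit(texts, labels)
print(model.predict(["timu ya taifa imeshinda"]))
```

Swapping `voter` for any single sklearn classifier reproduces the individual classical-model experiments.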
CNN is a powerful deep learning model for processing sequential data such as text.We first encode words into dense vectors.The CNN then employs 1D convolutional layers with ReLU activation to detect local patterns and features in the text.A Global Max Pooling layer was utilized to extract the most important features.Dropout was applied to enhance the model's generalization and reduce overfitting.The output layer was a Dense layer with softmax activation to predict the probabilities for each class.
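The convolution-ReLU-global-max-pooling chain at the heart of this model can be demonstrated in plain NumPy. The shapes and random weights below are illustrative; a trained network would learn the kernels.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, embed_dim = 7, 4      # a 7-token sentence with 4-d word vectors
num_filters, width = 3, 2      # 3 convolution filters spanning 2 tokens each

x = rng.normal(size=(seq_len, embed_dim))             # embedded sentence
w = rng.normal(size=(num_filters, width, embed_dim))  # conv kernels

# 1D convolution over the token axis, then ReLU activation
conv = np.array([
    [np.sum(x[t:t + width] * w[f]) for f in range(num_filters)]
    for t in range(seq_len - width + 1)
])                               # shape: (seq_len - width + 1, num_filters)
relu = np.maximum(conv, 0.0)

# Global max pooling keeps only the strongest response per filter
features = relu.max(axis=0)      # shape: (num_filters,)
print(conv.shape, features.shape)
```

A dense softmax layer on top of `features` would then produce the class probabilities.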
LSTM is a specialized recurrent neural network (RNN) variant that handles sequential data. Our implementation began with an embedding layer that transforms words into continuous vectors. An LSTM layer with 100 units was introduced to capture long-term dependencies in the text. Dropout was employed to prevent overfitting and improve the model's generalization. The output layer was a Dense layer with sigmoid activation.
Bi-LSTM builds upon the LSTM's strengths by processing input sequences in both the forward and backward directions. It used an embedding layer to represent words as dense vectors. A Bidirectional LSTM layer with 64 units was created to detect bidirectional dependencies in the text. Regularization in the form of dropout was used to prevent overfitting. Finally, for multi-class classification, the output layer was a Dense layer with softmax activation, comparable to the LSTM.
CNN-Bi-LSTM + Attention is a model that improves performance by combining the strengths of the CNN, the Bi-LSTM, and attention mechanisms. Once more, an embedding layer represented words as dense vectors. The model used numerous 1D convolutional layers with varying filter sizes to capture diverse patterns in the text. The outputs of these CNN layers were concatenated and fed to a Bidirectional LSTM layer, allowing the model to capture bidirectional temporal dependencies. An attention layer was then applied to the LSTM output to dynamically weigh the importance of different time steps while focusing on important information. Finally, a dropout layer and a dense output layer completed the model.
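The weighted-sum mechanism of the attention layer can be illustrated in NumPy: score each timestep of the Bi-LSTM output, normalize the scores with a softmax, and form a context vector. The scoring vector `v` stands in for a learned parameter and is random here.

```python
import numpy as np

rng = np.random.default_rng(1)
timesteps, hidden = 5, 8               # Bi-LSTM outputs: 5 steps, 8-d states
h = rng.normal(size=(timesteps, hidden))
v = rng.normal(size=hidden)            # learnable scoring vector (illustrative)

# Score each timestep, normalize with softmax, then take the weighted sum
scores = h @ v                          # (timesteps,)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                # attention weights sum to 1
context = weights @ h                   # (hidden,) context vector for the classifier
print(np.round(weights, 3), context.shape)
```

The context vector, rather than only the final LSTM state, is what the dropout and dense layers consume, letting the model emphasize the most informative words.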

Explainability
To ensure the interpretability and explainability of our Swahili news classification models, we employed LIME (Local Interpretable Model-agnostic Explanations), an explainable AI technique. LIME is a powerful tool that helps shed light on the black-box nature of complex ML models, enabling us to understand how they make predictions, particularly in the context of text data [16]. We learn about our models' decision-making processes by using LIME to provide local explanations for individual predictions [19]. This procedure entailed altering the input text data and observing how the model's predictions changed. LIME generates a linear approximation that closely mimics the model's behaviour for a specific instance by sampling and perturbing the text data around that instance. This increased our knowledge of the model's predictions and provided helpful insights into which phrases or properties were most important to each prediction [19]. The use of LIME's explainable AI capabilities ensured the models' transparency and trustworthiness.
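Although we used the lime library itself, its core procedure — perturb the text by masking words, weight each perturbation by its similarity to the original, and fit a weighted linear surrogate — can be sketched in NumPy. The `black_box` classifier below is a toy stand-in, and the token list, weights, and kernel width are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
tokens = ["samaki", "soko", "bei", "mpira"]

def black_box(masks):
    """Toy classifier: P(afya) driven mostly by 'samaki' (illustrative stand-in)."""
    w = np.array([2.0, 0.5, 0.3, -1.0])
    z = masks @ w - 0.5
    return 1.0 / (1.0 + np.exp(-z))

# 1) Perturb: randomly drop words (1 = keep, 0 = drop)
masks = rng.integers(0, 2, size=(200, len(tokens))).astype(float)
preds = black_box(masks)

# 2) Weight each perturbation by its closeness to the full sentence
distance = (len(tokens) - masks.sum(axis=1)) / len(tokens)
sample_w = np.exp(-(distance ** 2) / 0.25)

# 3) Fit a weighted linear surrogate; its coefficients explain the prediction
A = np.hstack([masks, np.ones((len(masks), 1))])   # add intercept column
W = np.diag(sample_w)
coef = np.linalg.lstsq(W @ A, W @ preds, rcond=None)[0]
ranked = sorted(zip(tokens, coef[:-1]), key=lambda t: -abs(t[1]))
print(ranked[0][0])  # the most influential word for this instance
```

The surrogate's largest-magnitude coefficients correspond to the green and red word highlights that LIME renders in its plots.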

MODEL RESULTS AND DISCUSSION
After the dataset was ready for modelling, we split it into two sets: a training set comprising 75% of the data and a testing set containing the remaining 25%. The training set was used to train the classical machine learning models, and the testing set was used to evaluate the models' performance. Table 7 summarizes the findings of these models, providing a detailed overview of their performance. According to Table 7, all models did exceptionally well, with more than 75% accuracy on the test set. The Random Forest model overfitted slightly, with 100% accuracy on the training set and 81% accuracy on the test set, but overall, the classical ML models performed admirably. Furthermore, as observed in the methods section, achieving 95% accuracy on the test set is unlikely due to the linguistic closeness among the classes in Swahili news. As a result, achieving 83% accuracy on the test set is impressive and demonstrates the models' ability to handle the complexity of Swahili news classification.
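The 75/25 split can be reproduced with sklearn's `train_test_split`; the placeholder texts below stand in for the real articles, and the stratification and random seed are illustrative choices, not necessarily those used in our experiments.

```python
from sklearn.model_selection import train_test_split

texts = [f"makala ya habari {i}" for i in range(100)]      # placeholder articles
labels = ["michezo", "biashara", "afya", "kitaifa"] * 25   # placeholder classes

# 75% train / 25% test; stratify keeps each class's proportion in both sets
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)
print(len(X_train), len(X_test))  # 75 25
```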
For the DNN models, the training set was split further, with 10% of the 75% held out as a validation set. Table 8 presents the performance of the various DNN models developed. Notably, the CNN-Bi-LSTM + Attention model achieved the highest AUC score of 97%, as also shown in figure 8, indicating its superior discriminative ability. Despite an impressive 98% accuracy on the training data and 94% accuracy on the validation set, this model displayed a slightly lower accuracy of 84% on the test set. Nonetheless, all DNN models demonstrated commendable performance. Additionally, the remarkable performance of the CNN-Bi-LSTM + Attention model is evident in figure 7. The plot graphically illustrates the model's progress during training, and a noteworthy metric for assessing its reliability was the confidence interval. The confidence interval is a statistical concept used to estimate the range within which the true value of a population parameter is likely to lie, based on a sample of data. In this case, the confidence interval is computed as 0.9472 ± 0.0472, which means that we are 95% confident that the true performance of the model (e.g., accuracy) lies within the range of 0.9472 plus or minus 0.0472. Furthermore, Figure 9 illustrates the confusion matrix for the CNN-Bi-LSTM + Attention model on the test set, demonstrating its impressive ability to classify different classes accurately. The model achieved excellent performance overall, except for the "afya" class, which represents health-related articles. Some articles from this class were misclassified as "kimataifa," which pertains to international news. This misclassification can be attributed to the fact that the "kimataifa" class, as discussed in the previous sections, shares common words with "afya," such as "mwaka," meaning "year," resulting in the model's bias. To address this issue and ensure accurate classification of the "afya" class, collecting more data specifically for this category is essential, which would help mitigate such misclassifications and further enhance the model's classification performance. In pursuit of explainability and interpretability, LIME was employed on the test data, enabling us to gain insights into our models' decision-making processes. As we previously discussed, the crux of explainability lies in providing intelligible reasons behind the model's predictions. This paper focuses on reporting the "afya" class, which displayed relatively lower performance than the other classes. Nonetheless, the explanations for the other classes showed higher confidence.
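A confidence interval of the kind reported above follows the normal-approximation formula p ± z·sqrt(p(1−p)/n). The sketch below applies it to the 0.9472 accuracy; the evaluation-sample size `n` is a hypothetical value for illustration, not the one from our experiments.

```python
import math

def accuracy_ci(p, n, z=1.96):
    """95% normal-approximation confidence interval for an accuracy p on n samples."""
    margin = z * math.sqrt(p * (1.0 - p) / n)
    return p - margin, p + margin

# Illustrative: validation accuracy of 0.9472 on a hypothetical 100 evaluation samples
low, high = accuracy_ci(0.9472, 100)
print(f"{low:.4f} to {high:.4f}")
```

The margin shrinks as 1/sqrt(n), so larger evaluation sets tighten the interval.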
Figure 10 illustrates the model's predictions, with the "afya" class obtaining a confidence score of 60%, the "kimataifa" class garnering 39%, and the remaining classes accounting for 1%. Additionally, the figure reveals the true class of the selected article from the test set, which indeed belongs to the "afya" class. Moreover, it highlights the words that significantly influenced the model's prediction for this class. This concept is further elucidated in Figures 11 and 12. Figure 11 depicts words in green, such as "samaki" (fish), "bidhaa" (product), and "soko" (market), which were found in the article and favoured its classification into the "afya" class. Conversely, words in red, like "uvuvi" (fishing) and "usafi" (hygiene), were more indicative of other classes. Continuing to Figure 12, it presents the actual article with highlighted words, some of which align with those in Figure 11. Notably, the highlighted words "samaki," "soko," and "umaarufu" strongly correspond to the "afya" class, while "uvuvi" and "usafi" represent minority words associated with other classes. These three plots generated by LIME effectively unravel the rationale behind the model's prediction for the "afya" class, providing invaluable insights into its decision-making process and the degree of influence exerted by specific words in the classification. This transparency enhances the model's credibility and facilitates a deeper understanding of its classification outcomes.
The models demonstrated commendable accuracy, achieving over 75% on the test set. Notably, the CNN-Bi-LSTM + Attention model displayed outstanding performance with a high AUC score of 97%. LIME explainability provided valuable insights into the models' decision-making, enhancing transparency. Our findings underscore the potential of both classical and deep learning methods in effectively categorizing Swahili news.
To further improve Swahili news classification, future research can focus on addressing class imbalances, especially for the "afya" class, by collecting more data. Exploring advanced natural language processing techniques, such as transformer-based models like BERT or GPT, may yield even better results. Additionally, investigating domain adaptation methods to adapt models to Swahili news datasets and evaluating their real-world application in news aggregation and topic clustering can lead to practical implementation and advancements in Swahili language processing tasks.

Figure 3: Distribution of our scraped data from online resources

Figure 7: CNN-Bi-LSTM + Attention model accuracy plot with confidence interval metric
Figure 8: The ROC AUC score for the CNN-Bi-LSTM + Attention model

Figure 9: Confusion matrix on the test data for the CNN-Bi-LSTM + Attention model

Figure 10: Lime explainability of article 3 in the test set.

Figure 11: Local explainability for article 3 and the target words that determined its classification into the "afya" class.

Figure 12: Text with highlighted words that determined the classification of article 3.

Table 7: Classical ML Classifiers Performance

Table 8: DNN Models Performance