Fine-Tuning BERT on Twitter and Reddit Data in Luganda and English

Deep learning techniques, driven by the Transformer architecture and models such as BERT, have found broad utility. While sentiment analysis is well established for high-resource languages, it remains largely unexplored for low-resource ones. Our focus is Luganda, a prevalent Ugandan language spoken by over 21 million people. We used three social-media datasets to train machine learning models as baselines and fine-tuned BERT as the deep learning approach. Our findings advance sentiment analysis in both Luganda and English, and our data-extraction approach aids the construction of domain-specific datasets. This research advances NLP and aligns with global deep-learning initiatives.


INTRODUCTION
Deep learning methods have become essential components in the construction of contemporary language models. Inspired by the Transformer architecture [40], many researchers have built models on top of pre-trained language models such as BERT [12]. One very common application is the multi-class classification of social media text [17, 23]. There is a particular focus on extending these methods to low-resource languages, for which data collection is costly, because these models have demonstrated strong performance even with limited data samples. KinyaBERT [29], for example, is a model fine-tuned on several language tasks in the low-resource language Kinyarwanda.
For any service or product that relies on user feedback, understanding users' opinions, feelings, and emotions to gauge their sentiments is essential for identifying areas that need improvement [32]. Sentiment analysis has been widely performed using social media data for high-resource languages [41] as well as some low-resource languages. For example, NaijaSenti [24] was proposed for sentiment analysis in Nigerian languages by fine-tuning mBERT [27], RemBERT [11], AfriBERTa [30], and mDeBERTaV3 [16], while [1] performed a lexicon-based sentiment analysis on Libyan-dialect tweets. Another example is the use of machine learning and deep learning techniques for sentiment analysis in Central Kurdish [14]. However, Uganda still faces significant challenges in employing such approaches.
Uganda, located in East Africa, boasts a multilingual environment with over 43 native languages, categorized into three main language families: Bantu, Central Sudanic, and Nilotic. Among these languages, Luganda stands out as the most widely spoken [9]. Additionally, English has held an essential position as one of the official languages since 1962, with extensive usage in education, commerce, and legal proceedings. Luganda's linguistic characteristics include a system of 10 formal noun classes; notably, in the absence of inflectional suffixes to indicate plural nouns, plurality is expressed through prefixes [3]. This contrasts with English, which classifies nouns as countable or uncountable [13]. Luganda further exhibits morphological richness and agglutinative properties. Within this multilingual context, there is growing interest in leveraging Natural Language Processing (NLP) to bridge communication gaps in Ugandan society. While some NLP research areas have emerged in Uganda, such as machine translation [18], other tasks, including sentiment analysis, text classification, and question answering, have so far been explored mainly in related languages such as Swahili [36]. These uncharted territories offer substantial potential for the study of Luganda, a Bantu language spoken by over 21 million individuals [10]. In light of this, our paper designates Luganda as a representative low-resource language. It is also essential to note the scarcity of substantial empirical evidence supporting the use of BERT with Luganda.

Twitter and Reddit data
One of the advantages of language models like BERT is their capability for knowledge transfer with little fine-tuning [2]. Thus, our work started by fine-tuning BERT on Luganda, even though Luganda is not one of the languages used during the pre-training stage.
We use a public dataset called TweetEval [5], which consists of data collected from Twitter 1. This dataset has been used for various NLP tasks, including emotion recognition [20], emoji prediction [6], irony detection [39], hate speech detection [7], offensive language identification [42], sentiment analysis [33], and stance detection [21]. For our purpose, we used the 3-class sentiment analysis dataset, on which we fine-tuned BERT for both English and Luganda.
Due to the absence of an open-source context-aware dataset, we extracted topical information from a trending Twitter hashtag (UgHealthExhibition) as well as related tweets aimed at showcasing the status of healthcare in Uganda. Since February 2023 2, data retrieval through the Twitter API is limited by the user's subscription level 3, and the academic Twitter programme, which used to provide researchers free access to datasets, is yet to be re-activated 4. We therefore retrieved additional data from Reddit, whose free data access is limited to 100 queries per minute per client ID 5. The data posted on these platforms is in English, but many locals would prefer to express their opinions in Luganda instead.
In the past, the Uganda National Health Users'/Consumers organization studied the patients' feedback mechanism at health facilities. They used focus group discussions, interviews, and quantitative surveys, and analyzed the data using statistical approaches. They emphasized the need to improve the feedback mechanism: even though policies, approaches, and suggestion boxes exist, most people were unaware of them or scared of being identified, and hence failed to give their opinions [38]. Another study, which looked at community perceptions of quality of care, surveyed only two districts using qualitative analysis [26]. These approaches are intensive in terms of manpower, time, equipment, and skills, rendering them ineffective, especially as data grows. Deep learning approaches could be employed, but there are limits: 1) the low-resource nature of Luganda limits the availability of labelled training data for classification tasks; 2) there are limited resources to build custom models from scratch; and 3) text-processing libraries like TextBlob are not available for Luganda annotation.
To address these three limitations, we use social media data in English, label the custom-made dataset using the TextBlob library, translate the data to Luganda, and build machine-learning models for both languages. We take advantage of transfer learning by fine-tuning the Transformer-based BERT model for the deep learning approach. This approach requires little training time, making it suitable for settings with limited resources.

RELATED RESEARCH
A few researchers have used social media data to perform sentiment analysis on English tweets from Uganda. Mukonyezi et al. (2018) performed sentiment analysis on English tweets about Uganda's 2016 presidential elections [25]. Another study looked at the determinants of sentiment towards Uganda's traditional media houses [28].
To the best of our knowledge, no one has attempted to use deep learning approaches to build Luganda models, either generic or domain-specific. In this research, we attempt to build such models.

METHODS
Data collection
The three datasets were initially in English and then translated to Luganda using Google Translate, whose chrF score from English to Luganda is 39.8 [4]. We kept the labels from the English dataset and applied them to the Luganda dataset. We use the names "public", "Twitter-Reddit", and "merged" to refer to each dataset pair, as explained in the following sections.
(1) Public dataset: This dataset was accessed from Hugging Face 6, as used in [5]. It had more neutral samples than positive and negative samples, as seen in Fig. 1. The text-length distribution for English can be seen in Fig. 2a, with the neutral class having longer text than the negative and positive classes. In the Luganda dataset, the text lengths followed a normal distribution but with longer sentences than in English, as seen in Fig. 2b. However, English had more frequent sentences with text lengths between 110 and 130.
(2) Twitter-Reddit dataset: The second dataset we used was domain-specific to healthcare. The text was extracted from Twitter and Reddit, and each sample was annotated with a neutral, positive, or negative label using TextBlob 7. We then merged them to form one dataset. A word cloud in Fig. 5 gives an overview of word frequencies, showing the main topics in our dataset. The data distribution can be seen in Fig. 3a.
(a) Twitter dataset: We extracted tweets using the hashtag (UgandaHealthExhibition), keywords ("Uganda" and "Health"), and tweets about health from the Ministry of Health (MoH), local news agencies (National Television (NTV) and Uganda Broadcasting Corporation (UBC)), and National Broadcasting Services (NBS). The text was pre-processed by removing usernames, hashtags, special characters, numbers, punctuation, empty spaces, and URLs. The total number of cleaned sentences was 330. This data was then annotated using TextBlob.
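The pre-processing steps above can be sketched with standard regular expressions. This is a minimal illustrative version, and the sample tweet is hypothetical:

```python
import re

def clean_tweet(text: str) -> str:
    """Strip URLs, usernames, hashtags, numbers, and punctuation, then squeeze spaces."""
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"@\w+", " ", text)           # usernames
    text = re.sub(r"#\w+", " ", text)           # hashtags
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # numbers, punctuation, special characters
    return re.sub(r"\s+", " ", text).strip()    # collapse empty spaces

tweet = "@MoH_Uganda 5 new clinics opened! Details: https://example.com #UgandaHealthExhibition"
print(clean_tweet(tweet))  # → "new clinics opened Details"
```

The order matters: URLs, usernames, and hashtags are removed before the generic punctuation pass, since that pass would otherwise destroy the `@`, `#`, and `://` markers they are matched by.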
Here, the text was tokenized and stop words were removed. Then, part-of-speech (POS) tagging and word stemming were applied, as demonstrated in Fig. 4.
The polarity and subjectivity scores were then extracted. If the polarity was above 0, a positive sentiment was assigned; if it was below 0, a negative sentiment; otherwise, the text was labeled neutral. These values were instrumental in facilitating the manual verification process.
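The thresholding rule above can be sketched as a small function. In practice the polarity score would come from TextBlob (`TextBlob(text).sentiment.polarity`); the example scores below are hypothetical stand-ins:

```python
def label_from_polarity(polarity: float) -> str:
    """Map a TextBlob-style polarity score in [-1.0, 1.0] to a sentiment label."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

# Hypothetical polarity scores standing in for TextBlob output.
scores = [0.8, -0.6, 0.0]
labels = [label_from_polarity(p) for p in scores]
print(labels)  # → ['positive', 'negative', 'neutral']
```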

(b) Extracting Bi-grams
To form a search criterion for extracting text from Reddit, we extracted bi-grams from the Twitter dataset. This gave us the most frequent word pairs in the dataset.

(c) Reddit dataset: The bi-grams helped us build a search criterion for extracting text from Reddit.

(3) Merged dataset: We merged the three datasets to build a single dataset. To avoid biasing the data towards the public dataset due to its many samples, we extracted only 3,885 samples from it, the same number as in the Twitter and Reddit datasets, as seen in Fig. 3a.

EXPERIMENTS
After splitting the data into training, validation, and testing sets, we transformed it into feature vectors using TF-IDF. We trained four multi-class machine learning models, i.e., a Naive Bayes classifier (NB), Random Forest (RF), Support Vector Machines (SVM), and Gradient Boosting (GB), using the Google Colab platform.
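The bi-gram keyword extraction described above can be sketched with the standard library alone. The sample tweets are hypothetical:

```python
from collections import Counter

def top_bigrams(texts, k=3):
    """Count adjacent word pairs across all texts and return the k most frequent pairs."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        counts.update(zip(words, words[1:]))
    return [pair for pair, _ in counts.most_common(k)]

tweets = [
    "health workers in uganda need support",
    "uganda health exhibition opens today",
    "the health workers deserve better pay",
]
# ('health', 'workers') appears twice, so it ranks first.
print(top_bigrams(tweets, k=2))
```

The returned word pairs can then be joined into search phrases for the Reddit query.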
For deep learning, we fine-tuned the bert-base-cased model; by default, the base model has 12 Transformer layers, a hidden size of 768, and 12 attention heads. The hyperparameters were a batch size of 32, a learning rate of 2e-5, and three training epochs.
The evaluation metrics used on both validation and testing datasets were precision, recall, F1-score, and accuracy.
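A minimal sketch of the baseline pipeline, assuming scikit-learn; the tiny training set below is a hypothetical stand-in for the real data, and only the Naive Bayes baseline is shown:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Hypothetical labelled samples standing in for the Twitter-Reddit data.
train_texts = ["great hospital care", "terrible waiting times",
               "the clinic opens at nine", "nurses were very helpful",
               "no medicine in stock", "appointment scheduled for friday"]
train_labels = ["positive", "negative", "neutral",
                "positive", "negative", "neutral"]

# TF-IDF feature vectors, then one of the four baseline classifiers.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
clf = MultinomialNB().fit(X_train, train_labels)

test_texts = ["very helpful nurses"]
preds = clf.predict(vectorizer.transform(test_texts))
print("accuracy:", accuracy_score(["positive"], preds))
```

Swapping `MultinomialNB` for `RandomForestClassifier`, `SVC`, or `GradientBoostingClassifier` reproduces the other three baselines on the same TF-IDF features.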

RESULTS
The maximum sentence lengths for the three datasets can be seen in Table 1. This directly impacted training, as the Luganda experiments took more time. The average time taken to train on the English Twitter-Reddit dataset for three epochs was 74% less than for Luganda. The delay can be attributed to the sentence length (Fig. 2) and the space occupied by padding tokens. For each training run, a hyperparameter search was carried out, and the best values for each of the three sets were selected, as shown in Table 2.
The Luganda (Lg) models on both the Twitter-Reddit and the merged datasets used a one-layer feed-forward classifier with no dropout, as indicated by a zero (0) value in the table. A smaller batch size of 16 performed better on all the small datasets. For the rest of the Luganda and English (En) experiments, we defined a feed-forward layer that maps the input features to hidden layers with ReLU, dropout for regularization, and finally another linear transformation followed by softmax to obtain the probability of each class.
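The classifier head described above can be sketched in PyTorch. This is an illustrative version, not the exact training code: the hidden size of 256 is a hypothetical choice, while the input size of 768 matches bert-base-cased's pooled output:

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """Feed-forward head on BERT's pooled output: linear -> ReLU -> dropout -> linear -> softmax."""
    def __init__(self, in_features=768, hidden=256, num_classes=3, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, pooled):
        # Softmax turns the logits into a probability per sentiment class.
        return torch.softmax(self.net(pooled), dim=-1)

head = ClassifierHead()
probs = head(torch.randn(4, 768))  # a batch of 4 pooled BERT vectors
```

The one-layer variant used for the small Luganda datasets corresponds to dropping the hidden layer, ReLU, and dropout, leaving a single `nn.Linear(768, 3)` before the softmax.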
The performance results in this paper are based on the testing dataset.

Public dataset
The data was split into training, validation, and testing sets with 90%, 5%, and 5% of the samples, respectively. For English (Table 3), the SVM model achieved the highest accuracy among the baseline models on the testing set (66%), while NB and GB both had the lowest value of 60%. The BERT model outperformed the baselines with an accuracy of 74%. On Luganda (Table 4), SVM again performed best among the baselines (60%), and GB achieved the lowest accuracy (57%). As with English, BERT outperformed all the baselines, with an accuracy of 65%. This first experiment showed that we can fine-tune BERT on Luganda and achieve performance close to that of its English counterpart.
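The 90/5/5 split can be reproduced with two successive calls to scikit-learn's `train_test_split`; a minimal sketch on a hypothetical corpus of 1,000 samples:

```python
from sklearn.model_selection import train_test_split

# Hypothetical corpus and labels standing in for the real dataset.
samples = [f"sentence {i}" for i in range(1000)]
labels = (["neutral", "positive", "negative"] * 334)[:1000]

# First carve off 10% for validation + testing, then split that half-and-half.
train_x, rest_x, train_y, rest_y = train_test_split(
    samples, labels, test_size=0.10, random_state=42)
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=0.50, random_state=42)

print(len(train_x), len(val_x), len(test_x))  # → 900 50 50
```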
The confusion matrices for both languages in Fig. 6 show that our model predicts the neutral class correctly most often, followed by the positive and then the negative class. One cause of this behavior is the imbalanced dataset, as seen in the distribution (Fig. 1).
Twitter-Reddit
The dataset was split into 90%, 5%, and 5% for training, validation, and testing, respectively. Among the baseline models, the GB model had the highest accuracy (67%), whereas SVM scored the lowest (46%) on the English dataset (Table 5). BERT performed better than the baselines, scoring 84% accuracy. On Luganda (Table 6), GB again performed better than the other baselines, and SVM had the lowest accuracy (47%), slightly higher (+1) than for English. BERT scored 62%, higher than all the baselines.
As shown in the heatmaps (Fig. 7a and Fig. 7b), our model performed well in predicting the true labels for the positive and neutral classes.
Merged dataset
The English dataset was split into 90%, 5%, and 5% for training, validation, and testing, respectively. Luganda was split into 80%, 10%, and 10%; this was done to increase the number of testing and validation samples, as the baseline models were failing on the 5% validation and 5% testing sets. Among the baseline models, RF scored the highest accuracy of 63%, while NB scored the lowest (55%) on the English dataset, as seen in Table 7. BERT outperformed the baselines with 74% accuracy. On the Luganda dataset (Table 8), NB, SVM, and GB all scored the same accuracy of 55%, while RF scored the lowest (52%). BERT outperformed all the baselines, scoring 59% accuracy.
The heat areas in the confusion matrices in Fig. 8a and Fig. 8b show that the English and Luganda models mostly predicted the positive and neutral class true values correctly. We cannot directly compare the two models' performance, as the data was split differently: the Luganda dataset had more samples in the testing set than English.

DISCUSSION
In this paper, we have demonstrated that it is possible to fine-tune BERT on Luganda with a dataset as small as 3,885 samples. Our results further show the advantage of deep learning models over machine learning ones. Even though we used the same datasets in all three experiments, we achieved different accuracies for the two languages. The Luganda performance was slightly lower than English due to two main factors.

(1) Translation quality: the translation from English to Luganda was limited by the quality of Google Translate, which has a chrF score of 39.8 for this direction [4]. In all our Luganda datasets, some English words were not translated into Luganda. We assume this is because all of our datasets were extracted from social media, which is known for code-mixed text and jargon, so corresponding Luganda words may not have existed in the translator. This is most evident in the low performance on Luganda of the domain-specific Twitter-Reddit dataset, compared to English, which had the highest accuracy overall in all our experiments. Domain-specific knowledge in health could be used to improve the translation quality.

(2) Luganda is not among the languages used in pre-training BERT or multilingual BERT [12]. Nevertheless, the model was able to learn the language. We anticipate that this is because BERT was pre-trained to understand the syntax and semantics of similar or related languages. Additionally, the subword tokenization [35] used by BERT during pre-training can work effectively on Luganda. This is also evidenced in our previous work, which used BPE [8] as the subword tokenization approach to build NMT models for English and Luganda [18]. Lastly, words from related languages, if translated, tend to be represented in a similar vector space [19]; we therefore believe that since we converted the exact text from English to Luganda, the vector representations of the two languages were closely related.

We have also been able to build our domain-specific health dataset from both Twitter and Reddit. Extracting N-grams from the Twitter dataset and using them as keywords to extract data from Reddit helped us build this dataset. Other works have used N-grams for tasks such as interesting-concept discovery to find relatedness between concepts [34], hate speech detection in code-mixed social media [22], and opinion mining, where N-grams are used as features in techniques like TF-IDF [15]. Even though these works do not use N-grams to extract data, they show that N-grams can capture relevant features for multiple tasks.
Future work could explore the use of other libraries, such as NLTK's VADER as used by [31], and APIs like that of MonkeyLearn as used by [32], for labelling the datasets. Secondly, one could explore the linguistic features of Luganda, as has been done for other languages [1], to build custom deep learning models. Pre-training BERT on Luganda could also be explored.

Figure 1: Distribution of labels in the public dataset

Figure 4: Text extraction and annotation from Twitter

Figure 5: Wordcloud for the Twitter and Reddit merged dataset

Figure 6: Confusion Matrix of the BERT model for the Public dataset

Table 1: Maximum text length per dataset

Table 5: Performance on English (Twitter and Reddit data)