Abstract
With the development of the internet, comments can be made on many topics on web platforms. Analyzing the data in these comments is essential for companies and data scientists. There are many methods for analyzing such data; recently, language models have been used in many studies for sentiment analysis and text classification. In this study, Turkish sentiment analysis is performed using language models on hotel and movie review datasets. Language models were chosen because they are rarely used in the Turkish literature. Pre-trained BERT, ALBERT, ELECTRA, and DistilBERT models for the Turkish language are trained and tested with these datasets. In addition, a text filtering method, which removes words that express the opposite sentiment in positively or negatively labeled texts, is proposed for sentiment analysis. The datasets obtained with this method are also retrained with the language models, and the accuracy values of the resulting models are measured. The results of this study are compared with previous studies using the same datasets. As a result of the analysis, the language models achieve state-of-the-art accuracy values compared to previous studies. The best performance is achieved by training the ELECTRA language model with the proposed text filtering method.
1 INTRODUCTION
In recent years, social media platforms and websites have become places where users and customers freely express opinions about products, services, and platforms. Companies and organizations want to receive feedback from these thoughts and comments, but the large amount of data makes manual analysis difficult. In the age of big data, sentiment analysis (SA) has become one of the most popular areas in natural language processing (NLP), as it enables the investigation of the emotional tendencies of this data through artificial intelligence technology. SA is a type of text classification that draws on NLP, machine learning, data mining, information retrieval, and other research areas [35]. SA aims to measure the polarity of people’s thoughts or interpretations and is considered a text classification problem [5].
SA is generally applied per language (Turkish, English, etc.) and per domain (movies, hotels, etc.). When domain-specific terms are analyzed, the term “small” has negative polarity for room size in the hotel domain but positive polarity for battery size in the camera domain [9]. SA determines the polarity of a text given as unstructured data. The polarity, or sentiment, of the text can be positive, neutral, or negative. Today, SA is used as a decision-making tool in many fields, such as marketing, politics, social events, and even finance [12].
SA is closely related to NLP, so it is heavily language dependent. Most SA studies are conducted in English; however, research is needed in other languages, and there are very few studies for Turkish [8]. Since Turkish is an agglutinative language, NLP operations on it are complicated. For example, parsing the word “yapamayacaktır” into “yap-a-ma-y-acak-tır” is quite complicated. However, despite this complexity, the suffixes at the end of words also carry useful information about how the words are used [20].
Many methods and models, such as language models (LMs) and machine learning, are used in SA; LMs built with deep learning methods are one example. LMs analyze the structure of texts to provide a basis for word predictions. LMs can be divided into two types: unidirectional and bidirectional. Given a token sequence as input, a unidirectional LM factorizes the sequence and assigns a probability to it. A bidirectional LM, on the other hand, assigns a probability to the input sequence by using the left and right context of each word along with its position [21]. LMs can be used pre-trained. Pre-trained LMs are neural networks trained on a large text corpus, and these models can be fine-tuned for other target tasks. In particular, LMs like BERT are used in different natural language tasks, including text classification and question answering [13]. In this study, bidirectional LMs, namely BERT, DistilBERT, ALBERT, and ELECTRA, were used to perform Turkish SA. While many studies use LMs for English SA, there are very few for Turkish SA. LMs were used here to contribute to the literature on the Turkish language and to analyze the use of LMs in Turkish SA. In addition, a text filtering method (TFM) that can be applied to datasets for SA is proposed. With TFM, the top “k” (5, 10, 15) most frequent words in positively and negatively labeled texts, which can express opposition or neutrality, are detected and then removed from the opposite-labeled texts. The contributions of this research can be summarized as follows:
— Considering the Turkish literature, this is the first study comparing four LMs for Turkish SA.
— TFM has been proposed so that it can eliminate some opposing words found in positively and negatively labeled texts for SA.
— The accuracy values of LMs with or without TFM are compared with previous studies, and the positive effect of LMs has been proven by experiments.
The remainder of the article is structured as follows. Literature research on LMs for SA and Turkish SA is explained under the “Related Work” section. Section 3 describes the datasets, libraries used for preprocessing, LMs, and the proposed TFM. Experiments and results for datasets and LMs are analyzed in Section 4. Finally, the conclusions of this research and future works are discussed in the last section.
2 RELATED WORK
In the literature, the uses of LMs for text classification and Turkish SA studies are given under separate subheadings. Few studies use LMs for Turkish SA; therefore, studies on SA in other languages are also described.
2.1 Turkish Sentiment Analysis
Ozyurt and Akcayol [20] proposed a topic model–based method called Sentence Segment LDA for aspect-based SA. In their experimental results, they revealed that the proposed method was quite successful in extracting product aspects. Yildirim et al. [36] translated Turkish texts into English with machine translation and performed SA on the translated texts. They used machine learning methods on hotel and movie review datasets and showed that the translation step increased accuracy. Shehu et al. [26] used deep learning algorithms for SA of Turkish tweets. They proposed three data augmentation techniques (Shift, Shuffle, and Hybrid) to improve data diversity. As a result of the analysis, deep learning outperformed machine learning methods. Çoban et al. [7] applied SA to Facebook data for the Turkish language. They analyzed the success of deep learning and machine learning methods and stated that deep learning methods were more successful. Ciftci and Apaydin [5] applied Turkish SA using long short-term memory (LSTM) methods on a dataset obtained from shopping and movie websites. They showed that recurrent neural network (RNN)–based approaches improve classification accuracy. Yurtalan et al. [37] proposed a linguistically appropriate dictionary-based polarity determination and calculation approach for sentence polarity in SA. They tested the proposed system on Twitter using different datasets and showed that it was more successful than the word-based SA systems previously developed for Turkish. Erşahin et al. [11] presented a hybrid approach combining dictionary-based and machine learning–based approaches. They showed that their proposed methodology increases the success of SA on Turkish hotel, movie, and tweet datasets. Ucan et al. [31] proposed an automatic translation method to create a Turkish sentiment dictionary. The proposed method is independent of language and domain.
Accordingly, they derived three sentiment dictionaries for the Turkish language from SentiWordNet. In the results obtained from the three Turkish dictionaries, the translation approach performed well on positive terms and gave more reliable results for them than for negative ones. Catal and Nangir [4] investigated the possible benefits of multiple classifier systems for Turkish SA and proposed a new classification technique. Experimental results show that their multiple classifier system increases success. Uysal et al. [32] proposed a tool called SentiMedia to automatically classify the polarity of Turkish product reviews, which takes into account the linguistic features of texts to measure and summarize customer satisfaction. They measured the success of the proposed tool with machine learning methods and achieved a high accuracy value.
2.2 Language Models for Sentiment Analysis
Using comments on social media platforms, Othan et al. [19] predicted the direction of stocks on the Turkish stock market (BIST100). CNN, RNN, and LSTM methods and the BERT language model were used for classification. As a result of the analysis, they showed that using the Turkish BERT model increased success. González-Carvajal and Garrido-Merchán [14] presented a comparison of BERT and classical NLP approaches. They tested the behavior of BERT against traditional machine learning methods for IMDB reviews, hotel reviews, and news analysis, and their experiments demonstrated the superiority of BERT. Siğirci et al. [27] performed SA on Turkish comments collected from Google Play. They measured the success of the Turkish BERT model on data with two and five classes and showed that the BERT model achieved high success with a small amount of data. Sousa et al. [29] used the BERT model to perform SA of news articles. They fine-tuned a BERT model on a dataset of stock market articles and achieved an F-score of 72.5%. Acikalin et al. [2] used the multilingual BERT model for Turkish SA. They analyzed movie and hotel review datasets through English translation and achieved high accuracy. Singh et al. [28] performed SA on tweets using the BERT model to understand people’s mental states about COVID. They analyzed the success of the model on two datasets and obtained an accuracy value of 94%. Li et al. [17] performed SA of investors’ stock market reviews. First, they extracted sentiment values from the information published by the investors with the BERT model; these sentiment values were then weighted to calculate a sentiment indicator. They showed that the BERT model gave better results than both LSTM and SVM methods. Guven [15] applied sentiment analysis to Turkish tweets, using machine learning methods and the Turkish BERT model for classification.
As a result of the evaluations, it was shown that the BERT model is more successful than machine learning methods. Farha and Magdy [1] evaluated the performance of Arabic pre-trained ELECTRA, ALBERT, and BERT LMs on Arabic SA. They demonstrated that ELECTRA is one of the best-performing models. Büyüköz et al. [3] tested the ELMo and DistilBERT models on socio-political and local English news. They showed that DistilBERT transfers general semantic information better than ELMo. Pipalia et al. [22] analyzed the success of pre-trained LMs such as BERT, DistilBERT, and XLNet in SA. The XLNet model achieved the most successful result on the IMDB review dataset. Pota et al. [23] proposed a different approach for Twitter SA. First, they converted tweet jargon into plain text with procedures applicable to different languages. Then, they classified the resulting tweets using the BERT model. Tokgoz et al. [30] used pre-trained Turkish BERT and DistilBERT language models for Turkish news classification. They performed analyses using different tokenization methods and showed that the DistilBERT model was more successful than BERT. Van Thin et al. [33] used supervised learning methods to compare performance between task approaches based on deep learning frameworks and a BERT architecture trained on the Vietnamese language. They showed that the multitask approach based on the BERT architecture is more successful than neural network architectures and single-task approaches.
3 METHODOLOGY
In this study, certain processes are applied sequentially; the stages are shown in Figure 1. First, preprocessing is performed on the hotel and movie datasets. Then, these datasets are trained with pre-trained LMs. The trained models are evaluated with test data, and the accuracy values of the LMs are obtained. In addition, the accuracy values of these models are measured by retraining the LMs after applying TFM to the hotel and movie datasets.
Fig. 1. The stages of this study.
3.1 Datasets Description
Turkish hotel and movie review datasets1 created by Hacettepe University for SA are used. Movie reviews were obtained from beyazperde.com and hotel reviews from otelpuan.com. All movie reviews were labeled by the authors according to the star ratings on the site: one or two stars are labeled negative, and four or five stars are labeled positive. Hotel reviews were rated from 0 to 100; a score of 0 to 40 was chosen as negative, and a score of 80 to 100 as positive. These score ranges were determined as a result of the evaluation of the sentences by experts. The distribution of the datasets is shown in Figure 2. Positive and negative labels are split evenly between the training and test sets for both datasets [31]. In Figure 3, the datasets are visualized with word clouds.
Fig. 2. Distribution of the datasets.
Fig. 3. Word clouds for each dataset (Commonly used words: güzel (beautiful), otel (hotel), oda (room), tatil (holiday), yemek (eat), kalmak (stay), berbat (awful), iyi (good), kötü (bad), temiz (clean), havuz (pool), izlemek (watch), değil (not), film (movie), etc.).
3.2 Libraries
Libraries belonging to Python and Java are used to remove stopwords in the texts, correct misspelled words, and lemmatize words. The analyses of these processes are explained in detail under Section 4.1.
3.2.1 Removing Stopwords.
The stopwords in the NLTK2 library are used for the Turkish language. Since the Turkish word list in the library is limited, the stopword list has been expanded.
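In essence, stopword removal is a set-membership filter over tokens. The sketch below illustrates this with a small hand-picked subset of Turkish stopwords; the study itself uses NLTK's Turkish list extended with additional words, so the list here is only an assumption for illustration.

```python
# Illustrative subset of Turkish stopwords; the actual study uses NLTK's
# Turkish list expanded with additional words.
TURKISH_STOPWORDS = {"ve", "bir", "bu", "için", "çok", "da", "de", "ama"}

def remove_stopwords(text: str, stopwords: set = TURKISH_STOPWORDS) -> str:
    """Drop every token that appears in the stopword set."""
    return " ".join(w for w in text.split() if w not in stopwords)

print(remove_stopwords("çok güzel ve eğlencesi bol bir tatil"))
# → "güzel eğlencesi bol tatil"
```

This mirrors the before/after examples shown later in Table 2.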
3.2.2 Text Spelling Correction.
Some NLP tasks require text normalization as a preprocessing step before the actual algorithms are applied. Normalization yields better results, especially for social media and forum texts, chat, messaging, or bot applications. This tool can be used to correct misspelled words or informal expressions in noisy texts. Zemberek3 uses various heuristics, lookup tables, and LMs for text normalization. First, words are extracted from clean and noisy corpora using morphological analysis. With some heuristics and LMs, some words are split in two. Correct, incorrect, and possibly incorrect word sets are generated from the corpus. For each noisy word in a sentence, candidates are collected from lookup tables, informal and ASCII-matching morphological analysis, and a spell checker. The Viterbi algorithm is then run on the candidate words with LM scoring to obtain the most probable normalized word sequence.
3.2.3 Lemmatization.
Zeyrek4 is a Python morphological analyzer and lemmatizer for the Turkish language. It can perform morphological analysis of Turkish text, returning all possible parses for each word and all possible base word forms by splitting words into morphemes. This library is used for lemmatization in this study.
3.3 Language Models
3.3.1 BERT.
The BERT model is defined as a Transformer-based bidirectional encoder representation. BERT produces contextual, bidirectional word representations. BERT introduces a new training objective with the “masked language model” (MLM) method: the MLM randomly masks some tokens in the input, and its purpose is to predict the original masked tokens based only on their context [10].
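The masking step can be illustrated with a toy sketch. This is a simplification under stated assumptions: real BERT masks about 15% of tokens, of which 80% become [MASK], 10% are swapped for random tokens, and 10% are left unchanged; the sketch below performs only the [MASK] replacement, and the example sentence is invented.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly hide tokens, returning the masked sequence and a map of
    position -> original token that the model must predict."""
    rng = random.Random(seed)
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok          # target the model must recover
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, labels

masked, labels = mask_tokens("otel çok güzel ve temiz bir yer".split(), mask_prob=0.3)
```

The training loss is then computed only at the masked positions recorded in `labels`.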
BERT’s structure includes pre-training and fine-tuning stages. In pre-training, the model is trained with unlabeled data over different pre-training tasks. In the fine-tuning phase, the model is first initialized with the pre-trained parameters; all parameters are then fine-tuned using labeled data from downstream tasks [10].
In terms of size, the BERT model has Base and Large variants. While the Base model is trained with fewer parameters and layers, the Large model uses more parameters and layers. The model also has multilingual support: the multilingual BERT vocabulary covers 104 languages, including Turkish. The multilingual BERT model is based on WordPiece tokenization, which handles unknown words when representing word vectors [18].
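WordPiece segmentation can be sketched as a greedy longest-match-first loop. The vocabulary below is a toy set built around the example word from the introduction, not the real multilingual BERT vocabulary, and the function omits details such as case normalization.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first WordPiece segmentation (simplified;
    the real tokenizer also normalizes case and handles punctuation)."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            sub = word[start:end] if start == 0 else "##" + word[start:end]
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:          # no subword matches: emit the unknown token
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary covering the example word "yapamayacaktır".
vocab = {"yap", "##ama", "##yacak", "##tır"}
print(wordpiece_tokenize("yapamayacaktır", vocab))
# → ['yap', '##ama', '##yacak', '##tır']
```

Because unknown whole words decompose into known subwords, out-of-vocabulary Turkish word forms can still be represented.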
3.3.2 DistilBERT.
DistilBERT was developed from the BERT model. It is smaller and faster than BERT while being pre-trained on the same corpus, and it is pre-trained on raw text only, generating inputs and labels automatically from the texts [24]. The model is pre-trained with three objectives:
— Distillation loss: the student model is trained to produce the same output probabilities as the teacher BERT model.
— MLM: this is part of the BERT model; given a sentence, the model randomly masks some of the words in the input and predicts them.
— Cosine embedding loss: the student model is also trained to generate hidden states close to those of the teacher BERT model.
3.3.3 ELECTRA.
ELECTRA is used to pre-train transformer networks with less computation than BERT. It has been applied to Transformer [34] text encoders. ELECTRA models learn to distinguish “real” input tokens from “fake” ones produced by another neural network. The purpose of this model is to train a text encoder to distinguish input tokens from high-quality negative samples produced by a small transformer network. Compared to masked language modeling, it is more computationally efficient and performs better on downstream tasks [6].
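The replaced-token-detection objective described above can be sketched as follows. This shows only the construction of the discriminator's targets, not the generator or the training loop, and the example tokens are invented.

```python
def replaced_token_labels(original, corrupted):
    """Discriminator targets for ELECTRA-style replaced token detection:
    1 where the generator swapped the token, 0 where it is the original."""
    return [int(o != c) for o, c in zip(original, corrupted)]

orig = "film çok güzel bir yapım".split()
fake = "film çok kötü bir yapım".split()   # a small generator replaced one token
print(replaced_token_labels(orig, fake))   # → [0, 0, 1, 0, 0]
```

Because a label is produced at every position, rather than only at masked positions as in MLM, each training example yields more learning signal per pass.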
3.3.4 ALBERT.
Due to memory limits and communication overhead, the ALBERT architecture was developed with far fewer parameters than the BERT architecture. ALBERT includes two parameter-reduction techniques that remove major obstacles to scaling pre-trained models. The first technique is factorized embedding parameterization: the large vocabulary embedding matrix is decomposed into two smaller matrices, separating the size of the hidden layers from the size of the word embeddings. The second technique is parameter sharing between layers, which prevents the parameter count from growing with the depth of the network [16].
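The saving from factorized embedding parameterization is easy to quantify. The sizes below (vocabulary V = 30,000, hidden size H = 768, embedding size E = 128) are typical illustrative values, not the exact ALBERT-Tr configuration.

```python
V, H, E = 30_000, 768, 128

bert_style = V * H             # one V x H embedding matrix, as in BERT
albert_style = V * E + E * H   # V x E lookup followed by an E x H projection

print(bert_style, albert_style)
# → 23040000 3938304 (roughly a 6x reduction in embedding parameters)
```

The reduction grows as the hidden size H increases, since the vocabulary-sized matrix stays tied to the small E rather than to H.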
3.4 Text Filtering Method
TFM has been proposed to remove some words that cause opposite sentiments in the positively and negatively labeled data in the datasets and thus to increase the accuracy value. First, the preprocessed positively and negatively labeled texts in the training set are separated. The frequency of every word is calculated separately for the positive and negative texts. The top “k” most frequent words in positively labeled texts are then removed from the negatively labeled texts, and the top “k” most frequent words in negatively labeled texts are removed from the positively labeled texts. Thus, the aim is to remove the words that cause opposite sentiments in positive and negative sentences. Neutral words that don’t express sentiment can also appear among these top “k” words, and such words can be common to the positively and negatively labeled texts. Removing these words also reduces the amount of data. The “k” value was set to 5, 10, and 15, respectively. Considering that TFM removes neutral and opposite-sentiment words from the texts and reduces the amount of data, it is predicted that the accuracy value may increase. The pseudocode of TFM is shown in Figure 4.
Fig. 4. The pseudocode of TFM.
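The procedure can also be sketched in a few lines of Python. This is a minimal illustration assuming whitespace-tokenized, preprocessed texts, not the authors' exact implementation.

```python
from collections import Counter

def text_filtering_method(pos_texts, neg_texts, k):
    """Minimal sketch of TFM: the top-k most frequent words of each class
    are removed from the opposite class, so words frequent in both classes
    (e.g., domain words) vanish from both."""
    def top_k(texts):
        return {w for w, _ in Counter(w for t in texts for w in t.split()).most_common(k)}

    def strip(texts, banned):
        return [" ".join(w for w in t.split() if w not in banned) for t in texts]

    pos_top, neg_top = top_k(pos_texts), top_k(neg_texts)
    return strip(pos_texts, neg_top), strip(neg_texts, pos_top)

pos = ["otel güzel güzel", "otel temiz"]
neg = ["otel kötü", "kötü berbat otel"]
print(text_filtering_method(pos, neg, k=2))
# → (['güzel güzel', 'temiz'], ['kötü', 'kötü berbat'])
```

In this toy example, “otel” is frequent in both classes and is therefore removed from both, while “kötü” is removed only from the positive texts and “güzel” only from the negative texts, matching the behavior described above.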
4 EXPERIMENTS AND RESULTS
4.1 Datasets Preparation
Before the datasets are trained on the LMs, the texts are preprocessed. In the preprocessing stage, punctuation removal, conversion to lowercase, spelling correction, and stopword removal are applied to each dataset. Then, lemmatization is performed on the remaining words, and the resulting text documents are given to the LMs for training on each dataset.
In most social media texts, there are problems such as missing letters and repeated letters in words. Therefore, it is important to perform spelling correction before the text is given to any model. An example of spelling correction applied with Zemberek is given in Table 1. As a result of this process, different misspelled forms of the same word were corrected to a single correct form.
| Dataset | Stage | Text |
|---|---|---|
| Hotel | First version | TR: Cok guzel ve eglencesi bol bır tatıl gecırdım butun departmanlara bu guzel tatıl ıcın tessekkur edıyorum |
| | | EN: I had a veryy nicee and fun holidey, I would like to thanq all departments for this beuatiful holidey. |
| | Spelling correction | TR: Çok güzel ve eğlencesi bol bir tatil geçirdim bütün departmanlara bu güzel tatil için teşekkür ediyorum |
| | | EN: I had a very nice and fun holiday, I would like to thank all departments for this beautiful holiday. |
| Movie | First version | TR: süpper bir film gidin izleyinn süperrr |
| | | EN: supper a movie go wattch it, superrrr |
| | Spelling correction | TR: süper bir film gidin izleyin, süper |
| | | EN: super a movie go watch it, super |
Table 1. Spelling Correction Process in Texts (TR: Turkish, EN: English)
After the spelling correction process, stopwords that don’t express specific meaning were removed from the texts. An example of stopword removal is shown in Table 2. Thus, the volume of data to be used in the training phase decreased.
| Dataset | Stage | Text |
|---|---|---|
| Hotel | Normal text | TR: Çok güzel ve eğlencesi bol bir tatil geçirdim bütün departmanlara bu güzel tatil için teşekkür ediyorum |
| | | EN: I had a very nice and fun holiday, I would like to thank all departments for this beautiful holiday. |
| | Stopword removal | TR: güzel eğlencesi bol tatil geçirdim departmanlara güzel tatil teşekkür ediyorum |
| | | EN: nice fun holiday thank departments beautiful holiday |
| Movie | Normal text | TR: süper bir film gidin izleyin, süper |
| | | EN: super a movie go watch it, super |
| | Stopword removal | TR: süper film gidin izleyin süper |
| | | EN: super movie go watch super |
Table 2. Removal of Stopwords
To use words in a uniform form, lemmatization was applied to the texts in the datasets. Words carrying the same stem with suffixes in different forms were reduced to a single form as a result of lemmatization. Thus, it is expected that a more accurate result will be obtained for text classification. An example of lemmatization is given in Table 3.
| Dataset | Stage | Text |
|---|---|---|
| Hotel | Normal text | TR: güzel eğlencesi bol tatil geçirdim departmanlara güzel tatil teşekkür ediyorum |
| | | EN: nice fun holiday thank departments beautiful holiday |
| | Lemmatization | TR: güzel eğlence bol tatil geç departman güzel tatil teşekkür et |
| | | EN: nice fun holiday thank department beautiful holiday |
| Movie | Normal text | TR: süper film gidin izleyin süper |
| | | EN: super movie go watch super |
| | Lemmatization | TR: süper film git izle süper |
| | | EN: super movie go watch super |
Table 3. Lemmatization Process in Texts
In addition, the proposed TFM was applied to the hotel and movie datasets. With TFM, words with a negative meaning in positively labeled texts and words with a positive meaning in negatively labeled texts were removed. At the same time, neutral words among the top “k” most frequent words for each label, which wouldn’t affect the classification, were also removed. The top “k” most frequent words removed by TFM from the hotel and movie datasets are shown in Table 4. The table shows that the neutral words “otel (hotel), yemek (eat), kalmak (stay), oda (room)” for the hotel dataset and “film (movie), izlemek (watch), sinema (cinema), sahne (scene)” for the movie dataset are common to positively and negatively labeled texts, so these words were deleted from the entire dataset (words in Table 4: berbat (awful), kötü (bad), değil (not), güzel (beautiful), iyi (good), memnun (glad), tavsiye (advice), harika (wonderful), mükemmel (perfect), etc.).
| Word (Hotel, Negative) | Count | Word (Hotel, Positive) | Count | Word (Movie, Negative) | Count | Word (Movie, Positive) | Count |
|---|---|---|---|---|---|---|---|
| otel | 7,781 | güzel | 2,130 | film | 29,967 | film | 26,039 |
| yemek | 3,314 | iyi | 1,187 | izlemek | 4,722 | izlemek | 5,677 |
| oda | 2,680 | otel | 1,061 | kötü | 3,090 | iyi | 4,897 |
| tatil | 1,949 | memnun | 918 | değil | 3,069 | güzel | 4,357 |
| berbat | 1,770 | kalmak | 849 | oyun | 2,477 | oyun | 2,540 |
| kötü | 1,528 | yemek | 774 | konu | 2,184 | iz | 2,436 |
| değil | 1,504 | tesis | 731 | sahne | 2,107 | sinema | 1,742 |
| kalmak | 1,475 | tavsiye | 633 | sinema | 1,757 | harika | 1,716 |
| havuz | 1,376 | hizmet | 524 | senaryo | 1,667 | mükemmel | 1,492 |
| para | 1,241 | personel | 497 | anlamak | 1,636 | sahne | 1,411 |
| temiz | 1,231 | oda | 466 | iz | 1,487 | değil | 1,391 |
| iyi | 1,044 | temiz | 425 | beğenmek | 1,435 | anlamak | 1,270 |
| güzel | 947 | havuz | 362 | puan | 1,314 | konu | 1,248 |
| personel | 944 | animasyon | 284 | arkadaş | 1,271 | hayat | 1,136 |
| rezalet | 901 | tatil | 283 | hayal | 1,166 | süper | 1,101 |
Table 4. The Removed Top 15 Words According to Train Sets by TFM
As a result of preprocessing and applying TFM to the hotel and movie datasets, the amount of data to be used decreased considerably, which also saves training time. In addition, a more accurate classification is possible after words in different forms of the same lemma were converted to a single form. The change in word counts in the datasets after these processes is shown in Table 5. The table shows that after preprocessing, the total word count decreased by approximately 47% and the unique word count by approximately 86% for the hotel and movie datasets. The TFM process further decreased the total word count by approximately 11.5% for the hotel and 18.8% for the movie dataset.
| | Hotel | Movie |
|---|---|---|
| Total word count | 916,250 | 2,007,114 |
| Total word count after preprocessing | 488,356 | 1,062,197 |
| Unique word count | 91,458 | 199,038 |
| Unique word count after preprocessing | 11,097 | 27,506 |
| Total word count after TFM (k = 5) | 450,097 | 902,943 |
| Unique word count after TFM (k = 5) | 11,096 | 27,503 |
| Total word count after TFM (k = 10) | 432,785 | 864,698 |
| Unique word count after TFM (k = 10) | 11,094 | 27,499 |
| Total word count after TFM (k = 15) | 405,032 | 836,205 |
| Unique word count after TFM (k = 15) | 11,088 | 27,494 |
Table 5. Change of Word Counts in Datasets After Processes
Statistics of the words expressing opposite sentiment among the removed words were analyzed. As a result of applying TFM, approximately 56,000 words were removed for the hotel dataset and 190,000 for the movie dataset. These include words with a positive meaning such as “iyi (good), güzel (beautiful), harika (wonderful)” and words with a negative meaning such as “kötü (bad), değil (not), berbat (terrible).” The statistics of words removed from opposite-labeled texts are shown in Figure 5. For the hotel and movie datasets, respectively, 11.6% and 7.4% of the words deleted from the opposite label help prevent misclassification.
Fig. 5. Statistics of words expressing opposite sentiment (k = 10).
4.2 Analysis and Comparison of Language Models
BERT, DistilBERT, ELECTRA, and ALBERT LMs were used to analyze sentiment on the datasets. Turkish versions of these LMs are available: BERT-Multilingual,5 BERT-Tr,6 DistilBERT-Tr,7 ALBERT-Tr,8 and ELECTRA-Tr.9 Details of the language models for Turkish are shown in Table 6. The models were trained with the Turkish hotel and movie corpora.
| Language Model | Information |
|---|---|
| BERT-Multilingual | The model is pre-trained with Wikipedia articles in 102 languages. The entire Wikipedia dump for each language was taken as training data. However, the size of Wikipedia for a given language varies widely. |
| BERT-Tr | Training was carried out only with a Turkish corpus. The model was trained with a filtered and sentence-segmented version of the Turkish OSCAR corpus together with a recent Wikipedia dump (35 GB in total). |
| DistilBERT-Tr | Since it uses knowledge distillation in the pre-training phase, its size is quite small compared to the BERT model. The model was trained on the original Turkish training data (7 GB) using the cased version of BERTurk [25]. |
| ALBERT-Tr | It has far fewer parameters than the BERT architecture and implements two parameter-reduction techniques. A dataset (200 GB) consisting of texts such as online blogs, free e-books, newspapers, Twitter, articles, Wikipedia, and so on was used for training. |
| ELECTRA-Tr | It uses less computation than BERT in pre-training. Instead of BERT’s MLM method, it trains a text encoder that distinguishes input tokens. The model was trained with the Turkish OSCAR corpus together with a recent Wikipedia dump (35 GB in total). |
Table 6. The Details of the Turkish LMs
New LMs were trained in this study using the preprocessed datasets and these pre-trained models. The training phase was performed with Google Colab’s GPU (Tesla K80 GPU, 12.89 GB RAM, 65 GB disk). The accuracy values on the test sets were measured for the newly trained LMs. The accuracy values obtained by all LMs for the datasets and the file sizes of the LMs are given in Table 7. The table shows that BERT-Tr for the hotel dataset and ELECTRA-Tr for the movie dataset were the most successful models.
| Model | Hotel Accuracy (%, MB) | Movie Accuracy (%, MB) |
|---|---|---|
| New BERT-Multilingual | 78.46 (653.8 MB) | 80.88 (653.8 MB) |
| New BERT-Tr | 90.18 (432.2 MB) | 90.47 (432.2 MB) |
| New DistilBERT-Tr | 89.27 (266 MB) | 89.72 (266 MB) |
| New ALBERT-Tr | 89.65 (46.6 MB) | 89.30 (46.6 MB) |
| New ELECTRA-Tr | 89.29 (432 MB) | 90.67 (432 MB) |
Table 7. The Accuracy Values and File Sizes of the Trained LMs on the Datasets Without TFM
Then, the effect of the proposed TFM on the accuracy values of these LMs was analyzed. The datasets passed through this filtering technique were retrained on all LMs with the same parameters for each “k” value (5, 10, 15). The accuracy values on the test sets for the retrained LMs and the file sizes of the LMs are shown in Table 8. TFM increased the accuracy values, as it removes opposing and neutral words from positive and negative texts.
| Model | Hotel, k = 5 | Hotel, k = 10 | Hotel, k = 15 | Movie, k = 5 | Movie, k = 10 | Movie, k = 15 | Model Size (Hotel, Movie) |
|---|---|---|---|---|---|---|---|
| Retrained BERT-Multilingual | 94.76 | 95.88 | 93.65 | 82.82 | 84.69 | 83.30 | 653.8, 653.8 MB |
| Retrained BERT-Tr | 98.01 | 97.86 | 96.36 | 92.04 | 92.11 | 90.96 | 432.2, 432.2 MB |
| Retrained DistilBERT-Tr | 97.74 | 97.67 | 95.12 | 91.34 | 92.00 | 90.89 | 266, 266 MB |
| Retrained ALBERT-Tr | 97.10 | 96.63 | 94.12 | 91.75 | 91.71 | 90.43 | 46.6, 46.6 MB |
| Retrained ELECTRA-Tr | 97.74 | 98.38 | 95.31 | 92.08 | 92.21 | 91.48 | 394.2, 432 MB |
Table 8. The Accuracy Value and File Sizes of Retrained LMs with TFM
The accuracy chart of the LMs obtained after preprocessing and applying TFM is shown in Figure 6. The figure shows only the results for k = 10, where TFM gives the best results on the LMs. For the hotel dataset, the new BERT-Tr model was the most successful with 90.18%, while the retrained ELECTRA-Tr model achieved an accuracy of 98.38% with TFM. For the movie dataset, the new ELECTRA-Tr was the most successful with 90.67%, and the accuracy reached 92.21% with the retrained ELECTRA-Tr after applying TFM. The figure shows that the accuracy of all trained LMs10 increased with TFM. When all the models are compared, BERT-Multilingual was the least successful on both datasets. The reason is that this model was trained on articles in 102 languages rather than only Turkish, so its accuracy values are considerably lower than those of the other models, which were trained only on Turkish. ELECTRA-Tr, ALBERT-Tr, and DistilBERT-Tr are derived from BERT by changing the structure and number of parameters of the BERT model; therefore, the accuracy values of these models can be close.
Fig. 6. The accuracy (%) chart of all models (TFM for k = 10).
Many previous studies use the same hotel and movie datasets, so the effect of these LMs can be compared against them. The accuracy values of previous studies and of our LMs are shown in Table 9; among the trained LMs, only the most successful ones are included. Compared to the literature, the trained LMs combined with TFM are quite successful: the highest accuracy, 98.38% for the hotel dataset and 92.21% for the movie dataset, was achieved with TFM for k = 10.
| Study | Hotel Accuracy (%) | Movie Accuracy (%) |
|---|---|---|
| Ucan et al. [31] | 80.70 | 84.60 |
| Erşahin et al. [11] | 91.96 | 86.31 |
| Yildirim et al. [36] | 86.00 | 83.00 |
| New BERT-Tr (Hotel), ELECTRA-Tr (Movie) | 90.18 | 90.67 |
| Retrained BERT-Tr (Hotel), ELECTRA-Tr (Movie) with TFM, k = 5 | 98.01 | 92.08 |
| Retrained ELECTRA-Tr with TFM, k = 10 | 98.38 | 92.21 |
| Retrained BERT-Tr (Hotel), ELECTRA-Tr (Movie) with TFM, k = 15 | 96.36 | 90.67 |
Table 9. Comparison of Previous Studies and our LMs
5 DISCUSSION
First, four LMs were analyzed for the SA of Turkish hotel and movie reviews. This contributes to the literature, as no previous study compares the BERT, ALBERT, DistilBERT, and ELECTRA models for Turkish. When the accuracy values of the LMs are compared with previous studies, good results are obtained for both the hotel and movie datasets. Because LMs are built with deep learning methods and model the structure of the language, they give more successful results than earlier approaches.
On the other hand, the proposed TFM applies filtering according to the most frequently used words in the positive- and negative-labeled texts. Words expressing opposition, as well as neutral words, among the top k most frequent words are deleted from the texts with the other label (k = 5, 10, 15). As a result, the amount of data decreased by approximately 11.5% for the hotel dataset and 18.8% for the movie dataset. When the LMs were retrained on the datasets produced by TFM with k = 10, the accuracy values increased on both datasets and reached state-of-the-art performance. Compared with previous studies, applying LMs with TFM increased accuracy by 6.5% for the hotel dataset and 5.75% for the movie dataset. The accuracy chart of previous studies and the LMs is shown in Figure 7. In the TFM analysis, the top 5, 10, and 15 words are removed from the datasets, respectively. Examining the accuracy for these k values, k = 5 is less successful than k = 10 because the opposing words cannot be completely deleted from the datasets, while k = 15 is less successful than k = 10 because too many neutral words are deleted. Therefore, the most appropriate value of k must be determined empirically.
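The filtering idea described above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not the authors' exact implementation: the paper removes only opposition-bearing or neutral words among the top-k, whereas the sketch below removes every top-k word of one label from the texts of the opposite label, and the whitespace tokenization is a placeholder for proper Turkish preprocessing.

```python
from collections import Counter

def top_k_words(texts, k):
    """Return the k most frequent words across a list of texts."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return {w for w, _ in counts.most_common(k)}

def text_filter(pos_texts, neg_texts, k):
    """Sketch of the text filtering method (TFM): words among the top-k
    most frequent words of one label are removed from the texts of the
    opposite label, on the assumption that they carry the opposite (or a
    neutral) sentiment there."""
    pos_top = top_k_words(pos_texts, k)
    neg_top = top_k_words(neg_texts, k)
    filtered_pos = [" ".join(w for w in t.split() if w.lower() not in neg_top)
                    for t in pos_texts]
    filtered_neg = [" ".join(w for w in t.split() if w.lower() not in pos_top)
                    for t in neg_texts]
    return filtered_pos, filtered_neg
```

For example, with k = 1 a negative top word such as "bad" would be dropped from positive reviews, and a positive top word such as "great" from negative reviews, before the filtered datasets are used to retrain the LMs.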
Fig. 7. The accuracy chart of trained LMs with previous studies.
When the file sizes of the LMs are examined, ALBERT and DistilBERT are smaller than the other LMs. Because parameter reduction and knowledge distillation are applied in these models, their sizes are small; yet despite this, they come quite close to the accuracy of the other LMs. The ALBERT and DistilBERT models can therefore be considered when runtime performance is important to the user. The ELECTRA model, which uses fewer computations thanks to its structural changes, is the most successful in terms of accuracy.
6 CONCLUSION AND FUTURE WORKS
In this study, the effect of LMs on SA was analyzed. LMs trained only after preprocessing achieved an accuracy above 90% on both the hotel and movie datasets. Compared to previous studies, the new ELECTRA-Tr model was the most successful for the movie dataset, while for the hotel dataset the new BERT-Tr model achieved accuracy close to that of previous studies. Then, the hotel and movie datasets were updated with the proposed TFM to eliminate opposite-sentiment words in the positive- and negative-labeled texts. After retraining the LMs with TFM, their accuracy increased considerably: between 6% and 17.5% for the hotel dataset and between 1.6% and 3.8% for the movie dataset. This shows that these texts contain many opposite-sentiment words, and the larger increase on the hotel dataset indicates that it contains more of them. The ELECTRA-Tr model retrained after applying TFM was the most successful model, with an accuracy of 98.38% for the hotel dataset and 92.21% for the movie dataset.
As a contribution to the literature, the trained LMs and the use of TFM were compared with previous studies. The results show that LMs and TFM can increase the accuracy of SA: compared with previous studies, applying LMs with TFM increased accuracy by 6.5% for the hotel dataset and 5.75% for the movie dataset. It would therefore be advantageous to use LMs and TFM in SA studies.
As a contribution to the Turkish language, future work will use LMs in different studies such as the analysis of Turkish tweets, spam detection, sarcasm detection, and news headline classification. TFM is also intended to be applied to SA of content such as tweets and comments. In addition, applying LMs to areas such as question answering, text generation, text summarization, and machine translation would be beneficial for the Turkish language.
Footnotes
1 http://humirapps.cs.hacettepe.edu.tr/tsad.aspx.
2 https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/stopwords.zip.
3 https://github.com/ahmetaa/zemberek-nlp.
4 https://zeyrek.readthedocs.io/en/latest/#.
5 https://huggingface.co/bert-base-multilingual-uncased.
6 https://huggingface.co/dbmdz/bert-base-turkish-uncased.
7 https://huggingface.co/dbmdz/distilbert-base-turkish-cased.
8 https://huggingface.co/loodos/ALBERT-base-turkish-uncased.
9 https://huggingface.co/dbmdz/electra-base-turkish-cased-discriminator.
10 https://drive.google.com/drive/folders/1KaEjCWFZz_OA2SBtszAiVK85tm9Xrqq7.
References
- [1] 2021. Benchmarking transformer-based language models for Arabic sentiment and sarcasm detection. In Proceedings of the 6th Arabic Natural Language Processing Workshop. 21–31. https://www.aclweb.org/anthology/2021.wanlp-1.3
- [2] 2020. Turkish sentiment analysis using BERT. In Proceedings of the 2020 28th Signal Processing and Communications Applications Conference (SIU'20).
- [3] 2020. Analyzing ELMo and DistilBERT on socio-political news classification. In Proceedings of the Workshop on Automated Extraction of Socio-political Events from News. 9–18. https://www.aclweb.org/anthology/2020.aespen-1.4
- [4] 2017. A sentiment classification model based on multiple classifiers. Applied Soft Computing Journal 50 (2017), 135–141.
- [5] 2019. A deep learning approach to sentiment analysis in Turkish. In Proceedings of the 2018 International Conference on Artificial Intelligence and Data Processing (IDAP'18).
- [6] 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv:2003.10555. http://arxiv.org/abs/2003.10555
- [7] 2021. Deep learning-based sentiment analysis of Facebook data: The case of Turkish users. Computer Journal 64, 3 (2021), 473–499.
- [8] 2016. SentiTurkNet: A Turkish polarity lexicon for sentiment analysis. Language Resources and Evaluation 50, 3 (2016), 667–685.
- [9] 2017. Sentiment analysis in Turkish at different granularity levels. Natural Language Engineering 23, 4 (2017), 535–559.
- [10] 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Vol. 1. 4171–4186. arXiv:1810.04805
- [11] 2019. A hybrid sentiment analysis method for Turkish. Turkish Journal of Electrical Engineering and Computer Sciences 27, 3 (2019), 1780–1793.
- [12] 2018. Sentiment analysis of Thai financial news. In Proceedings of the 2018 2nd International Conference on Software and E-Business (ICSEB'18). Association for Computing Machinery, New York, NY, 39–43.
- [13] 2020. Understanding BERT rankers under distillation. In Proceedings of the 2020 ACM SIGIR International Conference on Theory of Information Retrieval (ICTIR'20). Association for Computing Machinery, New York, NY, 149–152.
- [14] 2020. Comparing BERT against traditional machine learning text classification. arXiv:2005.13012. http://arxiv.org/abs/2005.13012
- [15] 2021. Comparison of BERT models and machine learning methods for sentiment analysis on Turkish tweets. In Proceedings of the 2021 6th International Conference on Computer Science and Engineering (UBMK'21). 98–101.
- [16] 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv:1909.11942. http://arxiv.org/abs/1909.11942
- [17] 2021. Applying BERT to analyze investor sentiment in stock market. Neural Computing and Applications 33, 10 (2021), 4663–4676.
- [18] 2021. A neural joint model with BERT for Burmese syllable segmentation, word segmentation, and POS tagging. ACM Transactions on Asian and Low-Resource Language Information Processing 20, 4 (May 2021), Article 54, 23 pages.
- [19] 2019. Financial sentiment analysis for predicting direction of stocks using bidirectional encoder representations from transformers (BERT) and deep learning models. In International Conference on Innovative and Intelligent Technologies, Istanbul, Turkey.
- [20] 2021. A new topic modeling based approach for aspect extraction in aspect based sentiment analysis: SS-LDA. Expert Systems with Applications 168 (2021).
- [21] 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP'19). 2463–2473. arXiv:1909.01066
- [22] 2020. Comparative analysis of different transformer based architectures used in sentiment analysis. In Proceedings of the 2020 9th International Conference on System Modeling and Advancement in Research Trends (SMART'20). 411–415.
- [23] 2021. An effective BERT-based pipeline for Twitter sentiment analysis: A case study in Italian. Sensors 21, 1 (2021), 1–21.
- [24] 2019. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv:1910.01108. http://arxiv.org/abs/1910.01108
- [25] 2020. BERTurk — BERT Models for Turkish.
- [26] 2021. Deep sentiment analysis: A case study on stemmed Turkish Twitter data. IEEE Access 9 (2021), 56836–56854.
- [27] 2020. Sentiment analysis of Turkish reviews on Google Play store. In Proceedings of the 2020 5th International Conference on Computer Science and Engineering (UBMK'20). IEEE, 314–315.
- [28] 2021. Sentiment analysis on the impact of coronavirus in social life using the BERT model. Social Network Analysis and Mining 11, 1 (2021).
- [29] 2019. BERT for stock market sentiment analysis. In Proceedings of the International Conference on Tools with Artificial Intelligence (ICTAI'19). 1597–1601.
- [30] 2021. Tuning language representation models for classification of Turkish news. In 2021 International Symposium on Electrical, Electronics and Information Engineering. 402–407.
- [31] 2017. SentiWordNet for new language: Automatic translation approach. In Proceedings of the 12th International Conference on Signal Image Technology and Internet-Based Systems (SITIS'16). 308–315.
- [32] 2017. Sentiment analysis for the social media: A case study for Turkish general elections. In Proceedings of the SouthEast Conference (ACM SE'17). Association for Computing Machinery, New York, NY, 215–218.
- [33] 2021. Two new large corpora for Vietnamese aspect-based sentiment analysis at sentence level. ACM Transactions on Asian and Low-Resource Language Information Processing 20, 4 (May 2021), Article 62, 22 pages.
- [34] 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5999–6009. arXiv:1706.03762
- [35] 2019. Sentiment analysis of comment texts based on BiLSTM. IEEE Access 7 (2019), 51522–51532.
- [36] 2020. Sentiment analysis for Turkish unstructured data by machine translation. In Proceedings of the 2020 IEEE International Conference on Big Data. 4811–4817.
- [37] 2019. A polarity calculation approach for lexicon-based Turkish sentiment analysis. Turkish Journal of Electrical Engineering and Computer Sciences 27, 2 (2019), 1325–1339.