Abstract
Sinhala is a low-resource language for which basic language processing and linguistic tools have not yet been properly developed. This hampers the development of NLP-based end-user applications for Sinhala. Consequently, when implementing NLP tools such as sentiment analyzers, we have to rely on language-independent techniques alone. This article presents the use of such language-independent techniques in implementing a sentiment analysis system for Sinhala news comments. We demonstrate that, for low-resource languages such as Sinhala, recently introduced word embedding models used as semantic features can compensate for the lack of well-developed language-specific linguistic resources, and that text classification with acceptable accuracy is possible using both traditional statistical classifiers and deep learning models. The developed classification models, a corpus of 8.9 million tokens extracted from Sinhala news articles and user comments, and Sinhala Word2Vec and fastText word embedding models are now available for public use; 9,048 news comments annotated with POSITIVE/NEGATIVE/NEUTRAL polarities have also been released.
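The idea of using word embeddings as language-independent semantic features can be illustrated with a minimal sketch. It is not the system described in this article: the two-dimensional vectors, the English placeholder tokens, the helper names (`comment_vector`, `train_centroids`, `predict`), and the nearest-centroid classifier are all assumptions chosen to keep the example self-contained. A real pipeline would load pretrained Word2Vec or fastText vectors and train a statistical or deep learning classifier on the averaged features.

```python
from collections import defaultdict

# Hypothetical 2-dimensional embeddings with placeholder tokens; a real
# system would load pretrained Word2Vec or fastText vectors instead.
EMBEDDINGS = {
    "good": [0.9, 0.1],
    "great": [0.8, 0.2],
    "bad": [0.1, 0.9],
    "awful": [0.2, 0.8],
    "news": [0.5, 0.5],
}


def comment_vector(tokens, embeddings=EMBEDDINGS):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No known token: fall back to a zero vector of the right size.
        return [0.0] * len(next(iter(embeddings.values())))
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def train_centroids(labelled_comments):
    """Compute one mean feature vector (centroid) per polarity label."""
    grouped = defaultdict(list)
    for tokens, label in labelled_comments:
        grouped[label].append(comment_vector(tokens))
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(tokens, centroids):
    """Return the label whose centroid is nearest (squared Euclidean)."""
    vec = comment_vector(tokens)
    return min(
        centroids,
        key=lambda label: sum(
            (a - b) ** 2 for a, b in zip(vec, centroids[label])
        ),
    )


# Toy training set (assumed data, for illustration only).
TRAIN = [
    (["good", "news"], "POSITIVE"),
    (["great"], "POSITIVE"),
    (["bad", "news"], "NEGATIVE"),
    (["awful"], "NEGATIVE"),
]
CENTROIDS = train_centroids(TRAIN)

if __name__ == "__main__":
    print(predict(["great", "news"], CENTROIDS))
    print(predict(["awful", "bad"], CENTROIDS))
```

Nothing in the feature step depends on a POS tagger, morphological analyzer, or sentiment lexicon, which is why averaged embeddings suit a language with few such tools. Only the embedding table is language-specific, and it can be trained from unannotated text.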
Sentiment Analysis of Sinhala News Comments