Abstract
Over the last two decades, sentiment analysis, also known as opinion mining, has become one of the most explored research areas in Natural Language Processing (NLP) and data mining. Sentiment analysis focuses on the sentiments or opinions that consumers express on social media and other websites. Driven by the exposure of such opinions on the Internet, it has attracted a large number of researchers around the globe. A large body of research has been conducted for English, Chinese, and other widely used languages. However, Roman Urdu has been neglected despite being the third most used language for communication in the world, with millions of users around the globe. Although some techniques have been proposed for sentiment analysis in Roman Urdu, they are either limited to a specific domain or developed incorrectly because of the lack of language resources for Roman Urdu. Therefore, in this article, we propose an unsupervised approach for sentiment analysis in Roman Urdu. First, the proposed model normalizes the text to overcome spelling variations of words. After normalization, it uses Roman Urdu and English opinion lexicons to identify users' opinions in the text. It also incorporates negation terms and stemming when assigning a polarity to each extracted opinion. The model then assigns a score to each sentence on the basis of the polarities of the extracted opinions and classifies the sentence as positive, negative, or neutral. To verify our approach, we conducted experiments on two publicly available Roman Urdu datasets and compared it with existing models. The results demonstrate that our approach outperforms existing models for sentiment analysis tasks in Roman Urdu. Furthermore, our approach does not suffer from domain dependency.
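The pipeline described in the abstract (normalize, look up opinion words, handle negation, score, classify) can be illustrated with a minimal sketch. This is not the authors' implementation: the variant map, the lexicon entries, the negation list, and the rule that a negation term directly before or after an opinion word flips its polarity are all hypothetical placeholders chosen for illustration, and stemming is left out for brevity.

```python
import re

# Hypothetical toy resources (assumptions for illustration only);
# the article's actual lexicons and normalization rules are not shown here.
VARIANTS = {"achha": "acha", "achaa": "acha", "buraa": "bura",
            "nahin": "nahi", "nai": "nahi"}
LEXICON = {"acha": 1, "behtareen": 1, "good": 1,
           "bura": -1, "bekar": -1, "bad": -1}
NEGATIONS = {"nahi", "not"}


def normalize(text):
    """Lowercase, tokenize, and map spelling variants to one canonical form."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [VARIANTS.get(tok, tok) for tok in tokens]


def score_sentence(text):
    """Sum the polarities of opinion words, flipping any that are negated."""
    tokens = normalize(text)
    score = 0
    for i, tok in enumerate(tokens):
        polarity = LEXICON.get(tok, 0)
        if polarity == 0:
            continue
        # Assumption: a negation term adjacent to the opinion word negates it.
        before = tokens[i - 1] if i > 0 else ""
        after = tokens[i + 1] if i + 1 < len(tokens) else ""
        if before in NEGATIONS or after in NEGATIONS:
            polarity = -polarity
        score += polarity
    return score


def classify(text):
    """Map the sentence score to positive, negative, or neutral."""
    score = score_sentence(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for sentence in ["Yeh phone bohat achha hai",
                     "Camera acha nahin hai",
                     "This is not good"]:
        print(sentence, "->", classify(sentence))
```

Because the lexicon mixes Roman Urdu and English entries, the same scoring loop covers code-switched sentences. A realistic system would need far larger lexicons and a more careful treatment of negation scope than the adjacency rule assumed here.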
Index Terms
An Unsupervised Approach for Sentiment Analysis on Social Media Short Text Classification in Roman Urdu