Abstract
Research in natural language processing is conducted mainly on English, and very little of it addresses under-resourced languages such as the Dravidian languages. We present a novel study on detecting offensive language in a corpus of Tamil comments collected from YouTube. The study compares two machine learning approaches, supervised and unsupervised, for detecting offensive patterns in textual communications. In the supervised setup, offensive language detection models were developed using traditional machine learning algorithms (Random Forest, Logistic Regression, Support Vector Machine, and AdaBoost) and assessed against human labels. In the unsupervised setup, we used K-means (K = 2) to cluster the unlabeled data and then trained the same set of algorithms to detect offensive communications. Performance scores indicate that unsupervised clustering was more effective than human labeling: ensemble classifiers achieved accuracies of 99.70% and 99.87% on the balanced and imbalanced datasets, respectively. This suggests that the unsupervised approach can be used effectively to detect offensive language in low-resource languages.
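The unsupervised setup described in the abstract (cluster with K-means at K = 2, then train the four classifiers) can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the TF-IDF features, the linear SVM kernel, the 75/25 split, and the toy `comments` list are all assumptions, since the abstract does not specify them.

```python
from sklearn.cluster import KMeans
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical placeholder comments standing in for the YouTube corpus.
# Two groups with disjoint vocabularies so the toy clustering is stable.
comments = [
    "alpha beta", "beta gamma", "alpha gamma", "alpha beta gamma",
    "gamma alpha beta", "beta alpha", "gamma beta", "alpha gamma beta",
    "delta epsilon", "epsilon zeta", "delta zeta", "delta epsilon zeta",
    "zeta delta epsilon", "epsilon delta", "zeta epsilon", "delta zeta epsilon",
]


def cluster_then_classify(texts, seed=0):
    """Pseudo-label texts with K-means (K = 2), then train classifiers on them."""
    # Assumption: TF-IDF word features; the abstract does not name a representation.
    features = TfidfVectorizer().fit_transform(texts)

    # Step 1: cluster the unlabeled data into two groups.
    pseudo_labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(features)

    # Step 2: treat cluster ids as labels and hold out a test split.
    X_train, X_test, y_train, y_test = train_test_split(
        features, pseudo_labels, test_size=0.25, random_state=seed, stratify=pseudo_labels
    )

    # Step 3: train the four classifier families named in the abstract.
    models = {
        "random_forest": RandomForestClassifier(n_estimators=50, random_state=seed),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "svm": SVC(kernel="linear"),
        "adaboost": AdaBoostClassifier(n_estimators=50, random_state=seed),
    }
    accuracies = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        # Accuracy is measured against the cluster ids, not against human labels.
        accuracies[name] = model.score(X_test, y_test)
    return accuracies, pseudo_labels


if __name__ == "__main__":
    scores, labels = cluster_then_classify(comments)
    for name, acc in scores.items():
        print(f"{name}: {acc:.2f}")
```

In this sketch the classifiers are scored on how well they reproduce the cluster assignments, so which cluster corresponds to "offensive" would still need to be decided separately, for example by inspecting a sample from each cluster.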
Tamil Offensive Language Detection: Supervised versus Unsupervised Learning Approaches