research-article

Tamil Offensive Language Detection: Supervised versus Unsupervised Learning Approaches

Published: 24 March 2023

Abstract

Natural language processing research is conducted mainly on English, and very few studies explore under-resourced languages such as the Dravidian languages. We present a novel study on detecting offensive language in a corpus of Tamil comments collected from YouTube. The study compares two machine learning approaches, supervised and unsupervised, for detecting offensive patterns in textual communications. In the first setup, offensive language detection models were developed using traditional machine learning algorithms (Random Forest, Logistic Regression, Support Vector Machine, and AdaBoost) and assessed against human labels. In the second setup, we used K-means (K = 2) to cluster the unlabeled data before training the same set of machine learning algorithms to detect offensive communications. The performance scores indicate that unsupervised clustering was more effective than human labeling, with ensemble classifiers achieving accuracies of 99.70% and 99.87% on the balanced and imbalanced datasets, respectively. This suggests that the unsupervised approach can be used effectively to detect offensive language in low-resourced languages.
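The second setup described in the abstract (cluster unlabeled comments with K-means, K = 2, then train a classifier on the cluster assignments) can be illustrated with a minimal sketch. This is not the authors' implementation. The toy English comments, the bag-of-words features, the fixed centroid seeds, and the hand-rolled logistic regression (standing in for the Random Forest, Logistic Regression, SVM, and AdaBoost models named in the abstract) are all assumptions made to keep the example self-contained. Cluster ids are arbitrary, so deciding which cluster corresponds to "offensive" would still require inspection.

```python
import math

def bag_of_words(docs):
    """Turn whitespace-tokenised documents into count vectors over a shared vocabulary."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0.0] * len(vocab)
        for tok in doc.split():
            vec[index[tok]] += 1.0
        vectors.append(vec)
    return vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_two(vectors, seeds=(0, -1), max_iter=20):
    """K-means with K = 2; centroids are seeded from two fixed documents (an assumption)."""
    centroids = [list(vectors[seeds[0]]), list(vectors[seeds[1]])]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            0 if sq_dist(v, centroids[0]) <= sq_dist(v, centroids[1]) else 1
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def train_logreg(X, y, lr=0.1, epochs=200):
    """Logistic regression fitted by plain stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - target
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b

def predict(w, b, x):
    """Return the cluster id (0 or 1) predicted for one feature vector."""
    return 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

# Hypothetical toy comments; the study itself uses Tamil YouTube comments.
docs = [
    "good video thanks",
    "nice song thanks",
    "good song nice",
    "bad idiot stupid",
    "stupid idiot video",
    "bad stupid idiot",
]

X = bag_of_words(docs)
pseudo_labels = kmeans_two(X)              # step 1: cluster the unlabeled data
w, b = train_logreg(X, pseudo_labels)      # step 2: train on the cluster labels
predicted = [predict(w, b, x) for x in X]
agreement = sum(p == t for p, t in zip(predicted, pseudo_labels)) / len(docs)
print(pseudo_labels, agreement)
```

Because the classifier in this sketch is scored against the same cluster labels it was trained on, the agreement figure measures how well it reproduces the clustering, not how well it matches human judgments of offensiveness.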



        • Published in

          ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 4
          April 2023
          682 pages
          ISSN: 2375-4699
          EISSN: 2375-4702
          DOI: 10.1145/3588902

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 24 March 2023
          • Online AM: 15 December 2022
          • Accepted: 27 November 2022
          • Revised: 12 September 2022
          • Received: 3 February 2022