Abstract
Online hate speech has become more prevalent on social networking sites with the rapid expansion of mobile computing and Web technologies. Previous research has found that exposure to online hate speech has substantial offline consequences for historically disadvantaged communities. Hate speech can affect any population group, but some groups are more vulnerable than others. Detecting and reporting hateful comments and posts can therefore help to limit their harmful effects, and automated detection of such content has attracted considerable research interest. Existing studies report that machine learning algorithms are efficient at detecting abusive text on social media.

In this research, we applied seven machine learning algorithms, namely Support Vector Machine (SVM), Naïve Bayes (NB), Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Gradient Boosting (GB), and K-Nearest Neighbor (KNN), to detect hate speech, and compared their performance in order to develop an ensemble model. We collected and combined the Tamil-English code-mixed hate speech tweet datasets created for HASOC. The combined dataset contains 35,442 tweets, each labeled as either not offensive or offensive. NB obtained the highest F1 scores for both the offensive and not offensive classes, along with the highest weighted average. SVM, however, obtained the highest accuracy, 80%, under 10-fold cross-validation. Based on these stand-alone results, we developed two ensemble classifiers: a max-voting ensemble and an averaging ensemble. The averaging ensemble reached an accuracy of 90.67%.

These findings are significant because they can serve as a reference model for Tamil-English code-mixed hate speech detection, against which future work using other algorithms can be evaluated in order to identify hateful content more accurately.
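To make the difference between the two ensemble strategies concrete, the following is a minimal sketch, not the implementation used in this study. It assumes each base classifier outputs a probability for each of the two classes; the label strings, the function names `max_voting` and `averaging`, and the example probabilities are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence

# The two classes named in the abstract; the exact label strings are an assumption.
LABELS = ("not_offensive", "offensive")


def max_voting(votes: Sequence[str]) -> str:
    """Return the label predicted by the largest number of base classifiers."""
    counts = Counter(votes)
    # Ties are broken deterministically in favour of the earlier entry in LABELS.
    return max(LABELS, key=lambda label: (counts[label], -LABELS.index(label)))


def averaging(probabilities: Sequence[Dict[str, float]]) -> str:
    """Average the per-class probabilities over all classifiers, then pick the top class."""
    mean = {
        label: sum(p[label] for p in probabilities) / len(probabilities)
        for label in LABELS
    }
    return max(LABELS, key=lambda label: (mean[label], -LABELS.index(label)))


# Hypothetical outputs of three base classifiers for a single tweet.
example_probabilities: List[Dict[str, float]] = [
    {"not_offensive": 0.55, "offensive": 0.45},  # e.g. an SVM-like model
    {"not_offensive": 0.60, "offensive": 0.40},  # e.g. an NB-like model
    {"not_offensive": 0.05, "offensive": 0.95},  # e.g. an RF-like model
]

# Each classifier's hard vote is its most probable class.
hard_votes = [max(LABELS, key=p.get) for p in example_probabilities]

print(max_voting(hard_votes))            # not_offensive (two votes against one)
print(averaging(example_probabilities))  # offensive (mean probability 0.60 vs 0.40)
```

In this example the two ensembles disagree: max-voting counts only each classifier's final decision, whereas averaging lets one highly confident classifier outweigh two weakly confident ones. In scikit-learn, `VotingClassifier` with `voting="hard"` or `voting="soft"` provides comparable behaviour; whether the study used it is not stated in the abstract.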
Index Terms
Development of an Efficient Method to Detect Mixed Social Media Data with Tamil-English Code Using Machine Learning Techniques