Abstract
Social media has been growing and has given the world a platform to opine, debate, display, and discuss like never before. It has a major influence on research areas that analyze human behavior and social groups, and the phenomenon of social interaction is even being used in areas such as the Internet of Things. This constant stream of data connecting individuals and organizations across the globe has had a tremendous impact on the functioning of society and even has the power to sway elections. Despite its numerous benefits, social media has certain issues, such as the prevalence of fake news, which has also led to the rise of hate speech. Because of lax security across social media platforms, these issues persist without repercussions, leading to cyberbullying and defamation and raising grave security concerns. Although some work has been done independently on native scripts, hate speech detection, and code-mixed data, there is a lack of academic research on detecting hate speech in transliterated code-mixed data and in text containing native-language scripts. Research in this field is greatly inhibited by the many variations in grammar and spelling and by a general lack of annotated datasets, especially for native languages. This article proposes a method to automate hate speech detection in code-mixed and native-language text. It presents an architecture containing a TabNet classifier-based model trained on features extracted using MuRIL from transliterated code-mixed textual data. The article also shows that the same model works well on features extracted from text in Devanagari despite being trained on transliterated data.
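The pipeline described in the abstract (MuRIL features fed to a TabNet classifier) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how token embeddings are pooled or which hyperparameters are used, so the masked mean pooling, `max_length`, and `max_epochs` values are placeholders, and the function names `extract_features` and `train_tabnet` are hypothetical. It assumes the `transformers`, `torch`, and `pytorch-tabnet` packages and the public `google/muril-base-cased` checkpoint.

```python
from typing import Sequence

# Public MuRIL checkpoint on the Hugging Face Hub.
MODEL_NAME = "google/muril-base-cased"


def extract_features(texts: Sequence[str], max_length: int = 128):
    """Return one fixed-size MuRIL vector per input text as a NumPy array."""
    # Heavy dependencies are imported lazily so this module loads without them.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModel.from_pretrained(MODEL_NAME)
    model.eval()

    batch = tokenizer(list(texts), padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, tokens, dim)

    # Assumption: masked mean pooling over tokens. The abstract does not
    # state the pooling strategy; using the [CLS] vector is another option.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
    return pooled.cpu().numpy()


def train_tabnet(features, labels, max_epochs: int = 50):
    """Fit a TabNet classifier on pre-extracted feature vectors."""
    import numpy as np
    from pytorch_tabnet.tab_model import TabNetClassifier

    # Assumption: library-default TabNet settings; the abstract gives none.
    clf = TabNetClassifier()
    clf.fit(np.asarray(features, dtype=np.float32), np.asarray(labels),
            max_epochs=max_epochs, drop_last=False)
    return clf
```

Under these assumptions, the cross-script setting from the abstract would correspond to fitting with `train_tabnet(extract_features(transliterated_texts), labels)` and then calling `clf.predict(extract_features(devanagari_texts))`, where both text lists are placeholders for the reader's own data.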
Index Terms
A Framework for Online Hate Speech Detection on Code-mixed Hindi-English Text and Hindi Text in Devanagari