Abstract
Hate speech is a specific type of controversial content that is widely legislated as a crime and that must be identified and blocked. However, due to the sheer volume and velocity of the Twitter data stream, hate speech detection cannot be performed manually. To address this issue, several studies have been conducted on hate speech detection in European languages, whereas little attention has been paid to low-resource South Asian languages, leaving millions of social media users vulnerable. In particular, to the best of our knowledge, no study has been conducted on hate speech detection in Roman Urdu text, which is widely used in the subcontinent. In this study, we scraped more than 90,000 tweets and manually parsed them to identify 5,000 Roman Urdu tweets. Subsequently, we employed an iterative approach to develop guidelines and used them to generate the Hate Speech Roman Urdu 2020 corpus. The tweets in this corpus are classified at three levels: Neutral-Hostile, Simple-Complex, and Offensive-Hate speech. As another contribution, we used five supervised learning techniques, including a deep learning technique, to evaluate and compare their effectiveness for hate speech detection. The results show that Logistic Regression outperformed all other techniques, including the deep learning technique, at two levels of classification, achieving an F1 score of 0.906 for distinguishing between Neutral-Hostile tweets and 0.756 for distinguishing between Offensive-Hate speech tweets.
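The F1 scores reported in the abstract combine precision and recall into a single number. As a minimal sketch of how such a score is computed for one binary level (for example, Neutral-Hostile), the snippet below derives per-class precision, recall, and F1 from gold and predicted labels. The function name `f1_for_class` and the toy labels are assumptions made for illustration only; they do not come from the corpus, and the abstract does not state which averaging scheme (per-class, macro, or weighted) the reported figures use.

```python
def f1_for_class(gold, predicted, positive):
    """Return (precision, recall, F1) for the class named `positive`."""
    pairs = list(zip(gold, predicted))
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels, invented for illustration only (not taken from the corpus).
gold = ["hostile", "neutral", "hostile", "hostile", "neutral", "neutral"]
predicted = ["hostile", "neutral", "neutral", "hostile", "hostile", "neutral"]

precision, recall, f1 = f1_for_class(gold, predicted, positive="hostile")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The same function could be applied separately to each classification level, treating one of the two labels at that level as the positive class.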