research-article

Hate Speech Detection in Roman Urdu

Published: 09 March 2021
Abstract

Hate speech is a specific type of controversial content that is widely legislated as a crime and must be identified and blocked. However, due to the sheer volume and velocity of the Twitter data stream, hate speech detection cannot be performed manually. To address this issue, several studies have been conducted on hate speech detection in European languages, whereas little attention has been paid to low-resource South Asian languages, leaving millions of social media users vulnerable. In particular, to the best of our knowledge, no study has been conducted on hate speech detection in Roman Urdu text, which is widely used in the subcontinent. In this study, we scraped more than 90,000 tweets and manually parsed them to identify 5,000 Roman Urdu tweets. Subsequently, we employed an iterative approach to develop annotation guidelines and used them to generate the Hate Speech Roman Urdu 2020 corpus. The tweets in this corpus are classified at three levels: Neutral-Hostile, Simple-Complex, and Offensive-Hate speech. As another contribution, we used five supervised learning techniques, including a deep learning technique, to evaluate and compare their effectiveness for hate speech detection. The results show that Logistic Regression outperformed all other techniques, including deep learning, at two levels of classification, achieving an F1 score of 0.906 for distinguishing between Neutral-Hostile tweets and 0.756 for distinguishing between Offensive-Hate speech tweets.
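
For readers unfamiliar with the metric reported above, the following is a minimal sketch of how an F1 score for one class can be computed from gold and predicted labels. The function name, the label strings, and the toy data are assumptions made for illustration; they are not taken from the corpus, and the abstract does not state which averaging scheme (per-class, macro, or weighted) the authors used.

```python
def f1_score(gold, predicted, positive):
    """Return the F1 score for one class, treating `positive` as the target label."""
    pairs = list(zip(gold, predicted))
    # Count true positives, false positives, and false negatives for the class.
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        # Precision and recall are both zero (or undefined); report 0.0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy labels invented for illustration; they are not drawn from the corpus.
gold = ["hostile", "neutral", "hostile", "hostile", "neutral", "neutral"]
predicted = ["hostile", "neutral", "neutral", "hostile", "hostile", "neutral"]
score = f1_score(gold, predicted, positive="hostile")
```

Because F1 is the harmonic mean of precision and recall, it penalizes a classifier that achieves one at the expense of the other, which matters when hostile or hate-speech tweets are a minority of the data.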

    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 20, Issue 1
      Special issue on Deep Learning for Low-Resource Natural Language Processing, Part 1 and Regular Papers
      January 2021
      332 pages
      ISSN:2375-4699
      EISSN:2375-4702
      DOI:10.1145/3439335
      Copyright © 2021 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 9 March 2021
      • Revised: 1 July 2020
      • Accepted: 1 July 2020
      • Received: 1 March 2020
      Qualifiers

      • research-article
      • Research
      • Refereed
