Multilingual Offensive Language Identification for Low-resource Languages

Published: 10 November 2021

Abstract

Offensive content is pervasive in social media and a cause for concern to companies and government organizations. Several recent studies have investigated methods to detect the various forms of such content (e.g., hate speech, cyberbullying, and cyberaggression). The clear majority of these studies deal with English, partly because most available annotated datasets contain English data. In this article, we take advantage of available English datasets by applying cross-lingual contextual word embeddings and transfer learning to make predictions in low-resource languages. We project predictions on comparable data in Arabic, Bengali, Danish, Greek, Hindi, Spanish, and Turkish. We report results of 0.8415 macro F1 for Bengali in the TRAC-2 shared task [23], 0.8532 macro F1 for Danish and 0.8701 macro F1 for Greek in OffensEval 2020 [58], 0.8568 macro F1 for Hindi in the HASOC 2019 shared task [27], and 0.7513 macro F1 for Spanish in SemEval-2019 Task 5 (HatEval) [7], showing that our approach compares favorably to the best systems submitted to recent shared tasks on these languages. Additionally, we report competitive performance on Arabic and Turkish using the training and development sets of the OffensEval 2020 shared task. The results for all languages confirm the robustness of cross-lingual contextual embeddings and transfer learning for this task.
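For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how a macro-averaged F1 score can be computed: F1 is calculated separately for each class and the per-class scores are averaged without weighting, so a rare offensive class counts as much as the majority class. The function name `macro_f1`, the label strings, and the toy predictions are illustrative assumptions, not the shared-task evaluation scripts or data.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the true class but was missed
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data, invented for illustration only.
gold = ["OFF", "NOT", "NOT", "OFF", "NOT", "NOT"]
pred = ["OFF", "NOT", "OFF", "NOT", "NOT", "NOT"]
print(round(macro_f1(gold, pred), 4))  # prints 0.625
```

In this toy case the offensive class scores 0.5 and the non-offensive class 0.75, giving a macro average of 0.625, whereas plain accuracy would be 4/6.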

REFERENCES

[1] Agić Željko, Johannsen Anders, Plank Barbara, Alonso Héctor Martínez, Schluter Natalie, and Søgaard Anders. 2016. Multilingual projection for parsing truly low-resource languages. Trans. Assoc. Comput. Ling. 4 (2016), 301–312.
[2] Ahn Hwijeen, Sun Jimin, Park Chan Young, and Seo Jungyun. 2020. NLPDove at SemEval-2020 task 12: Improving offensive language detection with cross-lingual transfer. In Proceedings of SemEval.
[3] Alami Hamza, Alaoui Said Ouatik El, Benlahbib Abdessamad, and En-nahnahi Noureddine. 2020. LISAC FSDM-USMBA team at SemEval 2020 task 12: Overcoming AraBERT’s pretrain-finetune discrepancy for Arabic offensive language identification. In Proceedings of SemEval.
[4] Aroyehun Segun Taofeek and Gelbukh Alexander. 2018. Aggression detection in social media: Using deep neural networks, data augmentation, and pseudo labeling. In Proceedings of TRAC.
[5] Bannink Rienke, Broeren Suzanne, Looij-Jansen Petra M. van de, Waart Frouwkje G. de, and Raat Hein. 2014. Cyber and traditional bullying victimization as a risk factor for mental health problems and suicidal ideation in adolescents. PLoS One 9, 4 (2014).
[6] Bashar Md Abul and Nayak Richi. 2019. QutNocturnal@HASOC’19: CNN for hate speech and offensive content identification in Hindi language. In Proceedings of FIRE.
[7] Basile Valerio, Bosco Cristina, Fersini Elisabetta, Nozza Debora, Patti Viviana, Pardo Francisco Manuel Rangel, Rosso Paolo, and Sanguinetti Manuela. 2019. SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In Proceedings of SemEval.
[8] Bhattacharya Shiladitya, Singh Siddharth, Kumar Ritesh, Bansal Akanksha, Bhagat Akash, Dawer Yogesh, Lahiri Bornini, and Ojha Atul Kr. 2020. Developing a multilingual annotated corpus of misogyny and aggression. In Proceedings of TRAC.
[9] Bonanno Rina A. and Hymel Shelley. 2013. Cyber bullying and internalizing difficulties: Above and beyond the impact of traditional forms of bullying. J. Youth Adolesc. 42, 5 (2013), 685–697.
[10] Burnap Pete and Williams Matthew L. 2015. Cyber hate speech on Twitter: An application of machine classification and statistical modeling for policy and decision making. Polic. Internet 7, 2 (2015), 223–242.
[11] Conneau Alexis, Khandelwal Kartikay, Goyal Naman, Chaudhary Vishrav, Wenzek Guillaume, Guzmán Francisco, Grave Edouard, Ott Myle, Zettlemoyer Luke, and Stoyanov Veselin. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of ACL.
[12] Dadvar Maral, Trieschnigg Dolf, Ordelman Roeland, and Jong Franciska de. 2013. Improving cyberbullying detection with user context. In Advances in Information Retrieval. Springer, 693–696.
[13] Devlin Jacob, Chang Ming-Wei, Lee Kenton, and Toutanova Kristina. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL.
[14] Djuric Nemanja, Zhou Jing, Morris Robin, Grbovic Mihajlo, Radosavljevic Vladan, and Bhamidipati Narayan. 2015. Hate speech detection with comment embeddings. In Proceedings of WWW.
[15] Ghadery Erfan and Moens Marie-Francine. 2020. LIIR at SemEval-2020 task 12: A cross-lingual augmentation approach for multilingual offensive language identification. In Proceedings of SemEval.
[16] Hettiarachchi Hansi and Ranasinghe Tharindu. 2019. Emoji powered capsule network to detect type and target of offensive posts in social media. In Proceedings of RANLP.
[17] Hettiarachchi Hansi and Ranasinghe Tharindu. 2020. BRUMS at SemEval-2020 task 3: Contextualised embeddings for predicting the (graded) effect of context in word similarity. In Proceedings of SemEval.
[18] Hettiarachchi Hansi and Ranasinghe Tharindu. 2020. Infominer at WNUT-2020 task 2: Transformer-based Covid-19 informative tweet extraction. In Proceedings of W-NUT.
[19] Jauhiainen Tommi, Ranasinghe Tharindu, and Zampieri Marcos. 2021. Comparing approaches to Dravidian language identification. In Proceedings of VarDial.
[20] Karthikeyan K., Wang Zihan, Mayhew Stephen, and Roth Dan. 2020. Cross-lingual ability of multilingual BERT: An empirical study. In Proceedings of ICLR.
[21] Kingma Diederik P. and Ba Jimmy. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[22] Kumar Ritesh, Ojha Atul Kr., Malmasi Shervin, and Zampieri Marcos. 2018. Benchmarking aggression identification in social media. In Proceedings of TRAC.
[23] Kumar Ritesh, Ojha Atul Kr., Malmasi Shervin, and Zampieri Marcos. 2020. Evaluating aggression identification in social media. In Proceedings of TRAC.
[24] Majumder Prasenjit, Mandl Thomas, et al. 2018. Filtering aggression from the multilingual social media feed. In Proceedings of TRAC.
[25] Malmasi Shervin and Zampieri Marcos. 2017. Detecting hate speech in social media. In Proceedings of RANLP.
[26] Malmasi Shervin and Zampieri Marcos. 2018. Challenges in discriminating profanity from hate speech. J. Experim. Theoret. Artif. Intell. 30, 2 (2018).
[27] Mandl Thomas, Modha Sandip, Majumder Prasenjit, Patel Daksh, Dave Mohana, Mandlia Chintak, and Patel Aditya. 2019. Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages. In Proceedings of FIRE.
[28] Mubarak Hamdy, Darwish Kareem, and Magdy Walid. 2017. Abusive language detection on Arabic social media. In Proceedings of ALW.
[29] Mubarak Hamdy, Rashed Ammar, Darwish Kareem, Samih Younes, and Abdelali Ahmed. 2020. Arabic offensive language on Twitter: Analysis and experiments. arXiv preprint arXiv:2004.02192.
[30] Pamies Marc, Ohman Emily, Kajava Kaisla, and Tiedemann Jorg. 2020. [email protected] at SemEval-2020 task 12: Multilingual or language-specific BERT? In Proceedings of SemEval.
[31] Pamungkas Endang Wahyu and Patti Viviana. 2019. Cross-domain and cross-lingual abusive language detection: A hybrid approach with deep learning and a multilingual lexicon. In Proceedings of ACL:SRW.
[32] Pelosi Serena, Maisto Alessandro, Vitale Pierluigi, and Vietri Simonetta. 2017. Mining offensive language on social media. In Proceedings of CLiC-it.
[33] Pérez Juan Manuel and Luque Franco M. 2019. Atalaya at SemEval 2019 task 5: Robust embeddings for tweet classification. In Proceedings of SemEval.
[34] Pires Telmo, Schlinger Eva, and Garrette Dan. 2019. How multilingual is multilingual BERT? In Proceedings of ACL.
[35] Pitenis Zeses, Zampieri Marcos, and Ranasinghe Tharindu. 2020. Offensive language identification in Greek. In Proceedings of LREC.
[36] Ranasinghe Tharindu, Gupte Sarthak, Zampieri Marcos, and Nwogu Ifeoma. 2020. WLV-RIT at HASOC-Dravidian-CodeMix-FIRE2020: Offensive language identification in code-switched YouTube comments. In Proceedings of FIRE.
[37] Ranasinghe Tharindu and Hettiarachchi Hansi. 2020. BRUMS at SemEval-2020 task 12: Transformer based multilingual offensive language identification in social media. In Proceedings of SemEval.
[38] Ranasinghe Tharindu and Zampieri Marcos. 2020. Multilingual offensive language identification with cross-lingual embeddings. In Proceedings of EMNLP.
[39] Ranasinghe Tharindu and Zampieri Marcos. 2021. MUDES: Multilingual detection of offensive spans. In Proceedings of NAACL.
[40] Ranasinghe Tharindu, Zampieri Marcos, and Hettiarachchi Hansi. 2019. BRUMS at HASOC 2019: Deep learning models for multilingual hate speech and offensive language identification. In Proceedings of FIRE.
[41] Risch Julian and Krestel Ralf. 2020. Bagging BERT models for robust aggression identification. In Proceedings of TRAC.
[42] Rosa Hugo, Pereira N., Ribeiro Ricardo, Ferreira Paula Costa, Carvalho Joao Paulo, Oliveira S., Coheur Luísa, Paulino Paula, Simão A. M. Veiga, and Trancoso Isabel. 2019. Automatic cyberbullying detection: A systematic review. Comput. Hum. Behav. 93 (2019), 333–345.
[43] Rosenthal Sara, Atanasova Pepa, Karadzhov Georgi, Zampieri Marcos, and Nakov Preslav. 2020. A large-scale semi-supervised dataset for offensive language identification. arXiv preprint arXiv:2004.14454 (2020).
[44] Ross Björn, Rist Michael, Carbonell Guillermo, Cabrera Benjamin, Kurowsky Nils, and Wojatzki Michael. 2016. Measuring the reliability of hate speech annotations: The case of the European refugee crisis. In Proceedings of NLP4CMC.
[45] Röttger Paul, Vidgen Bertram, Nguyen Dong, Waseem Zeerak, Margetts Helen, and Pierrehumbert Janet. 2020. Hatecheck: Functional tests for hate speech detection models. arXiv preprint arXiv:2012.15606 (2020).
[46] Sun Chi, Qiu Xipeng, Xu Yige, and Huang Xuanjing. 2019. How to fine-tune BERT for text classification? In Chinese Computational Linguistics.
[47] Taher Ehsan, Hoseini Seyed Abbas, and Shamsfard Mehrnoush. 2019. Beheshti-NER: Persian named entity recognition using BERT. In Proceedings of NSURL:ICNLSP.
[48] Tapo Allahsera Auguste, Coulibaly Bakary, Diarra Sébastien, Homan Christopher, Kreutzer Julia, Luger Sarah, Nagashima Arthur, Zampieri Marcos, and Leventhal Michael. 2020. Neural machine translation for extremely low-resource African languages: A case study on Bambara. In Proceedings of LowResMT.
[49] Tulkens Stéphan, Hilte Lisa, Lodewyckx Elise, Verhoeven Ben, and Daelemans Walter. 2016. A dictionary-based approach to racism detection in Dutch social media. In Proceedings of TA-COS.
[50] Hee Cynthia Van, Lefever Els, Verhoeven Ben, Mennes Julie, Desmet Bart, Pauw Guy De, Daelemans Walter, and Hoste Veronique. 2015. Automatic detection and prevention of cyberbullying. In Proceedings of HUSO.
[51] Vega Luis Enrique Argota, Reyes-Magaña Jorge Carlos, Gómez-Adorno Helena, and Bel-Enguix Gemma. 2019. MineriaUNAM at SemEval-2019 task 5: Detecting hate speech in Twitter using multiple features in a combinatorial framework. In Proceedings of SemEval.
[52] Wang Shuohuan, Liu Jiaxiang, Ouyang Xuan, and Sun Yu. 2020. Galileo at SemEval-2020 task 12: Multi-lingual learning for offensive language identification using pre-trained language models. In Proceedings of SemEval.
[53] Waseem Zeerak, Davidson Thomas, Warmsley Dana, and Weber Ingmar. 2017. Understanding abuse: A typology of abusive language detection subtasks. In Proceedings of ALW.
[54] Wiegand Michael, Siegel Melanie, and Ruppenhofer Josef. 2018. Overview of the GermEval 2018 shared task on the identification of offensive language. In Proceedings of GermEval.
[55] Xu Jun-Ming, Jun Kwang-Sung, Zhu Xiaojin, and Bellmore Amy. 2012. Learning from bullying traces in social media. In Proceedings of NAACL-HLT.
[56] Zampieri Marcos, Malmasi Shervin, Nakov Preslav, Rosenthal Sara, Farra Noura, and Kumar Ritesh. 2019. Predicting the type and target of offensive posts in social media. In Proceedings of NAACL.
[57] Zampieri Marcos, Malmasi Shervin, Nakov Preslav, Rosenthal Sara, Farra Noura, and Kumar Ritesh. 2019. SemEval-2019 task 6: Identifying and categorizing offensive language in social media (OffensEval). In Proceedings of SemEval.
[58] Zampieri Marcos, Nakov Preslav, Rosenthal Sara, Atanasova Pepa, Karadzhov Georgi, Mubarak Hamdy, Derczynski Leon, Pitenis Zeses, and Çöltekin Çağrı. 2020. SemEval-2020 task 12: Multilingual offensive language identification in social media (OffensEval 2020). In Proceedings of SemEval.

Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 1
January 2022
442 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3494068

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 10 November 2021
• Revised: 1 March 2021
• Accepted: 1 March 2021
• Received: 1 August 2020

Qualifiers

• research-article
• Refereed
