Abstract
Offensive content is pervasive in social media and a cause of concern for companies and government organizations. Several recent studies have investigated methods to detect the various forms of such content (e.g., hate speech, cyberbullying, and cyberaggression). The clear majority of these studies deal with English, partly because most available annotated datasets contain English data. In this article, we take advantage of available English datasets by applying cross-lingual contextual word embeddings and transfer learning to make predictions in low-resource languages. We project predictions on comparable data in Arabic, Bengali, Danish, Greek, Hindi, Spanish, and Turkish. We report results of 0.8415 F1 macro for Bengali in the TRAC-2 shared task [23], 0.8532 F1 macro for Danish and 0.8701 F1 macro for Greek in OffensEval 2020 [58], 0.8568 F1 macro for Hindi in the HASOC 2019 shared task [27], and 0.7513 F1 macro for Spanish in SemEval-2019 Task 5 (HatEval) [7], showing that our approach compares favorably to the best systems submitted to recent shared tasks on these languages. Additionally, we report competitive performance on Arabic and Turkish using the training and development sets of the OffensEval 2020 shared task. The results for all languages confirm the robustness of cross-lingual contextual embeddings and transfer learning for this task.
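All scores above are reported as F1 macro, the unweighted mean of the per-class F1 scores, so the minority (offensive) class counts as much as the majority class. The following is a minimal sketch of that metric in plain Python. The function name `macro_f1`, the label strings, and the toy predictions are illustrative assumptions, not the shared-task scorers or data from the article.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred.

    Assumes gold and pred are non-empty sequences of equal length.
    """
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1   # correct prediction for class g
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was the true class but was missed
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, for illustration only
gold = ["OFF", "NOT", "NOT", "OFF", "NOT", "NOT"]
pred = ["OFF", "NOT", "OFF", "NOT", "NOT", "NOT"]
score = macro_f1(gold, pred)
print(f"macro F1 = {score:.4f}")
```

In this toy case the per-class F1 is 0.5 for `OFF` and 0.75 for `NOT`, so the macro average is lower than plain accuracy (4 of 6 correct), which is why the metric suits imbalanced offensive-language data.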
- [1] 2016. Multilingual projection for parsing truly low-resource languages. Trans. Assoc. Comput. Ling. 4 (2016), 301–312.
- [2] 2020. NLPDove at SemEval-2020 task 12: Improving offensive language detection with cross-lingual transfer. In Proceedings of SemEval.
- [3] 2020. LISAC FSDM-USMBA team at SemEval 2020 task 12: Overcoming AraBERT’s pretrain-finetune discrepancy for Arabic offensive language identification. In Proceedings of SemEval.
- [4] 2018. Aggression detection in social media: Using deep neural networks, data augmentation, and pseudo labeling. In Proceedings of TRAC.
- [5] 2014. Cyber and traditional bullying victimization as a risk factor for mental health problems and suicidal ideation in adolescents. PloS One 9, 4 (2014).
- [6] 2019. QutNocturnal@ HASOC’19: CNN for hate speech and offensive content identification in Hindi language. In Proceedings of FIRE.
- [7] 2019. SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In Proceedings of SemEval.
- [8] 2020. Developing a multilingual annotated corpus of misogyny and aggression. In Proceedings of TRAC.
- [9] 2013. Cyber bullying and internalizing difficulties: Above and beyond the impact of traditional forms of bullying. J. Youth Adolesc. 42, 5 (2013), 685–697.
- [10] 2015. Cyber hate speech on Twitter: An application of machine classification and statistical modeling for policy and decision making. Polic. Internet 7, 2 (2015), 223–242.
- [11] 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of ACL.
- [12] 2013. Improving cyberbullying detection with user context. In Advances in Information Retrieval. Springer, 693–696.
- [13] 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL.
- [14] 2015. Hate speech detection with comment embeddings. In Proceedings of WWW.
- [15] 2020. LIIR at SemEval-2020 task 12: A cross-lingual augmentation approach for multilingual offensive language identification. In Proceedings of SemEval.
- [16] 2019. Emoji powered capsule network to detect type and target of offensive posts in social media. In Proceedings of RANLP.
- [17] 2020. BRUMS at SemEval-2020 task 3: Contextualised embeddings for predicting the (graded) effect of context in word similarity. In Proceedings of SemEval.
- [18] 2020. Infominer at WNUT-2020 task 2: Transformer-based Covid-19 informative tweet extraction. In Proceedings of W-NUT.
- [19] 2021. Comparing approaches to Dravidian language identification. In Proceedings of VarDial.
- [20] 2020. Cross-lingual ability of multilingual BERT: An empirical study. In Proceedings of ICLR.
- [21] 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [22] 2018. Benchmarking aggression identification in social media. In Proceedings of TRAC.
- [23] 2020. Evaluating aggression identification in social media. In Proceedings of TRAC.
- [24] 2018. Filtering aggression from the multilingual social media feed. In Proceedings of TRAC.
- [25] 2017. Detecting hate speech in social media. In Proceedings of RANLP.
- [26] 2018. Challenges in discriminating profanity from hate speech. J. Experim. Theoret. Artif. Intell. 30, 2 (2018).
- [27] 2019. Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages. In Proceedings of FIRE.
- [28] 2017. Abusive language detection on Arabic social media. In Proceedings of ALW.
- [29] 2020. Arabic offensive language on Twitter: Analysis and experiments. arXiv preprint arXiv:2004.02192.
- [30] 2020. [email protected] at SemEval-2020 task 12: Multilingual or language-specific BERT? In Proceedings of SemEval.
- [31] 2019. Cross-domain and cross-lingual abusive language detection: A hybrid approach with deep learning and a multilingual lexicon. In Proceedings of ACL:SRW.
- [32] 2017. Mining offensive language on social media. In Proceedings of CLiC-it.
- [33] 2019. Atalaya at SemEval 2019 task 5: Robust embeddings for tweet classification. In Proceedings of SemEval.
- [34] 2019. How multilingual is multilingual BERT? In Proceedings of ACL.
- [35] 2020. Offensive language identification in Greek. In Proceedings of LREC.
- [36] 2020. WLV-RIT at HASOC-Dravidian-CodeMix-FIRE2020: Offensive language identification in code-switched YouTube comments. In Proceedings of FIRE.
- [37] 2020. BRUMS at SemEval-2020 task 12: Transformer based multilingual offensive language identification in social media. In Proceedings of SemEval.
- [38] 2020. Multilingual offensive language identification with cross-lingual embeddings. In Proceedings of EMNLP.
- [39] 2021. MUDES: Multilingual detection of offensive spans. In Proceedings of NAACL.
- [40] 2019. BRUMS at HASOC 2019: Deep learning models for multilingual hate speech and offensive language identification. In Proceedings of FIRE.
- [41] 2020. Bagging BERT models for robust aggression identification. In Proceedings of TRAC.
- [42] 2019. Automatic cyberbullying detection: A systematic review. Comput. Hum. Behav. 93 (2019), 333–345.
- [43] 2020. A large-scale semi-supervised dataset for offensive language identification. arXiv preprint arXiv:2004.14454.
- [44] 2016. Measuring the reliability of hate speech annotations: The case of the European refugee crisis. In Proceedings of NLP4CMC.
- [45] 2020. HateCheck: Functional tests for hate speech detection models. arXiv preprint arXiv:2012.15606.
- [46] 2019. How to fine-tune BERT for text classification? In Chinese Computational Linguistics.
- [47] 2019. Beheshti-NER: Persian named entity recognition using BERT. In Proceedings of NSURL:ICNLSP.
- [48] 2020. Neural machine translation for extremely low-resource African languages: A case study on Bambara. In Proceedings of LowResMT.
- [49] 2016. A dictionary-based approach to racism detection in Dutch social media. In Proceedings of TA-COS.
- [50] 2015. Automatic detection and prevention of cyberbullying. In Proceedings of HUSO.
- [51] 2019. MineriaUNAM at SemEval-2019 task 5: Detecting hate speech in Twitter using multiple features in a combinatorial framework. In Proceedings of SemEval.
- [52] 2020. Galileo at SemEval-2020 task 12: Multi-lingual learning for offensive language identification using pre-trained language models. In Proceedings of SemEval.
- [53] 2017. Understanding abuse: A typology of abusive language detection subtasks. In Proceedings of ALW.
- [54] 2018. Overview of the GermEval 2018 shared task on the identification of offensive language. In Proceedings of GermEval.
- [55] 2012. Learning from bullying traces in social media. In Proceedings of NAACL-HLT.
- [56] 2019. Predicting the type and target of offensive posts in social media. In Proceedings of NAACL.
- [57] 2019. SemEval-2019 task 6: Identifying and categorizing offensive language in social media (OffensEval). In Proceedings of SemEval.
- [58] 2020. SemEval-2020 task 12: Multilingual offensive language identification in social media (OffensEval 2020). In Proceedings of SemEval.
Index Terms
Multilingual Offensive Language Identification for Low-resource Languages