Abstract
Languages across the world have words, phrases, and behaviors (the taboos) that speakers avoid in public communication because they are considered obscene or disturbing to the social, religious, and ethical values of society. Nevertheless, people deliberately use these linguistic taboos and other language constructs to make hurtful, derogatory, and obscene comments. It is nearly impossible to construct a universal set of offensive or taboo terms because offensiveness is determined by factors such as socio-physical setting, speaker-listener relationship, and word choice. In this article, we present a detailed corpus-based study of offensive language in Nepali. We identify and describe more than 18 categories of linguistic offense, including politics, religion, race, and sex. We discuss 12 common types of euphemism, such as synonym, metaphor, and circumlocution. In addition, we introduce a manually constructed dataset of more than 1,000 offensive and taboo terms popular among contemporary speakers. We describe the first experiments that provide baseline results for detecting offensive language in Nepali. This in-depth study and the accompanying resource provide a foundation for several downstream tasks, such as offensive language detection and language learning.
Index Terms
Linguistic Taboos and Euphemisms in Nepali