Abstract
In today's era of digitization, social media platforms play a significant role in networking and in shaping the perception of the general population. Social networking sites have recently been used to carry out harmful attacks, intentional or otherwise, against individuals, including political and religious figures, intellectuals, sports and movie stars, and other prominent dignitaries. The spread of such content across the general population inevitably contributes to socio-economic and socio-political turmoil, and even to physical violence. By classifying the derogatory content of social media posts, this research helps to curb and discourage the propagation of such hate campaigns. Social networking posts today often combine meme images with textual remarks and comments, which poses new challenges and opportunities for the research community in identifying such attacks. This article proposes a multimodal deep learning framework that uses an ensemble of computer vision and natural language processing techniques to train an encapsulated transformer network for this classification problem. The proposed framework uses fine-tuned state-of-the-art deep learning models (e.g., BERT, Electra) for multilingual text analysis, along with face recognition and optical character recognition (OCR) models for meme image comprehension. For this study, a new Facebook meme-post dataset is created and baseline results are reported. The subject of the dataset and the context of the work are geared toward multilingual Indian society. The findings demonstrate the efficacy of the proposed method in identifying social media meme posts that feature derogatory content about a famous or recognized individual.
A Multimodal Deep Framework for Derogatory Social Media Post Identification of a Recognized Person