Abstract
The unmoderated nature of social media enables the diffusion of hoaxes, which in turn jeopardises the credibility of information gathered from social media platforms. Existing research on automated hoax detection has relied on relatively small datasets, owing to the difficulty of obtaining labelled data. This, in turn, has limited research on early detection of hoaxes and on other factors, such as the effect of training data size or the use of sliding windows. To mitigate this problem, we introduce a semi-automated method that leverages the Wikidata knowledge base to build large-scale datasets for veracity classification, focusing on celebrity death reports. This enables us to create a dataset of 4,007 reports comprising over 13M tweets, 15% of which are fake. Experiments using class-specific representations of word embeddings show that we can achieve F1 scores nearing 72% within 10 minutes of the first tweet being posted when we expand the training data using our semi-automated method. Our dataset represents a realistic scenario with a real distribution of true, commemorative, and false stories, and we release it for further use as a benchmark in future research.
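To make the labelling idea concrete, the following is a minimal sketch of how a death report might be assigned one of the three classes named above once a date of death has been looked up in a knowledge base. It is not the authors' procedure: the function name `label_death_report`, the `tolerance_days` parameter, and the decision rules (no recorded death means false, a report close to the recorded date means true, a later report means commemorative) are all assumptions made for illustration, and the knowledge-base lookup itself is left out.

```python
from datetime import date
from typing import Optional


def label_death_report(
    report_date: date,
    known_death: Optional[date],
    tolerance_days: int = 1,
) -> str:
    """Assign a veracity label to a death report (hypothetical rules).

    known_death is the date of death recorded in a knowledge base,
    or None if no death is recorded for the person.
    """
    # Assumption: no recorded death means the report is a hoax.
    if known_death is None:
        return "false"

    # Days elapsed between the recorded death and the report.
    delta = (report_date - known_death).days

    # Assumption: a report well before the recorded death is also a hoax.
    if delta < -tolerance_days:
        return "false"

    # Assumption: a report within the tolerance window is a true report.
    if delta <= tolerance_days:
        return "true"

    # Assumption: a report long after the death is commemorative.
    return "commemorative"


if __name__ == "__main__":
    print(label_death_report(date(2016, 1, 10), date(2016, 1, 10)))
    print(label_death_report(date(2017, 1, 10), date(2016, 1, 10)))
    print(label_death_report(date(2015, 6, 1), None))
```

In practice the tolerance would need tuning to account for time zones and delays in the knowledge base being updated; the value of one day here is only a placeholder.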