research-article

Early Detection of Social Media Hoaxes at Scale

Published: 18 August 2020

Abstract

The unmoderated nature of social media enables the diffusion of hoaxes, which in turn jeopardises the credibility of information gathered from social media platforms. Existing research on automated detection of hoaxes has been limited to relatively small datasets, owing to the difficulty of obtaining labelled data. This, in turn, has limited research on early detection of hoaxes and on other factors, such as the effect of training-data size or the use of sliding windows. To mitigate this problem, we introduce a semi-automated method that leverages the Wikidata knowledge base to build large-scale datasets for veracity classification, focusing on celebrity death reports. This enables us to create a dataset with 4,007 reports including over 13M tweets, 15% of which are fake. Experiments using class-specific representations of word embeddings show that we can achieve F1 scores nearing 72% within 10 minutes of the first tweet being posted when we expand the training data using our semi-automated method. Our dataset, which we release as a benchmark for future research, represents a realistic scenario with a real distribution of true, commemorative, and false stories.

Published in

ACM Transactions on the Web, Volume 14, Issue 4
November 2020
147 pages
ISSN: 1559-1131
EISSN: 1559-114X
DOI: 10.1145/3414043

Copyright © 2020 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 18 August 2020
• Accepted: 1 June 2020
• Revised: 1 May 2020
• Received: 1 July 2019

Qualifiers

• research-article
• Research
• Refereed
