research-article

Should We Use the Sample? Analyzing Datasets Sampled from Twitter’s Stream API

Published: 02 June 2015

Abstract

Researchers have begun studying content obtained from microblogging services such as Twitter to address a variety of technological, social, and commercial research questions. The large number of Twitter users and even larger volume of tweets often make it impractical to collect and maintain a complete record of activity; therefore, most research and some commercial software applications rely on samples, often relatively small samples, of Twitter data. For the most part, sample sizes have been based on availability and practical considerations. Relatively little attention has been paid to how well these samples represent the underlying stream of Twitter data. To fill this gap, this article compares samples obtained from two of Twitter’s streaming APIs with a more complete Twitter dataset to gain an in-depth understanding of the nature of Twitter data samples and their potential for use in various data mining tasks.

      • Published in

        ACM Transactions on the Web, Volume 9, Issue 3
        June 2015
        187 pages
        ISSN: 1559-1131
        EISSN: 1559-114X
        DOI: 10.1145/2788341

        Copyright © 2015 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 2 June 2015
        • Accepted: 1 March 2015
        • Revised: 1 December 2014
        • Received: 1 July 2013

        Qualifiers

        • research-article
        • Research
        • Refereed
