Abstract
Researchers have begun studying content obtained from microblogging services such as Twitter to address a variety of technological, social, and commercial research questions. The large number of Twitter users and the even larger volume of tweets often make it impractical to collect and maintain a complete record of activity; therefore, most research and some commercial software applications rely on samples of Twitter data, often relatively small ones. For the most part, sample sizes have been based on availability and practical considerations. Relatively little attention has been paid to how well these samples represent the underlying stream of Twitter data. To fill this gap, this article compares samples obtained from two of Twitter’s streaming APIs against a more complete Twitter dataset to gain an in-depth understanding of the nature of Twitter data samples and their potential for use in various data mining tasks.
Should We Use the Sample? Analyzing Datasets Sampled from Twitter’s Stream API