Abstract
The next generation of Internet services is driven by users and user-generated content. The complex nature of user behavior makes it highly challenging to manage and secure online services. On one hand, service providers cannot effectively prevent attackers from creating large numbers of fake identities to disseminate unwanted content (e.g., spam). On the other hand, abusive behavior from real users also poses significant threats (e.g., cyberbullying).
In this article, we propose clickstream models to characterize user behavior in large online services. By analyzing clickstream traces (i.e., sequences of click events from users), we seek to achieve two goals: (1) detection: to capture distinct user groups for the detection of malicious accounts, and (2) understanding: to extract semantic information from user groups to understand the captured behavior. To achieve these goals, we build two related systems. The first is a semi-supervised system to detect malicious user accounts (Sybils). The core idea is to build a clickstream similarity graph in which each node is a user and each edge captures the similarity of two users’ clickstreams. Based on this graph, we propose a coloring scheme to identify groups of malicious accounts without relying on a large labeled dataset. We validate the system using ground-truth clickstream traces of 16,000 real and Sybil users from Renren, a large Chinese social network. The second is an unsupervised system that aims to capture and understand fine-grained user behavior. Instead of performing binary classification (malicious or benign), this model identifies the natural groups of user behavior and automatically extracts features to interpret their semantic meaning. Applying this system to Renren and to another online social network, Whisper (100K users), we help service providers identify unexpected user behaviors and even predict users’ future actions. Both systems received positive feedback from our industrial collaborators, including Renren, LinkedIn, and Whisper, after testing on their internal clickstream data.
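As a rough illustration of the clickstream similarity graph described above, the following Python sketch connects users whose click sequences overlap. The abstract does not specify a similarity metric, so the use of Jaccard similarity over click bigrams is an assumption made for this example, and the function names, event names, and threshold are hypothetical. The coloring scheme is not sketched here.

```python
from itertools import combinations


def ngrams(events, n=2):
    """Return the set of length-n contiguous subsequences of a click sequence."""
    return {tuple(events[i:i + n]) for i in range(len(events) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity_graph(clickstreams, n=2, threshold=0.0):
    """Build a weighted graph: nodes are users, edges carry clickstream similarity.

    Assumption: similarity is Jaccard over click n-grams; an edge is kept
    only when its weight exceeds `threshold`.
    """
    features = {user: ngrams(events, n) for user, events in clickstreams.items()}
    edges = {}
    for u, v in combinations(sorted(features), 2):
        weight = jaccard(features[u], features[v])
        if weight > threshold:
            edges[(u, v)] = weight
    return edges


# Hypothetical click traces, for illustration only.
clickstreams = {
    "alice": ["login", "view_profile", "send_message", "logout"],
    "bob": ["login", "view_profile", "send_message", "logout"],
    "carol": ["login", "upload_photo", "logout"],
}
graph = similarity_graph(clickstreams)
```

In this toy input, the two users with identical traces are linked and the third user, who shares no bigram with them, stays isolated. A graph of this shape is the kind of input on which a grouping or coloring step could then operate.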
Index Terms
Clickstream User Behavior Models