Abstract
Today’s Web provides many different functionalities, including communication, entertainment, social networking, and information retrieval. In this article, we analyze traces of HTTP activity from a large enterprise and from a large university to identify and characterize Web-based service usage. Our work provides an initial methodology for the analysis of Web-based services. While it is nontrivial to identify the classes, instances, and providers for each transaction, our results show that most of the traffic comes from a small subset of providers, which can be classified manually. Furthermore, we assess both qualitatively and quantitatively how the Web has evolved over the past decade, and discuss the implications of these changes.
Characterizing Organizational Use of Web-Based Services: Methodology, Challenges, Observations, and Insights