Abstract
Given only the URL of a Web page, can we identify its topic? We study this problem in detail by exploring a large number of different feature sets and algorithms on several datasets. We also show that the inherent overlap between topics and the sparsity of the information in URLs make this a very challenging problem. Web page classification without a page’s content is desirable when the content is not available at all, when a classification is needed before obtaining the content, or when classification speed is of utmost importance. For our experiments we used five different corpora comprising a total of about 3 million (URL, classification) pairs. We evaluated several techniques for feature generation and several classification algorithms. The individual binary classifiers were then combined via boosting into metabinary classifiers. We achieve typical F-measure values between 80 and 85, and a typical precision of around 86. The precision can be pushed above 90 while maintaining a typical level of recall between 30 and 40.
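To make the idea of generating features from a URL alone more concrete, here is a minimal Python sketch. The abstract does not specify the feature sets, so the token splitting, the character n-gram lengths (4 to 6), and the helper names `url_tokens`, `char_ngrams`, and `f_measure` are all illustrative assumptions rather than the authors' method. The last helper computes the balanced F-measure, the harmonic mean of precision and recall, on a 0 to 1 scale.

```python
import re
from urllib.parse import urlparse


def url_tokens(url):
    """Split host and path of a URL into lowercase alphabetic tokens.

    Assumption: tokens are maximal runs of letters with length >= 2.
    """
    parsed = urlparse(url)
    text = (parsed.netloc + parsed.path).lower()
    return [t for t in re.split(r"[^a-z]+", text) if len(t) >= 2]


def char_ngrams(token, n_min=4, n_max=6):
    """Return all character n-grams of a token for n in [n_min, n_max].

    Assumption: the n-gram lengths are illustrative defaults only.
    """
    return [
        token[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(token) - n + 1)
    ]


def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical example URL, not taken from the paper's corpora.
    tokens = url_tokens("http://www.example.com/sports/football-news.html")
    print(tokens)
    print(char_ngrams("sports"))
    print(f_measure(0.5, 0.5))
```

Features of this kind could be fed to any binary classifier, one per topic. The F-measure helper also shows why pushing precision up at the cost of recall lowers the combined score, since the harmonic mean is pulled toward the smaller of the two values.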
A Comprehensive Study of Features and Algorithms for URL-Based Topic Classification