A Comprehensive Study of Features and Algorithms for URL-Based Topic Classification

Published: 01 July 2011

Abstract

Given only the URL of a Web page, can we identify its topic? We study this problem in detail by exploring a large number of different feature sets and algorithms on several datasets. We also show that the inherent overlap between topics and the sparsity of the information in URLs make this a very challenging problem. Web page classification without a page’s content is desirable when the content is not available at all, when a classification is needed before obtaining the content, or when classification speed is of utmost importance. For our experiments we used five different corpora comprising a total of about 3 million (URL, classification) pairs. We evaluated several techniques for feature generation and several classification algorithms. The individual binary classifiers were then combined via boosting into metabinary classifiers. We achieve typical F-measure values between 80 and 85, and a typical precision of around 86. The precision can be pushed above 90 while maintaining a typical level of recall between 30 and 40.
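The precision and recall trade-off quoted in the abstract can be made concrete with a little arithmetic. The sketch below assumes that the reported F-measure is the balanced F1 score (the harmonic mean of precision and recall) and that all figures are percentages; the abstract states neither explicitly. The function names `f_measure` and `implied_recall` are illustrative and do not come from the paper.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision <= 0 or recall <= 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def implied_recall(precision: float, f1: float) -> float:
    """Recall that yields the given F1 at the given precision.

    Derived by solving F1 = 2PR / (P + R) for R. Only defined when
    2 * precision > f1.
    """
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("no recall value can produce this F1 at this precision")
    return f1 * precision / denominator


if __name__ == "__main__":
    # High-precision operating point from the abstract: precision of 90
    # with recall between 30 and 40 (assumed to be percentages).
    for recall in (30, 40):
        print(f"P=90, R={recall}: F1={f_measure(90, recall):.1f}")

    # Typical operating point: precision around 86, F-measure of 80 to 85.
    for f1 in (80, 85):
        print(f"P=86, F1={f1}: implied R={implied_recall(86, f1):.1f}")
```

Under these assumptions, pushing precision to 90 at a recall of 30 to 40 corresponds to an F1 of roughly 45 to 55, well below the typical 80 to 85, while the typical operating point implies a recall of roughly 75 to 84.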

Published in

ACM Transactions on the Web, Volume 5, Issue 3
July 2011
177 pages
ISSN: 1559-1131
EISSN: 1559-114X
DOI: 10.1145/1993053

Copyright © 2011 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 1 July 2011
• Accepted: 1 December 2010
• Revised: 1 September 2010
• Received: 1 June 2009

Qualifiers

• research-article
• Research
• Refereed
