research-article

A graph-theoretic approach to webpage segmentation

Published: 21 April 2008

ABSTRACT

We consider the problem of segmenting a webpage into visually and semantically cohesive pieces. Our approach formulates an optimization problem on weighted graphs, where the weights capture whether two nodes in the DOM tree should be placed together or apart in the segmentation; we present a framework for learning these weights from manually labeled data in a principled manner. Our work departs significantly from previous heuristic and rule-based solutions to the segmentation problem. Our empirical analysis brings out interesting aspects of the framework, including variants of the optimization problem and the role of learning.
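
To make the idea of "together or apart" weights concrete, the following is a minimal sketch of a pairwise cost in the style of correlation clustering. It is an illustration only: the abstract does not state the exact objective, so the cost function, the node names, and the weight values below are all hypothetical assumptions rather than the authors' formulation.

```python
from itertools import combinations


def segmentation_cost(labels, together_weight):
    """Sum pairwise penalties for a candidate segmentation.

    labels: dict mapping node name -> segment id.
    together_weight: dict mapping frozenset({u, v}) -> value in [0, 1];
        higher means the pair should share a segment.
        (Hypothetical weights; in the paper such weights are learned.)
    """
    cost = 0.0
    for u, v in combinations(sorted(labels), 2):
        w = together_weight[frozenset((u, v))]
        if labels[u] == labels[v]:
            cost += 1.0 - w  # merged a pair that prefers to be apart
        else:
            cost += w        # split a pair that prefers to be together
    return cost


# Hypothetical DOM leaf nodes and hand-picked weights.
nodes = ["nav_link_1", "nav_link_2", "article_p1", "article_p2"]
together_weight = {
    frozenset(("nav_link_1", "nav_link_2")): 0.9,
    frozenset(("article_p1", "article_p2")): 0.8,
    frozenset(("nav_link_1", "article_p1")): 0.1,
    frozenset(("nav_link_1", "article_p2")): 0.1,
    frozenset(("nav_link_2", "article_p1")): 0.2,
    frozenset(("nav_link_2", "article_p2")): 0.1,
}

# Two candidate segmentations: navigation vs. article, or everything together.
two_segments = {"nav_link_1": 0, "nav_link_2": 0, "article_p1": 1, "article_p2": 1}
one_segment = dict.fromkeys(nodes, 0)

cost_two = segmentation_cost(two_segments, together_weight)
cost_one = segmentation_cost(one_segment, together_weight)
print(cost_two < cost_one)
```

Under these assumed weights, separating the navigation links from the article paragraphs costs less than placing all four nodes in one segment. An optimizer would search over labelings for the lowest such cost instead of comparing two hand-written candidates.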

Published in

WWW '08: Proceedings of the 17th international conference on World Wide Web
April 2008
1326 pages
ISBN: 9781605580852
DOI: 10.1145/1367497

Copyright © 2008 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall acceptance rate: 1,899 of 8,196 submissions, 23%