ABSTRACT
We consider the problem of segmenting a webpage into visually and semantically cohesive pieces. Our approach formulates an appropriate optimization problem on weighted graphs, where the weights capture whether two nodes in the DOM tree should be placed together or apart in the segmentation. We present a framework for learning these weights from manually labeled data in a principled manner. Our work is a significant departure from previous heuristic and rule-based solutions to the segmentation problem. The results of our empirical analysis bring out interesting aspects of our framework, including variants of the optimization problem and the role of learning.
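To make the idea of "together or apart" weights concrete, the following is a minimal sketch and not the paper's actual formulation. It assumes each pair of DOM nodes may carry a hypothetical `together` penalty (paid when the pair is split) and a hypothetical `apart` penalty (paid when the pair is merged), and it finds the cheapest segmentation by brute force, which is only feasible for a handful of nodes. The node names and weight values are invented for illustration.

```python
from itertools import combinations


def segmentation_cost(labels, together, apart):
    """Total penalty of a segmentation.

    labels:   dict mapping node -> segment id
    together: dict mapping frozenset({u, v}) -> penalty if u, v are split
    apart:    dict mapping frozenset({u, v}) -> penalty if u, v are merged
    """
    cost = 0.0
    for u, v in combinations(labels, 2):
        pair = frozenset((u, v))
        if labels[u] == labels[v]:
            cost += apart.get(pair, 0.0)
        else:
            cost += together.get(pair, 0.0)
    return cost


def all_segmentations(nodes):
    """Yield every partition of nodes as a node -> segment id dict."""
    nodes = list(nodes)

    def extend(i, labels, used):
        if i == len(nodes):
            yield dict(zip(nodes, labels))
            return
        # A node may join any existing segment or open one new segment.
        for seg in range(used + 1):
            yield from extend(i + 1, labels + [seg], max(used, seg + 1))

    yield from extend(0, [], 0)


def best_segmentation(nodes, together, apart):
    """Exhaustive search; exponential in len(nodes), for illustration only."""
    return min(
        all_segmentations(nodes),
        key=lambda labels: segmentation_cost(labels, together, apart),
    )


# Hypothetical DOM nodes and hand-picked weights (assumptions, not learned).
nodes = ["nav", "title", "body", "footer"]
together = {
    frozenset(("title", "body")): 3.0,
    frozenset(("nav", "footer")): 0.5,
}
apart = {
    frozenset(("nav", "body")): 2.0,
    frozenset(("footer", "body")): 2.0,
    frozenset(("nav", "title")): 1.0,
}

best = best_segmentation(nodes, together, apart)
print(best, segmentation_cost(best, together, apart))
```

In this toy instance the cheapest segmentation groups `title` with `body` and `nav` with `footer`. In the setting the abstract describes, the weights would be learned from manually labeled data rather than set by hand, and the optimization would be solved with a scalable algorithm rather than enumeration.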
A graph-theoretic approach to webpage segmentation