Abstract
Webpage segmentation is a basic building block for a wide range of webpage analysis methods. The rapid development of Web technologies has produced more dynamic and complex webpages, which pose new challenges for segmentation. To improve segmentation performance, we propose a two-stage method that combines the visual, logical, and semantic features of webpage content. In the first stage, we devise a new model that measures the similarity of page elements based on both visual layout and logical organization. In the second stage, we propose a novel block regrouping method that uses semantic statistics and visual positions. This two-stage method can effectively segment complicated and dynamic webpages. We verify its performance and accuracy by comparing it with two existing webpage segmentation methods. The experimental results show that the proposed method significantly outperforms the existing state of the art, with higher precision, recall, and accuracy.
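To make the first-stage idea concrete, the following is a minimal sketch of how a similarity score might blend a visual-layout signal with a logical-organization signal and then group elements. It is not the model proposed in the paper: the `Element` representation, the gap-based visual score, the shared-DOM-prefix logical score, the weight `alpha`, the `scale` constant, and the single-link grouping with a fixed `threshold` are all assumptions chosen for illustration. The second-stage regrouping based on semantic statistics is not sketched here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Element:
    """A rendered page element (hypothetical representation, not from the paper)."""
    box: Tuple[float, float, float, float]  # x, y, width, height in pixels
    dom_path: Tuple[str, ...]               # tag names from the root to the element


def visual_similarity(a: Element, b: Element, scale: float = 100.0) -> float:
    """Map the gap between two boxes to (0, 1]; touching or overlapping boxes score 1."""
    ax, ay, aw, ah = a.box
    bx, by, bw, bh = b.box
    dx = max(bx - (ax + aw), ax - (bx + bw), 0.0)  # horizontal gap, 0 if overlapping
    dy = max(by - (ay + ah), ay - (by + bh), 0.0)  # vertical gap, 0 if overlapping
    gap = (dx * dx + dy * dy) ** 0.5
    return 1.0 / (1.0 + gap / scale)


def logical_similarity(a: Element, b: Element) -> float:
    """Fraction of the longer DOM path that the two elements share as a common prefix."""
    shared = 0
    for tag_a, tag_b in zip(a.dom_path, b.dom_path):
        if tag_a != tag_b:
            break
        shared += 1
    return shared / max(len(a.dom_path), len(b.dom_path), 1)


def similarity(a: Element, b: Element, alpha: float = 0.5) -> float:
    """Weighted blend of the two signals; alpha is an assumed tuning weight."""
    return alpha * visual_similarity(a, b) + (1.0 - alpha) * logical_similarity(a, b)


def group_elements(elements: List[Element], threshold: float = 0.7) -> List[List[int]]:
    """Single-link grouping: join any two elements whose similarity reaches the threshold."""
    parent = list(range(len(elements)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(elements)):
        for j in range(i + 1, len(elements)):
            if similarity(elements[i], elements[j]) >= threshold:
                parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in range(len(elements)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

With these assumed defaults, two adjacent links inside the same `nav` container fall into one block, while a paragraph far down the page under a different container stays separate. In practice the weight and threshold would need to be tuned against labeled segmentations.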
Index Terms
Constructing Novel Block Layouts for Webpage Analysis