Abstract
A Web template is a resource that implements the structure and format of a website, so that content can be plugged into pages that are already formatted and prepared. For this reason, templates are one of the main development resources for website engineers: they increase productivity. Templates are also useful for the final user, because they provide uniformity and a common look and feel across all webpages. From the point of view of crawlers and indexers, however, templates are a significant problem, because they usually contain irrelevant information such as advertisements, menus, and banners. Processing and storing this information wastes resources (storage space, bandwidth, etc.). Templates have been measured to represent between 40% and 50% of the data on the Web. Identifying templates is therefore essential for indexing tasks. Many techniques and tools for template extraction exist, but it is unclear which template extractor a user or system should use, because they have never been compared and because they differ in (complementary) properties such as precision, recall, and efficiency. In this work, we implemented and evaluated five of the most advanced template extractors in the literature. To compare them, we implemented a workbench into which all five have been integrated. Thanks to this workbench, we can provide a fair empirical comparison of all methods using the same benchmarks, technology, implementation language, and evaluation criteria.
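As a rough illustration of the precision and recall criteria mentioned above, the following sketch scores one extractor's output against a hand-labelled gold standard. It is only a minimal example under stated assumptions: the function name `precision_recall_f1`, the use of string node identifiers, and the sample data are hypothetical and are not taken from the workbench described in this work.

```python
def precision_recall_f1(retrieved, relevant):
    """Score the nodes an extractor labelled as template against a gold standard.

    Both arguments are iterables of node identifiers (hypothetical here).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Precision: fraction of the retrieved nodes that really belong to the template.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: fraction of the real template nodes that were retrieved.
    recall = true_positives / len(relevant) if relevant else 0.0
    denominator = precision + recall
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Illustrative data only: gold-standard template nodes vs. one extractor's output.
gold = {"header", "menu", "footer", "ads"}
extracted = {"header", "menu", "article"}
p, r, f1 = precision_recall_f1(extracted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

An extractor that returns few nodes can reach high precision with low recall, and one that returns many nodes can do the opposite, which is why the two measures are usually reported together.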
What Web Template Extractor Should I Use? A Benchmarking and Comparison for Five Template Extractors