What Web Template Extractor Should I Use? A Benchmarking and Comparison for Five Template Extractors

Published: 27 March 2019

Abstract

A Web template is a resource that implements the structure and format of a website, so that content can be plugged into pages that are already formatted and prepared. For this reason, templates are one of the main development resources for website engineers: they increase productivity. Templates are also useful for the final user, because they give all webpages of a site a uniform look and feel. From the point of view of crawlers and indexers, however, templates are a significant problem, because they usually contain irrelevant information such as advertisements, menus, and banners. Processing and storing this information wastes resources (storage space, bandwidth, etc.). It has been measured that templates represent between 40% and 50% of the data on the Web. Therefore, identifying templates is essential for indexing tasks. Many techniques and tools for template extraction exist, but it is not clear which template extractor a user or system should use, because they have never been compared and because they differ in complementary properties such as precision, recall, and efficiency. In this work, we implemented and evaluated five of the most advanced template extractors in the literature. To compare them, we implemented a workbench in which all of them have been integrated and evaluated. Thanks to this workbench, we can provide a fair empirical comparison of all methods using the same benchmarks, technology, implementation language, and evaluation criteria.

