A Pure Visual Approach for Automatically Extracting and Aligning Structured Web Data

Published: 01 November 2019
Abstract

Database-driven websites, and the volume of data stored in their databases, are growing rapidly. Web databases retrieve relevant information in response to users’ queries; the retrieved information is encoded in dynamically generated web pages as structured data records. Identifying and extracting these data records is a fundamental task for many applications, such as competitive intelligence and comparison shopping. The task is challenging because of the complex underlying structure of such web pages and the presence of irrelevant information. Numerous approaches have been introduced to address this problem, but most of them are HTML-dependent solutions that may stop working as HTML continues to evolve. Although a few vision-based techniques have been introduced, various issues limit their performance. To overcome these limitations, we propose a novel, purely visual (i.e., programming-language-independent) approach for automatically extracting structured web data. The proposed approach makes full use of the natural human tendency of visual object perception and the Gestalt laws of grouping. The extraction system performs two tasks: (1) data record extraction, where we apply three Gestalt laws (continuity, proximity, and similarity) to group the adjacently aligned, visually similar data records on a web page; and (2) data item extraction and alignment, where we apply the Gestalt law of similarity to group the visually identical data items. Our experiments on large-scale test sets show that the proposed system is highly effective and outperforms two state-of-the-art vision-based approaches, ViDE and rExtractor. The experiments produce an average F1 score of 86.02% for data record extraction, which is approximately 55% and 36% better than that of ViDE and rExtractor, respectively, and an average F1 score of 86.19% for data item extraction, which is approximately 39% better than that of ViDE.
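
To make the record-extraction idea more concrete, the following is a minimal Python sketch of how the three Gestalt checks (similarity, continuity, proximity) could be combined to group rendered blocks into a data region. It is an illustration only and not the authors' algorithm: the `Block` representation, the function names, and all thresholds (`tol`, `max_offset`, `max_gap`) are hypothetical assumptions, and the sketch handles only vertically stacked records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A rendered page block as a pixel bounding box (hypothetical representation)."""
    x: float
    y: float
    width: float
    height: float


def similar(a, b, tol=0.2):
    """Similarity: widths and heights differ by at most `tol` (relative, assumed)."""
    def close(p, q):
        return abs(p - q) <= tol * max(p, q)
    return close(a.width, b.width) and close(a.height, b.height)


def continuous(a, b, max_offset=5.0):
    """Continuity: left edges line up within `max_offset` pixels (assumed)."""
    return abs(a.x - b.x) <= max_offset


def proximate(a, b, max_gap=30.0):
    """Proximity: b starts at most `max_gap` pixels below a's bottom edge (assumed)."""
    gap = b.y - (a.y + a.height)
    return 0 <= gap <= max_gap


def group_records(blocks):
    """Greedily chain vertically ordered blocks that pass all three checks."""
    groups = []
    for block in sorted(blocks, key=lambda b: b.y):
        if groups:
            last = groups[-1][-1]
            if similar(last, block) and continuous(last, block) and proximate(last, block):
                groups[-1].append(block)
                continue
        groups.append([block])
    # Treat only chains of two or more blocks as candidate data regions.
    return [g for g in groups if len(g) >= 2]


# Toy page layout: a header, three stacked records, and a footer.
blocks = [
    Block(0, 0, 800, 60),
    Block(100, 100, 600, 120),
    Block(100, 230, 600, 118),
    Block(100, 360, 600, 121),
    Block(0, 700, 800, 40),
]
regions = group_records(blocks)
```

On this toy layout the header and footer fail the similarity check against the records, so only the three stacked blocks end up in a single candidate region. A real system would also need visual features beyond geometry (for example fonts, colors, and images) and support for horizontal or grid layouts.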

        • Published in

          ACM Transactions on Internet Technology, Volume 19, Issue 4
          Special Section on Trust and AI and Regular Papers
          November 2019
          201 pages
          ISSN: 1533-5399
          EISSN: 1557-6051
          DOI: 10.1145/3362102
          • Editor: Ling Liu

          Copyright © 2019 ACM

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 1 November 2019
          • Accepted: 1 August 2019
          • Revised: 1 July 2019
          • Received: 1 March 2018
          Qualifiers

          • research-article
          • Research
          • Refereed
