Abstract
The number of database-driven websites and the volume of data stored in their databases are growing rapidly. Web databases retrieve relevant information in response to users' queries; the retrieved information is encoded in dynamically generated web pages as structured data records. Identifying and extracting these data records is a fundamental task for many applications, such as competitive intelligence and comparison shopping. The task is challenging because of the complex underlying structure of such web pages and the presence of irrelevant information. Numerous approaches have been introduced to address this problem, but most of them are HTML-dependent solutions that may stop working as HTML continues to evolve. Although a few vision-based techniques have been introduced, various issues inhibit their performance. To overcome these limitations, we propose a novel, purely visual approach, independent of the underlying programming language, for automatically extracting structured web data. The proposed approach makes full use of the natural human tendency of visual object perception and the Gestalt laws of grouping. The extraction system consists of two tasks: (1) data record extraction, where we apply three of the Gestalt laws (continuity, proximity, and similarity) to group the adjacently aligned, visually similar data records on a web page; and (2) data item extraction and alignment, where we employ the Gestalt law of similarity to group the visually identical data items. Our experiments on large-scale test sets show that the proposed system is highly effective and outperforms the two state-of-the-art vision-based approaches, ViDE and rExtractor.
For data record extraction, the experiments produce an average F1 score of 86.02%, which is approximately 55% and 36% better than that of ViDE and rExtractor, respectively; for data item extraction, they produce an average F1 score of 86.19%, which is approximately 39% better than that of ViDE.
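The record-grouping idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Block` type, the helper names, and every threshold (`tol`, `max_gap`, `ratio`) are assumptions chosen for illustration, and the three Gestalt laws are reduced to simple checks on rendered bounding boxes (left-edge alignment for continuity, vertical gap for proximity, width and height ratio for similarity).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Rendered bounding box of a visual block (hypothetical type).

    Width and height are assumed to be positive.
    """
    x: float
    y: float
    width: float
    height: float


def aligned(a: Block, b: Block, tol: float = 2.0) -> bool:
    # Continuity (simplified): left edges lie on the same vertical line.
    return abs(a.x - b.x) <= tol


def close(a: Block, b: Block, max_gap: float = 20.0) -> bool:
    # Proximity (simplified): b starts shortly below the bottom of a.
    gap = b.y - (a.y + a.height)
    return 0 <= gap <= max_gap


def similar(a: Block, b: Block, ratio: float = 0.8) -> bool:
    # Similarity (simplified): widths and heights are nearly equal.
    w = min(a.width, b.width) / max(a.width, b.width)
    h = min(a.height, b.height) / max(a.height, b.height)
    return w >= ratio and h >= ratio


def group_records(blocks: list[Block]) -> list[list[Block]]:
    """Group vertically stacked blocks that satisfy all three checks."""
    ordered = sorted(blocks, key=lambda b: (b.x, b.y))
    groups: list[list[Block]] = []
    for block in ordered:
        if groups:
            last = groups[-1][-1]
            if aligned(last, block) and close(last, block) and similar(last, block):
                groups[-1].append(block)
                continue
        groups.append([block])
    # A single block is not a repeated pattern, so it is not a record list.
    return [g for g in groups if len(g) > 1]


# Toy page: a header, three stacked result records, and a sidebar.
header = Block(0, -80, 800, 60)
records = [Block(10, 0, 300, 100), Block(10, 110, 300, 100), Block(10, 220, 300, 100)]
sidebar = Block(400, 0, 150, 400)
found = group_records([sidebar, *records, header])
```

On this toy page, only the three stacked records satisfy all three checks, so the header and sidebar are discarded as irrelevant blocks. A real system would also need to handle horizontal and grid layouts, which this sketch does not attempt.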
A Pure Visual Approach for Automatically Extracting and Aligning Structured Web Data