Robust detection of semi-structured web records using a DOM structure-knowledge-driven model

Published: 01 November 2013

Abstract

Web data record extraction aims to extract a set of similar object records from a single webpage. These records share similar attributes or fields and are presented in a regular format within a coherent region of the page. To tackle this problem, most existing methods analyze the DOM tree of the input page. One major limitation of these methods is that they lack a global view when detecting data records, which leads to myopic decisions. In addition, their brute-force search for various types of records reduces flexibility and robustness. We propose a Structure-Knowledge-Oriented Global Analysis (Skoga) framework that performs robust detection of different kinds of data records and record regions. The major component of the Skoga framework is a DOM structure-knowledge-driven detection model that conducts a global analysis of the DOM structure to achieve effective detection. The DOM structure knowledge consists of background knowledge and statistical knowledge, which capture different characteristics of data records and record regions as exhibited in the DOM structure. The background knowledge encodes the semantics of labels that indicate the general constituents of data records and regions. The statistical knowledge is represented by carefully designed features that capture different characteristics of a single node or a node group in the DOM. The feature weights are determined from a development dataset by a parameter estimation algorithm based on a structured output support vector machine. An optimization method based on the divide-and-conquer principle uses the DOM structure knowledge to quantitatively infer and recognize appropriate records and regions for a page. Extensive experiments have been conducted on four datasets. The experimental results demonstrate that our framework achieves higher accuracy than state-of-the-art methods.
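To make the idea of scoring DOM node groups with weighted features more concrete, the following is a minimal toy sketch, not the Skoga model itself. Everything in it is an assumption made for illustration: the `Node` class, the two features in `features`, the hand-set `WEIGHTS` (the abstract states that the real weights are learned with a structured output support vector machine), and the simple recursive search in `best_region`, which only stands in for the divide-and-conquer optimization mentioned above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical, minimal DOM node: a tag name plus child nodes."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def signature(node: Node) -> str:
    """Serialize the tag structure of a subtree, e.g. 'li(a,span)'."""
    if not node.children:
        return node.tag
    inner = ",".join(signature(c) for c in node.children)
    return f"{node.tag}({inner})"


def features(node: Node) -> Tuple[float, float]:
    """Two illustrative features of the group formed by node's children:
    the share of children with the most common signature, and the group size.
    Groups with fewer than two children cannot hold repeated records."""
    if len(node.children) < 2:
        return (0.0, 0.0)
    sigs = [signature(c) for c in node.children]
    most_common = max(sigs.count(s) for s in set(sigs))
    return (most_common / len(sigs), float(len(sigs)))


# Hand-set weights for illustration only; these are not learned values.
WEIGHTS = (1.0, 0.1)


def score(node: Node) -> float:
    """Linear score: weighted sum of the node-group features."""
    return sum(w * f for w, f in zip(WEIGHTS, features(node)))


def best_region(node: Node) -> Tuple[float, Node]:
    """Recursive search: the best region in a subtree is either the node
    itself or the best region found inside one of its child subtrees."""
    best = (score(node), node)
    for child in node.children:
        candidate = best_region(child)
        if candidate[0] > best[0]:
            best = candidate
    return best


def make_record() -> Node:
    """A toy record: a list item holding a link and a text span."""
    return Node("li", [Node("a"), Node("span")])


page = Node("body", [
    Node("div", [Node("h1")]),
    Node("ul", [make_record(), make_record(), make_record()]),
    Node("div", [Node("p")]),
])

region_score, region = best_region(page)
print(region.tag, len(region.children))
```

On this toy page the `ul` element scores highest because its three children share one tag signature, so it is reported as the candidate record region. A learned model would replace both the feature set and the weights.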


Published in

ACM Transactions on the Web, Volume 7, Issue 4
October 2013
220 pages
ISSN: 1559-1131
EISSN: 1559-114X
DOI: 10.1145/2540635

Copyright © 2013 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 1 November 2013
• Accepted: 1 May 2013
• Revised: 1 December 2012
• Received: 1 July 2012

Qualifiers

• research-article
• Research
• Refereed
