Understanding query interfaces by statistical parsing

Published: 29 May 2013

Abstract

Users submit queries to an online database via its query interface. Query interface parsing, which is important for many applications, understands the query capabilities of a query interface. Since most query interfaces are organized hierarchically, we present a novel query interface parsing method, StatParser (Statistical Parser), to automatically extract the hierarchical query capabilities of query interfaces. StatParser automatically learns from a set of parsed query interfaces and parses new query interfaces. StatParser starts from a small grammar and enhances the grammar with a set of probabilities learned from parsed query interfaces under the maximum-entropy principle. Given a new query interface, the probability-enhanced grammar identifies the parse tree with the largest global probability to be the query capabilities of the query interface. Experimental results show that StatParser very accurately extracts the query capabilities and can effectively overcome the problems of existing query interface parsers.
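
As a rough illustration of the last step described in the abstract (choosing the parse tree with the largest global probability under a probability-enhanced grammar), the following minimal sketch runs a Viterbi-style chart parse over a toy grammar. Everything in it is an assumption made for illustration: the symbols `Form` and `Field`, the token types, the rule probabilities, and the function name `best_parse` are hypothetical and are not taken from the paper. In particular, the probabilities here are fixed by hand, whereas the abstract states that StatParser learns them from parsed query interfaces under the maximum-entropy principle.

```python
import math

# Hypothetical binary rules (parent, left child, right child) -> probability.
# The values are invented; probabilities for each parent sum to 1.
RULES = {
    ("Form", "Field", "Form"): 0.3,
    ("Form", "Form", "Field"): 0.2,
    ("Form", "Field", "Field"): 0.5,
    ("Field", "label", "textbox"): 0.7,
    ("Field", "label", "selectlist"): 0.3,
}


def best_parse(tokens, root="Form"):
    """Return (probability, tree) of the most probable parse, or None.

    tokens is a list of form-element types such as "label" or "textbox".
    A tree is a nested tuple (parent, left_subtree, right_subtree); leaves
    are the token strings themselves.
    """
    n = len(tokens)
    # chart[(i, j)] maps a symbol to (log probability, tree) for tokens[i:j].
    chart = {}
    for i, tok in enumerate(tokens):
        chart[(i, i + 1)] = {tok: (0.0, tok)}

    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = {}
            for k in range(i + 1, j):  # try every split point
                left_cell, right_cell = chart[(i, k)], chart[(k, j)]
                for (parent, left, right), prob in RULES.items():
                    if left in left_cell and right in right_cell:
                        score = (math.log(prob)
                                 + left_cell[left][0]
                                 + right_cell[right][0])
                        # Keep only the highest-scoring derivation per symbol.
                        if parent not in cell or score > cell[parent][0]:
                            tree = (parent, left_cell[left][1], right_cell[right][1])
                            cell[parent] = (score, tree)
            chart[(i, j)] = cell

    top = chart.get((0, n), {})
    if root not in top:
        return None
    log_prob, tree = top[root]
    return math.exp(log_prob), tree


if __name__ == "__main__":
    elements = ["label", "textbox", "label", "textbox", "label", "selectlist"]
    print(best_parse(elements))
```

With three fields the toy grammar admits two groupings, and the sketch keeps the one whose rule probabilities multiply to the larger value. Working in log space avoids numeric underflow on longer inputs.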

      • Published in

        ACM Transactions on the Web, Volume 7, Issue 2
        May 2013
        244 pages
        ISSN: 1559-1131
        EISSN: 1559-114X
        DOI: 10.1145/2460383

        Copyright © 2013 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 29 May 2013
        • Accepted: 1 January 2013
        • Revised: 1 October 2012
        • Received: 1 March 2012

        Qualifiers

        • research-article
        • Research
        • Refereed
