Abstract
Users submit queries to an online database through its query interface. Query interface parsing, which is important for many applications, extracts the query capabilities of a query interface. Because most query interfaces are organized hierarchically, we present a novel query interface parsing method, StatParser (Statistical Parser), that automatically extracts the hierarchical query capabilities of query interfaces. StatParser learns from a set of already parsed query interfaces and uses what it learns to parse new ones. It starts from a small grammar and augments that grammar with probabilities learned from the parsed query interfaces under the maximum-entropy principle. Given a new query interface, the probability-enhanced grammar selects the parse tree with the largest global probability as the query capabilities of that interface. Experimental results show that StatParser extracts query capabilities with high accuracy and effectively overcomes the problems of existing query interface parsers.
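To make the idea of selecting the parse tree with the largest global probability concrete, the following is a minimal sketch, not StatParser itself. It assumes a hypothetical toy grammar in Chomsky normal form over form-element tokens, with hand-picked rule probabilities standing in for the ones StatParser would learn under the maximum-entropy principle, and it uses a generic Viterbi-style CKY chart search. The names `LEXICAL`, `BINARY`, and `best_parse` are illustrative only.

```python
from collections import defaultdict

# Hypothetical toy grammar (assumption, not the paper's grammar).
# Probabilities are made up; StatParser would learn them from parsed interfaces.
LEXICAL = {
    ("Label", "text"): 1.0,
    ("Field", "textbox"): 0.6,
    ("Field", "select"): 0.4,
}
BINARY = {
    ("Attr", "Label", "Field"): 1.0,  # an attribute is a label followed by a field
    ("Form", "Attr", "Attr"): 0.5,    # a form holding two attributes
    ("Form", "Attr", "Form"): 0.5,    # a form holding an attribute and more
}


def best_parse(tokens, root="Form"):
    """Return (probability, tree) of the most probable parse, or None."""
    n = len(tokens)
    # chart[(i, j)][symbol] = (probability, tree) of the best derivation
    # of tokens[i:j] rooted at symbol.
    chart = defaultdict(dict)

    # Width-1 spans: apply lexical rules to each token.
    for i, tok in enumerate(tokens):
        for (sym, word), p in LEXICAL.items():
            if word == tok:
                chart[(i, i + 1)][sym] = (p, (sym, tok))

    # Wider spans: combine two adjacent sub-spans with a binary rule and
    # keep only the highest-probability derivation per symbol.
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (parent, left, right), p in BINARY.items():
                    if left in chart[(i, k)] and right in chart[(k, j)]:
                        lp, ltree = chart[(i, k)][left]
                        rp, rtree = chart[(k, j)][right]
                        prob = p * lp * rp
                        best_so_far = chart[(i, j)].get(parent, (0.0, None))[0]
                        if prob > best_so_far:
                            chart[(i, j)][parent] = (prob, (parent, ltree, rtree))

    return chart[(0, n)].get(root)


if __name__ == "__main__":
    result = best_parse(["text", "textbox", "text", "select"])
    print(result)
```

The global probability of a tree here is simply the product of the probabilities of the rules it uses, so the chart keeps one best candidate per symbol and span. A sequence the toy grammar cannot derive, such as a lone `"textbox"`, yields `None`.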