Abstract
Deep web search engines face the formidable challenge of retrieving high-quality results from a vast collection of searchable databases. Deep web search is a two-step process: selecting high-quality sources, then ranking the results from the selected sources. Existing methods for both steps assess the relevance of sources and results using query-result similarity. When applied to the deep web, these methods have two deficiencies. First, they are agnostic to the correctness (trustworthiness) of the results. Second, query-based relevance does not consider the importance of the results and sources. Both considerations are essential for the deep web and for open collections in general. Since a number of deep web sources provide answers to any query, we conjecture that the agreement between these answers is helpful in assessing the importance and trustworthiness of the sources and the results. For assessing source quality, we compute the agreement between sources as the agreement of the answers they return. While computing the agreement, we also measure and compensate for possible collusion between the sources. This adjusted agreement is modeled as a graph with the sources as vertices. On this agreement graph, the quality score of a source, which we call SourceRank, is calculated as the stationary visit probability of a random walk. For ranking results, we analyze the second-order agreement between the results. Extending SourceRank to multidomain search, we propose a source ranking that is sensitive to the query domain: multiple domain-specific rankings of a source are computed, and these are combined into the final ranking. We perform extensive evaluations on online sources and on hundreds of Google Base sources spanning multiple domains. The proposed result and source rankings are implemented in the deep web search engine Factal. We demonstrate that the agreement analysis tracks source corruption. Further, our relevance evaluations show that our methods improve precision significantly over Google Base and the other baseline methods. The result ranking and the domain-specific source ranking are evaluated separately.
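The idea of scoring a source by the stationary visit probability of a random walk on the agreement graph can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how agreement weights are computed, how collusion is compensated, or how the walk is defined. The function name `stationary_scores`, the uniform-restart (damping) scheme, the damping value 0.85, and the toy weights below are all assumptions made for illustration.

```python
def stationary_scores(agreement, damping=0.85, tol=1e-10, max_iter=1000):
    """Stationary visit probabilities of a random walk with uniform restarts.

    agreement[i][j] is a non-negative weight on the edge from source i to
    source j (hypothetical values; the paper's agreement measure is not
    reproduced here).
    """
    n = len(agreement)

    # Row-normalize the weights into transition probabilities. A source with
    # no outgoing agreement jumps to a uniformly random source (assumption).
    transition = []
    for row in agreement:
        total = sum(row)
        if total > 0:
            transition.append([w / total for w in row])
        else:
            transition.append([1.0 / n] * n)

    # Power iteration: repeatedly apply one step of the walk until the
    # visit probabilities stop changing.
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        new_scores = [
            (1.0 - damping) / n
            + damping * sum(scores[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
        change = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if change < tol:
            break
    return scores


# Toy agreement graph (made-up weights): sources 0 and 1 agree strongly with
# each other, and neither agrees much with source 2.
toy_agreement = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.1],
    [0.5, 0.5, 0.0],
]
scores = stationary_scores(toy_agreement)
```

In this toy graph the two mutually agreeing sources receive equal scores, and both outrank the source that few others agree with. The scores sum to one because they form a probability distribution over sources.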
Index Terms
Assessing relevance and trust of the deep web sources and results based on inter-source agreement