Assessing relevance and trust of the deep web sources and results based on inter-source agreement

Published: 29 May 2013

Abstract

Deep web search engines face the formidable challenge of retrieving high-quality results from the vast collection of searchable databases. Deep web search is a two-step process: selecting the high-quality sources, then ranking the results from the selected sources. Although methods exist for both steps, they assess the relevance of the sources and the results using query-result similarity. When applied to the deep web, these methods have two deficiencies. First, they are agnostic to the correctness (trustworthiness) of the results. Second, query-based relevance does not consider the importance of the results and sources. These two considerations are essential for the deep web and for open collections in general. Since a number of deep web sources provide answers to any query, we conjecture that the agreements between these answers are helpful in assessing the importance and the trustworthiness of the sources and the results. For assessing source quality, we compute the agreement between the sources as the agreement of the answers they return. While computing the agreement, we also measure and compensate for possible collusion between the sources. This adjusted agreement is modeled as a graph with the sources as vertices. On this agreement graph, the quality score of a source, which we call SourceRank, is calculated as the stationary visit probability of a random walk. For ranking results, we analyze the second-order agreement between the results. Further extending SourceRank to multidomain search, we propose a source ranking sensitive to the query domains. Multiple domain-specific rankings of a source are computed, and these ranks are combined for the final ranking. We perform extensive evaluations on online sources and on hundreds of Google Base sources spanning multiple domains. The proposed result and source rankings are implemented in the deep web search engine Factal. We demonstrate that the agreement analysis tracks source corruption. Further, our relevance evaluations show that our methods improve precision significantly over Google Base and the other baseline methods. The result ranking and the domain-specific source ranking are evaluated separately.
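
To make the idea of a stationary visit probability on an agreement graph concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the agreement between sources has already been computed and adjusted for collusion, and that it is given as a matrix of non-negative edge weights. The function name `source_rank`, the `damping` parameter (a PageRank-style smoothing term), and the example weights are illustrative assumptions that do not come from the abstract.

```python
def source_rank(agreement, damping=0.85, tol=1e-10, max_iter=1000):
    """Approximate the stationary visit probability of a random walk.

    agreement[i][j] is an assumed non-negative weight on the edge from
    source i to source j (how strongly the answers of j are supported by i).
    The damping value is a hypothetical smoothing choice, not a value
    taken from the abstract.
    """
    n = len(agreement)

    # Turn each row of weights into transition probabilities.
    # A source with no outgoing agreement jumps uniformly at random.
    transition = []
    for row in agreement:
        total = sum(row)
        if total > 0:
            transition.append([w / total for w in row])
        else:
            transition.append([1.0 / n] * n)

    # Power iteration: repeatedly push probability mass along the edges
    # until the distribution stops changing.
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        new_rank = [(1.0 - damping) / n] * n
        for i in range(n):
            for j in range(n):
                new_rank[j] += damping * rank[i] * transition[i][j]
        if sum(abs(a - b) for a, b in zip(new_rank, rank)) < tol:
            return new_rank
        rank = new_rank
    return rank


# Hypothetical example: sources 0 and 1 agree strongly with each other,
# while source 2 receives little agreement from either.
example_agreement = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.1],
    [0.5, 0.5, 0.0],
]
ranks = source_rank(example_agreement)
```

In this toy graph, the two mutually agreeing sources end up with equal scores, and the source that attracts little agreement scores lower. Under the same assumptions, a domain-sensitive variant could run the walk once per domain on a domain-specific agreement matrix and then combine the resulting scores.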
