Topic Distillation with Query-Dependent Link Connections and Page Characteristics

Published: 01 May 2011

Abstract

Searchers on the Web often aim to find key resources about a topic. Finding such resources is called topic distillation. Previous research has shown that sources of evidence such as page indegree and URL structure can significantly improve search performance on interconnected collections such as the Web, beyond what simple term distribution statistics achieve. This article presents a new approach to improving topic distillation by exploring two external sources of evidence: link structure, including query-dependent indegree and outdegree; and web page characteristics, such as the density of anchor links.
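As an illustration of a page characteristic such as anchor link density, the following is a minimal sketch. It assumes density is measured as the fraction of a page's non-whitespace text characters that fall inside `<a>` elements; this definition, and the names `anchor_density` and `_AnchorTextCounter`, are assumptions made for the example and may differ from the measure used in the article.

```python
from html.parser import HTMLParser


class _AnchorTextCounter(HTMLParser):
    """Counts text characters inside and outside <a> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0   # > 0 while the parser is inside an <a> element
        self.anchor_chars = 0   # non-whitespace characters inside anchors
        self.total_chars = 0    # non-whitespace characters on the whole page

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace so that layout differences do not affect the ratio.
        n = len("".join(data.split()))
        self.total_chars += n
        if self.anchor_depth > 0:
            self.anchor_chars += n


def anchor_density(html):
    """Return the share of text characters that sit inside anchors (assumed definition)."""
    counter = _AnchorTextCounter()
    counter.feed(html)
    counter.close()
    if counter.total_chars == 0:
        return 0.0
    return counter.anchor_chars / counter.total_chars


if __name__ == "__main__":
    page = '<p>abcd <a href="x">efgh</a></p>'
    print(anchor_density(page))
```

Under this assumed definition, a page that consists mostly of links (for example a hub or site map) scores close to 1, while a page of running prose scores close to 0.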

Our experiments with the TREC .GOV collection, an 18GB crawl of the US .gov domain from 2002, show that using such evidence can significantly improve search effectiveness, with combinations of evidence leading to significant performance gains over both full-text and anchor-text baselines. Moreover, we demonstrate that, at different scope levels, both local query-dependent outdegree and query-dependent indegree outperformed their global query-independent counterparts, and that, at the same scope level, outdegree outperformed indegree. Adding query-dependent indegree or page characteristics to query-dependent outdegree yielded a small, but not significant, improvement.
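The distinction between global query-independent and local query-dependent degree can be sketched as follows. The sketch assumes that the query-dependent variant counts only links whose source and target both appear in the set of pages retrieved for the query; that scoping rule, and the names `degrees`, `links`, and `scope`, are assumptions for illustration and are not taken from the article.

```python
def degrees(links, scope=None):
    """Compute indegree and outdegree per page.

    links: iterable of (source, target) page-identifier pairs.
    scope: optional set of page identifiers, e.g. the pages retrieved for a
           query. If given, only links with both endpoints in scope are
           counted (assumed definition of query-dependent degree). If None,
           every link is counted (global, query-independent degree).
    """
    indegree, outdegree = {}, {}
    for src, dst in set(links):  # duplicate links are counted once
        if src == dst:
            continue  # ignore self-links
        if scope is not None and (src not in scope or dst not in scope):
            continue
        outdegree[src] = outdegree.get(src, 0) + 1
        indegree[dst] = indegree.get(dst, 0) + 1
    return indegree, outdegree


if __name__ == "__main__":
    # Toy link graph; page identifiers are hypothetical.
    links = [("a", "b"), ("c", "b"), ("b", "d"), ("a", "d")]

    global_in, global_out = degrees(links)
    local_in, local_out = degrees(links, scope={"a", "b", "d"})

    print("global indegree:", global_in)
    print("local indegree: ", local_in)
```

In the toy graph, page `b` has a global indegree of 2, but only 1 once the link from `c`, which lies outside the assumed result set, is excluded.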
