Abstract
Searchers on the Web often aim to find key resources about a topic. Finding such results is called topic distillation. Previous research has shown that sources of evidence such as page indegree and URL structure can significantly improve search performance on interconnected collections such as the Web, beyond what simple term distribution statistics achieve. This article presents a new approach to improving topic distillation by exploring two external sources of evidence: link structure, including query-dependent indegree and outdegree; and web page characteristics, such as the density of anchor links.
Our experiments with the TREC .GOV collection, an 18GB crawl of the US .gov domain from 2002, show that using such evidence can significantly improve search effectiveness, with combinations of evidence leading to significant performance gains over both full-text and anchor-text baselines. Moreover, we demonstrate that, when compared across scope levels, both local query-dependent outdegree and query-dependent indegree outperformed their global query-independent counterparts; and at the same scope level, outdegree outperformed indegree. Adding query-dependent indegree or page characteristics to query-dependent outdegree yielded a small, but not significant, improvement.
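The contrast between global (query-independent) and local (query-dependent) degree counts can be illustrated with a minimal sketch. The abstract does not give the article's exact definitions, so this rests on one assumption: a query-dependent count considers only links whose source and target both belong to the set of pages retrieved for the query. The function name `degree_counts`, the toy link graph, and the retrieved set are all hypothetical.

```python
from collections import defaultdict


def degree_counts(links, scope=None):
    """Count indegree and outdegree per page.

    links: dict mapping a page to the pages it links to.
    scope: optional set of pages. When given, only links whose source
           and target are both in scope are counted (an assumed
           reading of "query-dependent", not the article's definition).
    """
    indegree = defaultdict(int)
    outdegree = defaultdict(int)
    for source, targets in links.items():
        if scope is not None and source not in scope:
            continue
        for target in set(targets):  # count each distinct link once
            if target == source:
                continue  # ignore self-links
            if scope is not None and target not in scope:
                continue
            outdegree[source] += 1
            indegree[target] += 1
    return dict(indegree), dict(outdegree)


# Toy link graph with hypothetical page identifiers.
links = {
    "a": ["b", "c", "d"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
    "e": ["c", "a"],
}
# Pages assumed to have been retrieved for some query.
retrieved = {"a", "b", "c"}

global_in, global_out = degree_counts(links)
local_in, local_out = degree_counts(links, scope=retrieved)

print("global indegree: ", global_in)
print("local indegree:  ", local_in)
print("global outdegree:", global_out)
print("local outdegree: ", local_out)
```

In this toy graph, page `c` has a global indegree of 4 but a local indegree of 2, because two of its in-links come from pages outside the retrieved set. How such counts are turned into ranking scores and combined with text evidence is not specified in the abstract and is left out of the sketch.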
Index Terms
Topic Distillation with Query-Dependent Link Connections and Page Characteristics