Mining Historic Query Trails to Label Long and Rare Search Engine Queries

Published: 01 September 2010

Abstract

Web search engines can perform poorly for long queries (i.e., those containing four or more terms), in part because of their high level of query specificity. The automatic assignment of labels to long queries can capture aspects of a user’s search intent that may not be apparent from the terms in the query. This affords search result matching or reranking based on queries and labels rather than the query text alone. Query labels can be derived from interaction logs generated from many users’ search result clicks or from query trails comprising the chain of URLs visited following query submission. However, since long queries are typically rare, they are difficult to label in this way because little or no historic log data exists for them. A subset of these queries may be amenable to labeling by detecting similarities between parts of a long and rare query and the queries which appear in logs. In this article, we present a comparison of four similarity algorithms for the automatic assignment of Open Directory Project category labels to long and rare queries, based solely on matching against similar satisfied query trails extracted from log data. Our findings show that although the similarity-matching algorithms we investigated have tradeoffs in terms of coverage and accuracy, one algorithm that bases similarity on a popular search result ranking function (effectively regarding potentially similar queries as “documents”) outperforms the others. We find that it is possible to correctly predict the top label better than one in five times, even when no past query trail exactly matches the long and rare query. We show that these labels can be used to reorder top-ranked search results, leading to a significant improvement in retrieval performance over baselines that do not utilize query labeling, but instead rank results using content-matching or click-through logs. The outcomes of our research have implications for search providers attempting to provide users with highly relevant search results for long queries.
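
The idea of treating logged queries as "documents" and ranking them against a long query can be illustrated with a minimal sketch. The abstract does not name the ranking function, so the use of Okapi BM25 here is an assumption, as are the function names (`bm25_scores`, `label_query`), the parameter defaults, the score-weighted voting scheme, and the toy log with its category labels. It is not the article's implementation.

```python
import math
from collections import Counter, defaultdict


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized 'document' (here: a logged query) with Okapi BM25.

    BM25 is an assumed stand-in for the unnamed ranking function.
    """
    n = len(docs)
    if n == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: number of logged queries containing each term.
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in set(query_terms):
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


def label_query(long_query, logged, top_k=3):
    """Assign labels to a long query from its most similar logged queries.

    logged: list of (query_text, {label: weight}) pairs; in this sketch the
    weights are hypothetical label distributions derived from past trails.
    Returns (label, vote) pairs sorted by descending vote.
    """
    docs = [q.lower().split() for q, _ in logged]
    scores = bm25_scores(long_query.lower().split(), docs)
    ranked = sorted(range(len(logged)), key=lambda i: scores[i], reverse=True)
    votes = defaultdict(float)
    for i in ranked[:top_k]:
        if scores[i] <= 0:
            continue  # no term overlap, so this logged query casts no vote
        for label, weight in logged[i][1].items():
            votes[label] += scores[i] * weight
    return sorted(votes.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical log entries and labels, for illustration only.
logged = [
    ("cheap flights london", {"Recreation/Travel": 1.0}),
    ("python list sort", {"Computers/Programming": 1.0}),
    ("london hotel deals", {"Recreation/Travel": 0.8, "Business/Hospitality": 0.2}),
]
labels = label_query("cheap last minute flights from london to rome", logged)
print(labels[0][0])
```

A long query that shares no terms with any logged query receives no labels in this sketch, which mirrors the coverage versus accuracy tradeoff mentioned in the abstract: looser matching labels more queries, but with less reliable labels.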

    • Published in

      ACM Transactions on the Web  Volume 4, Issue 4
      September 2010
      173 pages
      ISSN: 1559-1131
      EISSN: 1559-114X
      DOI: 10.1145/1841909
      Copyright © 2010 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 1 September 2010
      • Accepted: 1 May 2010
      • Revised: 1 April 2010
      • Received: 1 June 2009

      Qualifiers

      • research-article
      • Research
      • Refereed
