Exploiting External Collections for Query Expansion

Published: 01 November 2012

Abstract

A persisting challenge in the field of information retrieval is the vocabulary mismatch between a user’s information need and the relevant documents. One way of addressing this issue is to apply query modeling: to add terms to the original query and reweigh the terms. In social media, where documents usually contain creative and noisy language (e.g., spelling and grammatical errors), query modeling proves difficult. To address this, attempts to use external sources for query modeling have been made and seem to be successful. In this article we propose a general generative query expansion model that uses external document collections for term generation: the External Expansion Model (EEM). The main rationale behind our model is our hypothesis that each query requires its own mixture of external collections for expansion and that an expansion model should account for this. For some queries we expect, for example, a news collection to be most beneficial, while for other queries we could benefit more by selecting terms from a general encyclopedia. EEM allows for query-dependent weighing of the external collections.

We put our model to the test on the task of blog post retrieval and we use four external collections in our experiments: (i) a news collection, (ii) a Web collection, (iii) Wikipedia, and (iv) a blog post collection. Experiments show that EEM outperforms query expansion on the individual collections, as well as the Mixture of Relevance Models that was previously proposed by Diaz and Metzler [2006]. Extensive analysis of the results shows that our naive approach to estimating query-dependent collection importance works reasonably well and that, when we use “oracle” settings, we see the full potential of our model. We also find that the query-dependent collection importance has more impact on retrieval performance than the independent collection importance (i.e., a collection prior).
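The idea of query-dependent collection weighting can be illustrated with a minimal sketch. This is not the estimation procedure of EEM, which the abstract does not spell out; it only assumes that each external collection yields a term distribution for the query and that these are combined linearly with per-query collection weights. The function name `mix_expansion_terms`, the collection names, and all probabilities below are hypothetical toy values.

```python
from collections import defaultdict


def mix_expansion_terms(term_probs_by_collection, collection_weights, top_k=5):
    """Combine per-collection term distributions into one expansion ranking.

    term_probs_by_collection: {collection: {term: P(term | query, collection)}}
    collection_weights: {collection: non-negative, query-dependent importance}
    Both inputs are assumed to be given; how to estimate them is out of scope.
    """
    total = sum(collection_weights.values())
    if total <= 0:
        raise ValueError("collection weights must sum to a positive value")

    mixed = defaultdict(float)
    for collection, term_probs in term_probs_by_collection.items():
        # Normalise so the collection weights form a distribution for this query.
        weight = collection_weights.get(collection, 0.0) / total
        for term, prob in term_probs.items():
            mixed[term] += weight * prob

    # Highest probability first; ties broken alphabetically for determinism.
    ranked = sorted(mixed.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Toy example: a query for which the news collection is assumed to matter most.
term_probs = {
    "news": {"election": 0.6, "vote": 0.4},
    "wikipedia": {"election": 0.2, "democracy": 0.8},
}
weights = {"news": 0.75, "wikipedia": 0.25}
ranked = mix_expansion_terms(term_probs, weights)
print(ranked)
```

With a different query, the same function would be called with different weights, which is the sense in which the collection importance is query-dependent rather than a fixed collection prior.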

References

  1. Amati, G., Carpineto, C., and Romano, G. 2004. Query difficulty, robustness, and selective application of query expansion. In Proceedings of the 26th European Conference on IR Research (ECIR’04). Lecture Notes in Computer Science Series, vol. 2997. Springer, 127--137.
  2. Aquaint-2. 2007. http://trec.nist.gov/data/qa/2007_qadata/qa.07.guidelines.html#documents.
  3. Arguello, J., Elsas, J., Callan, J., and Carbonell, J. 2008. Document representation and query expansion models for blog recommendation. In Proceedings of the 2nd International Conference on Weblogs and Social Media (ICWSM’08). AAAI Press.
  4. Baeza-Yates, R. and Ribeiro-Neto, B. 2011. Modern Information Retrieval: The Concepts and Technology behind Search. Addison-Wesley Professional.
  5. Balog, K., Weerkamp, W., and de Rijke, M. 2008. A few examples go a long way: Constructing query models from elaborate query formulations. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’08). ACM, New York, 371--378.
  6. Balog, K., Meij, E., Weerkamp, W., He, J., and de Rijke, M. 2009. The University of Amsterdam at TREC 2008: Blog, enterprise, and relevance feedback. In Proceedings of the 17th Text REtrieval Conference (TREC’08). NIST.
  7. Cao, G., Nie, J.-Y., Gao, J., and Robertson, S. 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’08). ACM, New York, 243--250.
  8. Cartright, M.-A., Seo, J., and Lease, M. 2009. UMass Amherst and UT Austin @ the TREC 2009 relevance feedback track. In Proceedings of the 18th Text REtrieval Conference (TREC’09). NIST.
  9. ClueWeb09. 2009. http://www.lemurproject.org/clueweb09.php/.
  10. Cronen-Townsend, S., Zhou, Y., and Croft, W. B. 2004. A framework for selective query expansion. In Proceedings of the 13th ACM International Conference on Information and Knowledge Management (CIKM’04). ACM, New York, 236--237.
  11. Diaz, F. and Metzler, D. 2006. Improving the estimation of relevance models using large external corpora. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’06). ACM, New York, 154--161.
  12. Dong, A., Chang, Y., Zheng, Z., Mishne, G., Bai, J., Zhang, R., Buchner, K., Liao, C., and Diaz, F. 2010. Towards recency ranking in web search. In Proceedings of the 3rd ACM International Conference on Web Search and Data Mining (WSDM’10). ACM, New York, 11--20.
  13. Elsas, J., Arguello, J., Callan, J., and Carbonell, J. 2008a. Retrieval and feedback models for blog distillation. In Proceedings of the 16th Text REtrieval Conference (TREC’07). NIST.
  14. Elsas, J. L., Arguello, J., Callan, J., and Carbonell, J. G. 2008b. Retrieval and feedback models for blog feed search. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’08). ACM, New York, 347--354.
  15. Ernsting, B. J., Weerkamp, W., and de Rijke, M. 2008. The University of Amsterdam at the TREC 2007 blog track. In Proceedings of the 16th Text REtrieval Conference (TREC’07). NIST.
  16. Fautsch, C. and Savoy, J. 2009. UniNE at TREC 2008: Fact and opinion retrieval in the blogsphere. In Proceedings of the 17th Text REtrieval Conference (TREC’08). NIST.
  17. Hawking, D., Bailey, P., and Craswell, N. 1999. ACSys TREC-8 experiments. In Proceedings of the 8th Text REtrieval Conference (TREC’99). NIST.
  18. He, B. and Ounis, I. 2007. Combining fields for query expansion and adaptive query expansion. Inf. Process. Manag. 43, 5, 1294--1307.
  19. He, B. and Ounis, I. 2009. Finding good feedback documents. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM’09). ACM, New York, 2011--2014.
  20. Hiemstra, D. 2001. Using language models for information retrieval. Ph.D. thesis, University of Twente.
  21. Hofmann, K. and Weerkamp, W. 2008. Content extraction for information retrieval in blogs and intranets. Tech. rep., University of Amsterdam, ISLA.
  22. Java, A., Kolari, P., Finin, T., Joshi, A., and Martineau, J. 2007. The blogvox opinion retrieval system. In Proceedings of the 15th Text REtrieval Conference (TREC’06). NIST.
  23. Jijkoun, V., de Rijke, M., and Weerkamp, W. 2010. Generating focused topic-specific sentiment lexicons. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 585--594.
  24. Kamps, J., de Rijke, M., and Sigurbjörnsson, B. 2004. Length normalization in XML retrieval. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’04). ACM, New York, 80--87.
  25. Kurland, O., Lee, L., and Domshlak, C. 2005. Better than the real thing?: Iterative pseudo-query processing using cluster-based language models. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’05). ACM, New York, 19--26.
  26. Kwak, H., Lee, C., Park, H., and Moon, S. 2010. What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web (WWW’10). ACM, New York, 591--600.
  27. Kwok, K. L., Grunfeld, L., Dinstl, N., and Chan, M. 2001. TREC-9 cross language, web and question-answering track experiments using PIRCS. In Proceedings of the 9th Text REtrieval Conference (TREC-9). NIST.
  28. Lafferty, J. and Zhai, C. 2003. Probabilistic relevance models based on document and query generation. In Language Modeling for Information Retrieval, Kluwer International Series on Information Retrieval, Springer.
  29. Lancaster, F. 1968. Information Retrieval Systems: Characteristics, Testing and Evaluation. John Wiley & Sons, New York.
  30. Lavrenko, V. and Croft, W. B. 2001. Relevance based language models. In Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’01). ACM, New York, 120--127.
  31. Leskovec, J., Backstrom, L., and Kleinberg, J. 2009. Meme-tracking and the dynamics of the news cycle. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’09). ACM, New York, 497--506.
  32. Lv, Y. and Zhai, C. 2009. Adaptive relevance feedback in information retrieval. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM’09). ACM, New York, 255--264.
  33. Macdonald, C. and Ounis, I. 2006. The TREC blogs06 collection: Creating and analyzing a blog test collection. Tech. rep. TR-2006-224, Department of Computer Science, University of Glasgow.
  34. Macdonald, C., Ounis, I., and Soboroff, I. 2008. Overview of the TREC 2007 blog track. In Proceedings of the 16th Text REtrieval Conference (TREC’07). NIST.
  35. Manning, C. D., Raghavan, P., and Schütze, H. 2008. Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK.
  36. Meij, E., Weerkamp, W., and de Rijke, M. 2009. A query model based on normalized log-likelihood. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM’09). ACM, New York, 1903--1906.
  37. Metzler, D. and Croft, W. B. 2005. A Markov random field model for term dependencies. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’05). ACM, New York, 472--479.
  38. Miller, D., Leek, T., and Schwartz, R. 1999. A hidden Markov model information retrieval system. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’99). ACM, New York, 214--221.
  39. Mishne, G. and de Rijke, M. 2006. A study of blog search. In Proceedings of the 28th European Conference on IR Research (ECIR’06). Lecture Notes in Computer Science, vol. 3936, Springer, 289--301.
  40. Ounis, I., de Rijke, M., Macdonald, C., Mishne, G., and Soboroff, I. 2007. Overview of the TREC-2006 blog track. In Proceedings of the 15th Text REtrieval Conference (TREC’06). NIST.
  41. Ounis, I., Macdonald, C., and Soboroff, I. 2009. Overview of the TREC-2008 blog track. In Proceedings of the 17th Text REtrieval Conference (TREC’08). NIST.
  42. Ponte, J. M. and Croft, W. B. 1998. A language modeling approach to information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’98). ACM, New York, 275--281.
  43. Qiu, Y. and Frei, H.-P. 1993. Concept based query expansion. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’93). ACM, New York, 160--169.
  44. Rocchio, J. 1971. Relevance feedback in information retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice Hall.
  45. Sakai, T. 2002. The use of external text data in cross-language information retrieval based on machine translation. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC’02). IEEE, 6--9.
  46. Sheldon, D., Shokouhi, M., Szummer, M., and Craswell, N. 2011. LambdaMerge: Merging the results of query reformulations. In Proceedings of the 4th ACM International Conference on Web Search and Data Mining (WSDM’11). ACM, New York, 795--804.
  47. Tao, T. and Zhai, C. 2006. Regularized estimation of mixture models for robust pseudo-relevance feedback. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’06). ACM, New York, 162--169.
  48. Weerkamp, W. 2011. Finding people and their utterances in social media. Ph.D. thesis, University of Amsterdam.
  49. Weerkamp, W. and de Rijke, M. 2008. Credibility improves topical blog post retrieval. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 923--931.
  50. Weerkamp, W. and de Rijke, M. 2012. Credibility-inspired ranking for blog post retrieval. Inf. Retriev. J. 15, 3--4, 243--277.
  51. Westerveld, T., Kraaij, W., and Hiemstra, D. 2002. Retrieving web pages using content, links, URLs and anchors. In Proceedings of the 10th Text REtrieval Conference (TREC’01). NIST.
  52. Xu, Y., Jones, G. J., and Wang, B. 2009. Query dependent pseudo-relevance feedback based on Wikipedia. In Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’09). ACM, New York, 59--66.
  53. Yan, R. and Hauptmann, A. 2007. Query expansion using probabilistic local feedback with application to multimedia retrieval. In Proceedings of the 16th ACM International Conference on Information and Knowledge Management (CIKM’07). ACM, New York, 361--370.
  54. Yin, Z., Shokouhi, M., and Craswell, N. 2009. Query expansion using external evidence. In Proceedings of the 31st European Conference on IR Research (ECIR’09). Lecture Notes in Computer Science, vol. 5478. Springer, 362--374.
  55. Zhang, W. and Yu, C. 2007. UIC at TREC 2006 blog track. In Proceedings of the 15th Text REtrieval Conference (TREC’06). NIST.

Published in

ACM Transactions on the Web, Volume 6, Issue 4
November 2012
138 pages
ISSN: 1559-1131
EISSN: 1559-114X
DOI: 10.1145/2382616

Copyright © 2012 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 1 November 2012
• Accepted: 1 July 2012
• Revised: 1 March 2012
• Received: 1 June 2011

Qualifiers

• research-article
• Research
• Refereed
