Abstract
A persistent challenge in information retrieval is the vocabulary mismatch between a user’s information need and the relevant documents. One way of addressing this issue is query modeling: adding terms to the original query and reweighting the terms. In social media, where documents often contain creative and noisy language (e.g., spelling and grammatical errors), query modeling proves difficult. To address this, earlier work has used external sources for query modeling, with apparent success. In this article we propose a general generative query expansion model that uses external document collections for term generation: the External Expansion Model (EEM). The main rationale behind our model is the hypothesis that each query requires its own mixture of external collections for expansion and that an expansion model should account for this. For some queries we expect, for example, a news collection to be most beneficial, while for other queries we could benefit more from selecting terms from a general encyclopedia. EEM allows for query-dependent weighting of the external collections.
We evaluate our model on the task of blog post retrieval, using four external collections in our experiments: (i) a news collection, (ii) a Web collection, (iii) Wikipedia, and (iv) a blog post collection. Experiments show that EEM outperforms query expansion on the individual collections, as well as the Mixture of Relevance Models previously proposed by Diaz and Metzler [2006]. Extensive analysis of the results shows that our naive approach to estimating query-dependent collection importance works reasonably well and that “oracle” settings reveal the full potential of our model. We also find that the query-dependent collection importance has more impact on retrieval performance than the query-independent collection importance (i.e., a collection prior).
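The idea of a query-dependent mixture of external collections can be illustrated with a minimal sketch. It is not the EEM estimation described in the article: the abstract does not give the formulas, so the code simply assumes that each collection already supplies a term distribution for the query and that a query-dependent importance weight per collection is given. The function name `mix_expansion_terms`, the parameter names, and the toy numbers are all hypothetical.

```python
from collections import defaultdict


def mix_expansion_terms(term_dists, collection_weights, top_k=10):
    """Mix per-collection term distributions into one expansion model.

    term_dists: {collection name: {term: probability of term for this query}}
    collection_weights: {collection name: non-negative importance for this query}
    Both inputs are assumed to be estimated elsewhere (hypothetical inputs).
    """
    total = sum(collection_weights.values())
    if total <= 0:
        raise ValueError("collection weights must sum to a positive value")

    mixed = defaultdict(float)
    for name, dist in term_dists.items():
        # Normalize so the collection weights form a distribution.
        weight = collection_weights.get(name, 0.0) / total
        for term, prob in dist.items():
            mixed[term] += weight * prob

    # Keep the top_k terms (ties broken alphabetically) and renormalize.
    ranked = sorted(mixed.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    norm = sum(prob for _, prob in ranked)
    if norm == 0:
        return {}
    return {term: prob / norm for term, prob in ranked}


if __name__ == "__main__":
    # Toy numbers, invented for illustration only.
    term_dists = {
        "news": {"election": 0.6, "vote": 0.4},
        "wikipedia": {"election": 0.2, "democracy": 0.8},
    }
    # A query for which the news collection is assumed to matter more.
    weights = {"news": 3.0, "wikipedia": 1.0}
    print(mix_expansion_terms(term_dists, weights, top_k=2))
```

Changing the weights per query changes which collection dominates the expansion terms, which is the behavior the abstract attributes to query-dependent collection importance. A fixed set of weights shared by all queries would correspond to using only a collection prior.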
- Amati, G., Carpineto, C., and Romano, G. 2004. Query difficulty, robustness, and selective application of query expansion. In Proceedings of the 26th European Conference on IR Research (ECIR’04). Lecture Notes in Computer Science Series, vol. 2997. Springer, 127--137.
- Aquaint-2. 2007. http://trec.nist.gov/data/qa/2007_qadata/qa.07.guidelines.html#documents.
- Arguello, J., Elsas, J., Callan, J., and Carbonell, J. 2008. Document representation and query expansion models for blog recommendation. In Proceedings of the 2nd International Conference on Weblogs and Social Media (ICWSM’08). AAAI Press.
- Baeza-Yates, R. and Ribeiro-Neto, B. 2011. Modern Information Retrieval: The Concepts and Technology behind Search. Addison-Wesley Professional.
- Balog, K., Weerkamp, W., and de Rijke, M. 2008. A few examples go a long way: Constructing query models from elaborate query formulations. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’08). ACM, New York, 371--378.
- Balog, K., Meij, E., Weerkamp, W., He, J., and de Rijke, M. 2009. The University of Amsterdam at TREC 2008: Blog, enterprise, and relevance feedback. In Proceedings of the 17th Text REtrieval Conference (TREC’08). NIST.
- Cao, G., Nie, J.-Y., Gao, J., and Robertson, S. 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’08). ACM, New York, 243--250.
- Cartright, M.-A., Seo, J., and Lease, M. 2009. UMass Amherst and UT Austin @ the TREC 2009 relevance feedback track. In Proceedings of the 18th Text REtrieval Conference (TREC’09). NIST.
- ClueWeb09. 2009. http://www.lemurproject.org/clueweb09.php/.
- Cronen-Townsend, S., Zhou, Y., and Croft, W. B. 2004. A framework for selective query expansion. In Proceedings of the 13th ACM International Conference on Information and Knowledge Management (CIKM’04). ACM, New York, 236--237.
- Diaz, F. and Metzler, D. 2006. Improving the estimation of relevance models using large external corpora. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’06). ACM, New York, 154--161.
- Dong, A., Chang, Y., Zheng, Z., Mishne, G., Bai, J., Zhang, R., Buchner, K., Liao, C., and Diaz, F. 2010. Towards recency ranking in web search. In Proceedings of the 3rd ACM International Conference on Web Search and Data Mining (WSDM’10). ACM, New York, 11--20.
- Elsas, J., Arguello, J., Callan, J., and Carbonell, J. 2008a. Retrieval and feedback models for blog distillation. In Proceedings of the 16th Text REtrieval Conference (TREC’07). NIST.
- Elsas, J. L., Arguello, J., Callan, J., and Carbonell, J. G. 2008b. Retrieval and feedback models for blog feed search. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’08). ACM, New York, 347--354.
- Ernsting, B. J., Weerkamp, W., and de Rijke, M. 2008. The University of Amsterdam at the TREC 2007 blog track. In Proceedings of the 16th Text REtrieval Conference (TREC’07). NIST.
- Fautsch, C. and Savoy, J. 2009. UniNE at TREC 2008: Fact and opinion retrieval in the blogsphere. In Proceedings of the 17th Text REtrieval Conference (TREC’08). NIST.
- Hawking, D., Bailey, P., and Craswell, N. 1999. ACSys TREC-8 experiments. In Proceedings of the 8th Text REtrieval Conference (TREC’99). NIST.
- He, B. and Ounis, I. 2007. Combining fields for query expansion and adaptive query expansion. Inf. Process. Manag. 43, 5, 1294--1307.
- He, B. and Ounis, I. 2009. Finding good feedback documents. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM’09). ACM, New York, 2011--2014.
- Hiemstra, D. 2001. Using language models for information retrieval. Ph.D. thesis, University of Twente.
- Hofmann, K. and Weerkamp, W. 2008. Content extraction for information retrieval in blogs and intranets. Tech. rep., University of Amsterdam, ISLA.
- Java, A., Kolari, P., Finin, T., Joshi, A., and Martineau, J. 2007. The BlogVox opinion retrieval system. In Proceedings of the 15th Text REtrieval Conference (TREC’06). NIST.
- Jijkoun, V., de Rijke, M., and Weerkamp, W. 2010. Generating focused topic-specific sentiment lexicons. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 585--594.
- Kamps, J., de Rijke, M., and Sigurbjörnsson, B. 2004. Length normalization in XML retrieval. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’04). ACM, New York, 80--87.
- Kurland, O., Lee, L., and Domshlak, C. 2005. Better than the real thing?: Iterative pseudo-query processing using cluster-based language models. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’05). ACM, New York, 19--26.
- Kwak, H., Lee, C., Park, H., and Moon, S. 2010. What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web (WWW’10). ACM, New York, 591--600.
- Kwok, K. L., Grunfeld, L., Dinstl, N., and Chan, M. 2001. TREC-9 cross language, web and question-answering track experiments using PIRCS. In Proceedings of the 9th Text REtrieval Conference (TREC-9). NIST.
- Lafferty, J. and Zhai, C. 2003. Probabilistic relevance models based on document and query generation. In Language Modeling for Information Retrieval, Kluwer International Series on Information Retrieval, Springer.
- Lancaster, F. 1968. Information Retrieval Systems: Characteristics, Testing and Evaluation. John Wiley & Sons, New York.
- Lavrenko, V. and Croft, W. B. 2001. Relevance based language models. In Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’01). ACM, New York, 120--127.
- Leskovec, J., Backstrom, L., and Kleinberg, J. 2009. Meme-tracking and the dynamics of the news cycle. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’09). ACM, New York, 497--506.
- Lv, Y. and Zhai, C. 2009. Adaptive relevance feedback in information retrieval. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM’09). ACM, New York, 255--264.
- Macdonald, C. and Ounis, I. 2006. The TREC Blogs06 collection: Creating and analyzing a blog test collection. Tech. rep. TR-2006-224, Department of Computer Science, University of Glasgow.
- Macdonald, C., Ounis, I., and Soboroff, I. 2008. Overview of the TREC 2007 blog track. In Proceedings of the 16th Text REtrieval Conference (TREC’07). NIST.
- Manning, C. D., Raghavan, P., and Schütze, H. 2008. Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK.
- Meij, E., Weerkamp, W., and de Rijke, M. 2009. A query model based on normalized log-likelihood. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM’09). ACM, New York, 1903--1906.
- Metzler, D. and Croft, W. B. 2005. A Markov random field model for term dependencies. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’05). ACM, New York, 472--479.
- Miller, D., Leek, T., and Schwartz, R. 1999. A hidden Markov model information retrieval system. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’99). ACM, New York, 214--221.
- Mishne, G. and de Rijke, M. 2006. A study of blog search. In Proceedings of the 28th European Conference on IR Research (ECIR’06). Lecture Notes in Computer Science, vol. 3936, Springer, 289--301.
- Ounis, I., de Rijke, M., Macdonald, C., Mishne, G., and Soboroff, I. 2007. Overview of the TREC-2006 blog track. In Proceedings of the 15th Text REtrieval Conference (TREC’06). NIST.
- Ounis, I., Macdonald, C., and Soboroff, I. 2009. Overview of the TREC-2008 blog track. In Proceedings of the 17th Text REtrieval Conference (TREC’08). NIST.
- Ponte, J. M. and Croft, W. B. 1998. A language modeling approach to information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’98). ACM, New York, 275--281.
- Qiu, Y. and Frei, H.-P. 1993. Concept based query expansion. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’93). ACM, New York, 160--169.
- Rocchio, J. 1971. Relevance feedback in information retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice Hall.
- Sakai, T. 2002. The use of external text data in cross-language information retrieval based on machine translation. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC’02). IEEE, 6--9.
- Sheldon, D., Shokouhi, M., Szummer, M., and Craswell, N. 2011. LambdaMerge: Merging the results of query reformulations. In Proceedings of the 4th ACM International Conference on Web Search and Data Mining (WSDM’11). ACM, New York, 795--804.
- Tao, T. and Zhai, C. 2006. Regularized estimation of mixture models for robust pseudo-relevance feedback. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’06). ACM, New York, 162--169.
- Weerkamp, W. 2011. Finding people and their utterances in social media. Ph.D. thesis, University of Amsterdam.
- Weerkamp, W. and de Rijke, M. 2008. Credibility improves topical blog post retrieval. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 923--931.
- Weerkamp, W. and de Rijke, M. 2012. Credibility-inspired ranking for blog post retrieval. Inf. Retriev. J. 15, 3--4, 243--277.
- Westerveld, T., Kraaij, W., and Hiemstra, D. 2002. Retrieving web pages using content, links, URLs and anchors. In Proceedings of the 10th Text REtrieval Conference (TREC’01). NIST.
- Xu, Y., Jones, G. J., and Wang, B. 2009. Query dependent pseudo-relevance feedback based on Wikipedia. In Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’09). ACM, New York, 59--66.
- Yan, R. and Hauptmann, A. 2007. Query expansion using probabilistic local feedback with application to multimedia retrieval. In Proceedings of the 16th ACM International Conference on Information and Knowledge Management (CIKM’07). ACM, New York, 361--370.
- Yin, Z., Shokouhi, M., and Craswell, N. 2009. Query expansion using external evidence. In Proceedings of the 31st European Conference on IR Research (ECIR’09). Lecture Notes in Computer Science, vol. 5478. Springer, 362--374.
- Zhang, W. and Yu, C. 2007. UIC at TREC 2006 blog track. In Proceedings of the 15th Text REtrieval Conference (TREC’06). NIST.
Index Terms
Exploiting External Collections for Query Expansion