Abstract
It has been shown that top-k retrieval quality can be considerably improved by taking into account not only relevance but also diversity. However, the diversification approaches proposed so far have paid little attention to practical usability in large-scale settings, such as modern web search systems. In this work, we make two contributions toward this goal. First, we propose a combination of optimizations and heuristics for an implicit diversification algorithm based on the desirable facility placement principle, and present two algorithms that achieve linear complexity without compromising retrieval effectiveness. Instead of comparing documents exhaustively, these algorithms first perform a clustering phase and then exploit its outcome to compose the diverse result set. Second, we describe and analyze two variants of distributed diversification in a computing cluster, for large-scale IR where the document collection is too large to keep on one node. This contribution is pioneering: no earlier work in the literature investigates the effectiveness and efficiency of diversification in a distributed setup. Extensive evaluations on a standard TREC framework demonstrate that the proposed optimizations achieve retrieval quality competitive with the baseline algorithm while reducing the processing time by more than 80% and up to 97%, and shed light on the efficiency and effectiveness tradeoffs of diversification when applied on top of a distributed architecture.
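To make the cluster-then-select idea concrete, the following is a minimal illustrative sketch and not the algorithms proposed in the paper. It assumes that candidates arrive already sorted by descending relevance, that each document carries a small feature vector, and that a single-pass leader clustering with a capped number of clusters is an acceptable stand-in for the clustering phase. The names `cluster_then_select`, `threshold`, and `max_clusters` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical document representation: (doc_id, relevance, feature vector).
Doc = Tuple[str, float, Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_then_select(
    ranked: Sequence[Doc],
    k: int,
    threshold: float = 0.8,
    max_clusters: Optional[int] = None,
) -> List[str]:
    """Pick k document ids by clustering first, then drawing from each cluster.

    Assumption: `ranked` is already sorted by descending relevance, so the
    first member of every cluster (its leader) is its most relevant document.
    """
    if max_clusters is None:
        max_clusters = k

    # Phase 1: single-pass leader clustering. Each document is compared only
    # with the cluster leaders (at most max_clusters of them), never with
    # every other document, so the pass costs O(n * max_clusters).
    clusters: List[List[Doc]] = []
    for doc in ranked:
        best, best_sim = None, -1.0
        for cluster in clusters:
            sim = cosine(doc[2], cluster[0][2])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and (best_sim >= threshold or len(clusters) >= max_clusters):
            best.append(doc)
        else:
            clusters.append([doc])

    # Phase 2: round-robin over clusters, taking the next most relevant
    # document from each one until k results are collected.
    result: List[str] = []
    depth = 0
    while len(result) < k:
        added = False
        for cluster in clusters:
            if depth < len(cluster):
                result.append(cluster[depth][0])
                added = True
                if len(result) == k:
                    break
        if not added:
            break  # fewer than k candidates were available
        depth += 1
    return result
```

For example, given three near-duplicate documents about one aspect and one document about another, a call with `k=2` returns the top document of each cluster rather than the two most relevant near-duplicates. The cap on the number of clusters is what keeps the comparison count proportional to the number of candidates in this sketch.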
Scalable and Efficient Web Search Result Diversification