research-article

Scalable and Efficient Web Search Result Diversification

Published: 16 August 2016

Abstract

It has been shown that top-k retrieval quality can be considerably improved by taking not only relevance but also diversity into account. However, existing diversification approaches have paid little attention to practical usability in large-scale settings, such as modern web search systems. In this work, we make two contributions toward this goal. First, we propose a combination of optimizations and heuristics for an implicit diversification algorithm based on the desirable facility placement principle, and present two algorithms that achieve linear complexity without compromising retrieval effectiveness. Instead of exhaustively comparing documents, these algorithms first perform a clustering phase and then exploit its outcome to compose the diverse result set. Second, we describe and analyze two variants of distributed diversification in a computing cluster, for large-scale IR where the document collection is too large to keep on one node. Our contribution in this direction is pioneering, as no earlier work in the literature investigates the effectiveness and efficiency of diversification in a distributed setup. Extensive evaluations on a standard TREC framework demonstrate that the proposed optimizations achieve retrieval quality competitive with the baseline algorithm while reducing processing time by more than 80% and up to 97%, and shed light on the efficiency and effectiveness tradeoffs of diversification when applied on top of a distributed architecture.
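
The cluster-then-compose idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a simple single-pass "leader" clustering over documents sorted by relevance, followed by a round-robin pick across clusters. The names `cosine` and `cluster_then_select`, the `threshold` parameter, and the `(doc_id, relevance, vector)` input format are all hypothetical choices made for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_then_select(docs, k, threshold=0.8):
    """Return up to k doc_ids from docs, a list of (doc_id, relevance, vector).

    Illustrative assumption: clusters are formed by comparing each document
    only against cluster leaders, never against every other document.
    """
    # Phase 1: single-pass leader clustering in decreasing relevance order.
    ranked = sorted(docs, key=lambda d: d[1], reverse=True)
    clusters = []  # each cluster is a list of docs; element 0 is its leader
    for doc in ranked:
        for cluster in clusters:
            if cosine(doc[2], cluster[0][2]) >= threshold:
                cluster.append(doc)
                break
        else:
            # No sufficiently similar leader found: start a new cluster.
            clusters.append([doc])

    # Phase 2: round-robin over clusters, taking each cluster's next-best doc.
    result, depth = [], 0
    while len(result) < k and any(depth < len(c) for c in clusters):
        for cluster in clusters:
            if depth < len(cluster) and len(result) < k:
                result.append(cluster[depth][0])
        depth += 1
    return result


docs = [
    ("a", 0.9, [1.0, 0.0]),
    ("b", 0.8, [1.0, 0.1]),
    ("c", 0.7, [0.0, 1.0]),
    ("d", 0.6, [0.1, 1.0]),
]
top2 = cluster_then_select(docs, k=2)
```

With this toy input, `a` and `b` fall into one cluster and `c` and `d` into another, so the top-2 result takes one document from each cluster rather than the two most relevant near-duplicates.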

      • Published in

        ACM Transactions on the Web, Volume 10, Issue 3
        August 2016
        201 pages
        ISSN:1559-1131
        EISSN:1559-114X
        DOI:10.1145/2988335

        Copyright © 2016 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 16 August 2016
        • Revised: 1 March 2016
        • Accepted: 1 March 2016
        • Received: 1 December 2014

        Qualifiers

        • research-article
        • Research
        • Refereed
