Abstract
Pseudo-relevance feedback mechanisms, from Rocchio to the relevance models, have shown the usefulness of expanding and reweighting the users' initial queries using information occurring in an initial set of retrieved documents, known as the pseudo-relevant set. Recently, dense retrieval – through the use of neural contextual language models such as BERT for analysing the documents' and queries' contents and computing their relevance scores – has shown promising performance on several information retrieval tasks, compared with classical approaches that rely on the traditional inverted index for identifying documents relevant to a query. Two different dense retrieval families have emerged: the use of a single embedded representation for each passage and query, e.g., using BERT's [CLS] token, or multiple representations, e.g., using an embedding for each token of the query and document (exemplified by ColBERT). In this work, we conduct the first study into the potential for multiple-representation dense retrieval to be enhanced using pseudo-relevance feedback, and present our proposed approach, ColBERT-PRF. In particular, based on the pseudo-relevant set of documents identified by a first-pass dense retrieval, ColBERT-PRF extracts representative feedback embeddings from the document embeddings of the pseudo-relevant set. Among these representative feedback embeddings, those that most highly discriminate among documents are employed as expansion embeddings, which are then added to the original query representation. We show that these additional expansion embeddings enhance both the effectiveness of reranking the initial query results and that of an additional dense retrieval operation. Indeed, experiments on the MSMARCO passage ranking dataset show that MAP can be improved by up to 26% on the TREC 2019 query set and 10% on the TREC 2020 query set by the application of our proposed ColBERT-PRF approach.
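The expansion mechanism described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the representative feedback embeddings are obtained by clustering the pooled token embeddings of the pseudo-relevant documents (e.g., with KMeans), and it uses a hypothetical `idf_of` callable to score how discriminative each candidate embedding is, standing in for the paper's token-level IDF lookup.

```python
import numpy as np
from sklearn.cluster import KMeans

def colbert_prf_expand(query_embs, doc_embs, idf_of,
                       k_clusters=8, top_fb=3, beta=1.0):
    """Sketch of a ColBERT-PRF-style query expansion.

    query_embs : (q, d) array of query token embeddings.
    doc_embs   : list of (n_i, d) arrays, the token embeddings of the
                 pseudo-relevant documents from a first-pass retrieval.
    idf_of     : hypothetical callable mapping an embedding to a
                 discriminativeness score (stand-in for token IDF).
    """
    # Pool all document token embeddings and cluster them; the cluster
    # centroids act as the representative feedback embeddings.
    pooled = np.vstack(doc_embs)
    km = KMeans(n_clusters=k_clusters, n_init=10, random_state=0).fit(pooled)
    centroids = km.cluster_centers_

    # Keep only the most discriminative centroids as expansion embeddings.
    scores = np.array([idf_of(c) for c in centroids])
    keep = np.argsort(scores)[::-1][:top_fb]

    # Weight the expansion embeddings and append them to the query.
    expanded_query = np.vstack([query_embs, beta * centroids[keep]])
    return expanded_query, scores[keep]
```

The expanded query can then be used either to rerank the first-pass results or to issue a second dense retrieval pass, the two settings evaluated in the paper.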
ColBERT-PRF: Semantic Pseudo-Relevance Feedback for Dense Passage and Document Retrieval