Unsupervised Dense Retrieval Training with Web Anchors

In this work, we present an unsupervised retrieval method based on contrastive learning over web anchors. Anchor text describes the content of the page it links to, which resembles a search query that aims to retrieve pertinent information from a relevant document. Based on this commonality, we train an unsupervised dense retriever, Anchor-DR, with a contrastive learning task that matches anchor text to the linked document. To filter out uninformative anchors (such as "homepage" or other functional anchors), we present a novel filtering technique that selects only anchors containing the same types of information as search queries. Experiments show that Anchor-DR outperforms state-of-the-art unsupervised dense retrieval methods by a large margin (e.g., by 5.3% nDCG@10 on MSMARCO). The gain of our method is especially significant for search and question answering tasks. Our analysis further reveals that the pattern of anchor-document pairs is similar to that of search query-document pairs. Code is available at https://github.com/Veronicium/AnchorDR.


INTRODUCTION
Dense retrieval matches queries and documents in the embedding space [15,16,26], which captures the semantic meaning of the text and handles more complex queries than traditional sparse retrieval methods [23]. Due to the scarcity of labeled data in certain domains, such as the legal and medical domains, numerous recent studies have focused on unsupervised dense retrieval, which trains dense retrievers without annotations [11,13,14,18]. One of the most common approaches to unsupervised dense retrieval is to design a contrastive learning task that approximates retrieval [3,11,13,14,18,19,22], yet it is nontrivial to construct contrastive pairs. Most existing methods construct contrastive pairs from the same context, such as a sentence and its surrounding context [14], or two individual text spans in a document [11,13,19]. The relation between these co-document pairs differs from that of query-document pairs in search or question answering, where the query seeks information from the document. LinkBERT [27] leverages text spans sampled from a pair of linked Wikipedia pages. However, such text spans are not guaranteed to be highly relevant to each other. A few other methods train a model to generate queries from documents [2,18], but they require either large language models or huge amounts of training data.
In this work, we present Anchor-DR, an unsupervised dense retriever that is trained to predict the linked document of an anchor given its anchor text. The anchor text of a hyperlink typically contains descriptive information that the source document cites from the linked page, suggesting that anchor-document pairs resemble query-document pairs in search, where the query describes the information that the user requires from the relevant document. We therefore train Anchor-DR to match anchor text and its linked document with a contrastive objective.
Although the relation between anchor-document pairs is typically similar to that between search queries and relevant documents, there also exist a large number of uninformative anchors. For example, a web document may use an anchor link merely to redirect to the linked document (e.g., "homepage" or "website"). Such anchor-document pairs do not resemble the relation between search queries and documents and may introduce noise into our model. We thus design a few heuristic rules to filter out functional anchors, such as headers/footers or anchors within the same domain. In addition, we train a classifier on a small number of high-quality search queries to further identify anchors that contain the same types of information as real search queries.
Experimental results show that Anchor-DR outperforms state-of-the-art unsupervised dense retrievers by a large margin on two widely adopted retrieval datasets, MSMARCO [1] and BEIR [24] (e.g., by 5.3% nDCG@10 on MSMARCO). The improvement of Anchor-DR is most significant on search and question answering tasks, suggesting that, compared to the contextual relation between co-document text spans [11,13], the referral relation between anchor-document pairs is closer to the information-seeking relation between search queries and documents. We further present examples showing that anchor-document pairs indeed follow similar patterns to query-document pairs.

RELATED WORK
Dense Retrieval. Dense retrieval uses dense vector representations of text to retrieve relevant documents [5,12]. With the development of pretrained language models [6,15], recent works have developed various techniques for dense retrieval, including retrieval-oriented pretraining [11,13,19] and negative selection [26]. While dense retrieval has exhibited remarkable effectiveness compared to traditional sparse retrieval approaches [23], its benefits are generally confined to supervised settings with an adequate amount of human annotations [24].

Unsupervised dense retrieval. Previous work on unsupervised dense retrieval mainly adopts contrastive learning for model training. ICT [14] matches a random sentence with its surrounding context. SPAR [3] uses random sentences as queries, with positive and negative passages ranked by BM25 scores. coCondenser [11], COCO-LM [19], and Contriever [13] regard independent text spans in one document as positive pairs. QExt [18] further improves on this line of work by selecting the text span with the highest relevance computed by an existing pretrained model. A few other works use neural models to generate queries, such as question-like queries [2] or the topic, title, and summary of the document [18]. However, both approaches require a large-scale generation system.

Leveraging web anchors in retrieval. Web anchors have been widely applied in classic approaches to information retrieval [4,7,8,10,28]. Recently, HARP [17] designed several pretraining objectives leveraging anchor text, including representative query prediction and query disambiguation modeling. ReInfoSelect [29] learns to select the anchor-document pairs that best weakly supervise a neural ranker. However, these methods either focus on classic bag-of-words modeling or apply a cross-encoder architecture that does not fit the setting of dense retrieval.

METHODOLOGY
We present an unsupervised dense retrieval method that trains the model to match the representations of anchor text and its linked document.This section describes the contrastive learning task of anchor-document prediction and the anchor filtering process.

Contrastive Learning with Anchor-Document Pairs

Based on the commonalities between anchor-document pairs and query-document pairs [4,7,8,10,28], we compute the representation of each anchor and document with our model, Anchor-DR, and train it with a contrastive objective that matches anchor text and its linked document:

L(a, d^+) = -\log \frac{\exp(\langle g(a), g(d^+) \rangle)}{\exp(\langle g(a), g(d^+) \rangle) + \sum_{d^- \in N(a)} \exp(\langle g(a), g(d^-) \rangle)},

where g is our presented model, Anchor-DR, with T5 [21] as its backbone; the sequence embedding g(·) is the embedding of the first token output by the decoder of Anchor-DR; (a, d^+) is the anchor text and its linked document; and N(a) is the set of negative documents sampled from the whole dataset. In practice, we use BM25 negatives in the first iteration [15] and negatives mined by Anchor-DR in the following iterations [26]. At inference time, we feed the query and each document into Anchor-DR separately and use the embedding of the first token in the decoder output as the sequence embedding. We then rank all documents by their similarity to the query: s(q, d) = ⟨g(q), g(d)⟩, where g denotes Anchor-DR.
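Concretely, the training objective is a standard InfoNCE loss over inner-product similarities. The sketch below (plain Python, with toy 3-dimensional vectors standing in for Anchor-DR's first-token decoder embeddings; all vectors and names are illustrative, not from the paper) shows how one anchor is scored against its linked document and a set of sampled negatives:

```python
import math

def infonce_loss(sim_pos, sim_negs):
    """InfoNCE loss for one (anchor, document) pair.

    sim_pos:  similarity <g(a), g(d+)> with the linked document
    sim_negs: similarities <g(a), g(d-)> with sampled negative documents
    """
    denom = math.exp(sim_pos) + sum(math.exp(s) for s in sim_negs)
    return -math.log(math.exp(sim_pos) / denom)

def dot(u, v):
    """Inner product, also used to rank documents at inference time."""
    return sum(a * b for a, b in zip(u, v))

# Toy 3-dim embeddings standing in for g(.), the first-token decoder output.
anchor = [0.9, 0.1, 0.0]
linked_doc = [0.8, 0.2, 0.1]
negatives = [[0.0, 0.9, 0.4], [0.1, 0.0, 0.8]]

loss = infonce_loss(dot(anchor, linked_doc),
                    [dot(anchor, d) for d in negatives])
```

Raising the positive similarity relative to the negatives drives the loss toward zero, which is what pushes anchor and linked-document embeddings together during training.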

Anchor Filtering
While some anchor-document pairs exhibit strong similarities with query-document pairs in search, others do not. For instance, "homepage" or "website" and their linked documents hold entirely different relations from query-document pairs. Including such pairs in the training data may introduce noise into our model. We therefore first apply a few heuristic rules and then train a lightweight classifier to filter out uninformative anchor text.

Anchor filtering with heuristic rules. We observe that a large number of uninformative anchors are functional anchors, and these mainly occur between pages within the same website. Consequently, we filter out anchor text that falls into the following categories: (1) in-domain anchors, where the source and target page share the same domain; (2) headers or footers, which are detected by specific HTML tags such as <header> and <footer>; and (3) keywords indicating functionalities, which are manually selected from the 500 most frequent anchors.

Anchor filtering with query classifier. We train a lightweight query classifier to learn the types of information typically contained in search queries about relevant documents. Specifically, we use the ad-hoc queries provided by WebTrack [9] as positive examples. This small set of queries was manually selected to reflect important characteristics of authentic web search queries for each year. As negative examples, we sample a subset of anchors before rule-based filtering, with the same size as the positive set. We train the query classifier f with the cross-entropy loss:

L_CE = -[y \log f(a) + (1 - y) \log(1 - f(a))],

where f is a miniBERT-based [25] model, a is the input text, and y indicates whether a is a real search query. After training the query classifier, we rank all anchor text by the logit of the positive class (i.e., similarity to search queries) and keep only the top 25%.
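A minimal sketch of this two-stage filter is shown below. The keyword list and the classifier scores are hypothetical placeholders (the paper's actual keyword list and trained classifier are not reproduced here); the structure mirrors the three heuristic rules followed by top-25% selection:

```python
from urllib.parse import urlparse

# Illustrative stand-in; the paper manually selects functional keywords
# from the 500 most frequent anchors.
FUNCTIONAL_KEYWORDS = {"homepage", "website", "click here", "read more"}

def keep_anchor(anchor_text, source_url, target_url, in_header_or_footer):
    """Apply the three heuristic rules: drop in-domain anchors,
    header/footer anchors, and functional-keyword anchors."""
    if urlparse(source_url).netloc == urlparse(target_url).netloc:
        return False                      # rule 1: in-domain anchor
    if in_header_or_footer:
        return False                      # rule 2: <header>/<footer> anchor
    if anchor_text.strip().lower() in FUNCTIONAL_KEYWORDS:
        return False                      # rule 3: functional keyword
    return True

def top_quartile(anchors, scores):
    """Keep the top 25% of anchors by the classifier's positive-class logit."""
    ranked = sorted(zip(anchors, scores), key=lambda x: -x[1])
    return [a for a, _ in ranked[: max(1, len(ranked) // 4)]]
```

In this sketch the rules act as a cheap first pass over billions of anchors, and the learned classifier only has to rank what survives, which is why it can stay lightweight.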

EXPERIMENTS
In this section, we describe the experiment setups, compare Anchor-DR with baselines and ablations, and analyze its effectiveness.

Experimental Setup
We evaluate Anchor-DR on two public datasets, MSMARCO [1] and BEIR [24], for unsupervised retrieval, where we directly apply the methods to encode test queries and documents without supervision. We report nDCG@10 following previous work [13,18].
Table 2: Unsupervised retrieval results on MSMARCO and BEIR under nDCG@10. The best result for each task is marked in bold. The best result among dense retrievers is underlined. We follow previous work [13] and report the average performance on 14 BEIR tasks and MSMARCO (BEIR14+MM). The results of coCondenser and results with † are evaluated using their released checkpoints. The results of other baselines are copied from their original papers.

Training data. We train Anchor-DR on a subset of the ClueWeb22 dataset [20]. To preprocess the data, we first randomly sample a subset of English documents with at least one in-link. We then apply our rules and the trained query classifier to filter out uninformative anchors, as introduced in Sec. 3.2. Finally, we sample at most 5 in-links for each document. The statistics of the anchors and documents after each filtering step are shown in Table 1. Note that ClueWeb22 has a total of 52.7B anchors, so we can further scale up our model in the future.

Implementation details. For continuous pretraining on anchor-document prediction, we train our model with BM25 negatives for one epoch and with ANCE negatives [26] for another epoch. We use a learning rate of 1e-5 and a batch size of 128 positive pairs. The query classifier is trained on the ad-hoc test queries of WebTrack 2009-2014 [9], which contain 300 queries in total.
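The ANCE-style mining in the second epoch amounts to ranking the corpus with the current model and taking top-ranked non-positive documents as hard negatives. The sketch below is a simplified, illustrative version over raw embedding lists; the function name and toy vectors are our own, not the paper's:

```python
def mine_hard_negatives(query_emb, doc_embs, positive_idx, k=2):
    """Rank all documents by inner product with the query embedding and
    return the top-k non-positive documents as hard negatives."""
    scores = [(i, sum(a * b for a, b in zip(query_emb, d)))
              for i, d in enumerate(doc_embs)]
    ranked = sorted(scores, key=lambda x: -x[1])
    return [i for i, _ in ranked if i != positive_idx][:k]

# Toy corpus: document 0 is the true positive; 1 and 3 score highest
# among the rest, so they become the hard negatives.
negs = mine_hard_negatives([1.0, 0.0],
                           [[1.0, 0.0], [0.9, 0.0], [0.0, 1.0], [0.5, 0.0]],
                           positive_idx=0)
```

In practice this ranking would use an approximate nearest-neighbor index over the full corpus rather than an exhaustive scan, but the selection logic is the same.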
Baselines. We compare Anchor-DR with a sparse retrieval method, BM25 [23], and four unsupervised dense retrieval methods: coCondenser [11], Contriever [13], SPAR Λ (trained on Wikipedia) [3], and QExt-PLM (trained on Pile-CC with MoCo) [18]. All these dense retrieval methods construct contrastive pairs in an unsupervised way: by rules [11,13], lexical features [3], or pretrained models [18]. Note that we do not compare with methods that require a large-scale generation system to produce contrastive pairs, such as QGen [18] or InPars [2], as their generators either require additional human annotations or are significantly larger than our model (e.g., 6B vs. 220M parameters). For ablation studies, we substitute the anchor-document prediction task with two other contrastive tasks: ICT [14], which treats a document and a sentence randomly selected from it as a positive pair, and co-doc [11], which treats two text sequences from the same document as a positive pair. We also compare to Anchor (rule only), which removes the query classifier and uses only rules to filter anchors. For a fair comparison, we train all ablations on the same subset of documents in ClueWeb22.

Main Results
Table 2 shows the unsupervised retrieval results on MSMARCO and BEIR. Anchor-DR outperforms all the dense retrieval baselines on MSMARCO and BEIR by a large margin (e.g., by 2.9% nDCG@10 on BEIR14+MM and 3.8% on all datasets). Furthermore, compared to other dense retrievers, Anchor-DR achieves the best performance on a majority of datasets, indicating that our method generalizes to a wide range of domains and retrieval tasks.
We observe that Anchor-DR exhibits strong performance in specific subsets of tasks.For instance, Anchor-DR achieves a large performance gain of 11.8% nDCG@10 on TREC-COVID, but it is outperformed by other baseline methods on ArguAna and Quora.

Ablation Study
To demonstrate the effectiveness of our anchor-document prediction task, we perform ablation studies in Table 3. We observe that Anchor-DR outperforms both methods. Additionally, ICT and co-doc have less than a 1% performance gap on 7 out of 19 datasets, probably because the contrastive learning pairs in both methods contain contextual information about each other. Anchor-DR also outperforms Anchor (rule only), indicating that it is effective to train on anchor text with higher similarity to search queries.

Table 4: Examples of query-document pairs in two BEIR datasets (ArguAna and TREC-COVID), co-document text pairs (co-doc), and anchor-document pairs (Anchor-DR).

Dataset: ArguAna
Query: Becoming a vegetarian is an environmentally friendly thing to do. Modern farming is one of the main sources of pollution in our rivers, and as long as people continue to buy fast food ...
Document: Health general weight philosophy ethics You don't have to be vegetarian to be green. Many special environments have been created by livestock farming, for example chalk down land in England and mountain pastures ...

Dataset: TREC-COVID
Query: what causes death from Covid-19?
Document: Predicting the ultimate outcome of the COVID-19 outbreak in Italy: During the COVID-19 outbreak, it is essential to monitor the effectiveness of measures taken by governments on the course of the epidemic. Here we show that there is already a sufficient amount of data collected in Italy to predict the outcome of the process ...

Method: co-doc
Query #1: Going vegetarian is one of the best things you can do for your health.
Document #1: We publish a quarterly magazine, The Irish Vegetarian, with features and our roundup of news and events of interest to Irish vegetarians. Get involved! There are lots of ways to get involved. You can read our Going Vegetarian page. You can pick up a copy of The Irish Vegetarian. You can come to a Meetup meeting ...
Query #2: COVID-19 vaccines designed to elicit neutralizing antibodies may sensitize vaccine recipients to severe diseases
Document #2: According to a study that examined how informed consent is given to COVID-19 vaccine trial participants, disclosure forms fail to inform volunteers that the vaccine might make them susceptible to more severe disease. The study, "Informed Consent Disclosure to Vaccine Trial Subjects of Risk of COVID-19 Vaccine ...

Method: Anchor-DR
Query #1: Vegetarian Society of Ireland
Document #1: The Vegetarian Society of Ireland is a registered charity. Our aim is to increase awareness of vegetarianism in relation to health, animal welfare and environmental perspectives. We support both vegetarian and vegan aims. Going vegetarian is one of the best things you can do for your health, for animals and for the planet ...

Query #2: How COVID19 Vaccine Can Destroy Your Immune System
Document #2: According to a study that examined how informed consent is given to COVID-19 vaccine trial participants, disclosure forms fail to inform volunteers that the vaccine might make them susceptible to more severe diseases...

Performance Analysis
Performance breakdown. The results in Table 2 show that Anchor-DR achieves strong performance on a majority of datasets but not on others. To analyze the effectiveness of Anchor-DR on different datasets, we categorize them into three subsets: (1) Search/QA, where the query is a question or keywords related to the document; (2) Context/Paraphrase, where the query and document contain coherent or overlapping information; and (3) Others. Figure 1(a) shows that Anchor-DR performs better on Search/QA datasets while co-doc is better on Context/Paraphrase datasets. The results are consistent with our hypothesis that the referral relation between anchor-document pairs is similar to the information-seeking relation between search queries and relevant documents.
We further quantitatively analyze the information pattern of query-document pairs captured by Anchor-DR and co-doc. Figure 1(b) shows the performance gap between Anchor-DR and co-doc versus the degree of information overlap between queries and documents in each test dataset, measured with Jaccard similarity. We observe that Anchor-DR performs much better on datasets where queries and documents contain less overlapping information. Datasets with high query-document similarity mainly emphasize paraphrasing and coherence, which are distinct from the relation between search queries and documents.

Case studies. Table 4 shows the contrastive pairs of Anchor-DR and co-doc, as well as the positive pairs in ArguAna and TREC-COVID, which represent the Context/Paraphrase and Search/QA datasets, respectively. The query-document pairs of ArguAna are arguments on the same topic, which are coherent and have similar formats. Similarly, the contrastive pairs of co-doc contain either coherent information (e.g., the claim and recent work of the vegetarian society) or repeated information (e.g., the COVID vaccine may cause diseases), which may explain its good performance on Context/Paraphrase datasets.
In contrast, in TREC-COVID, the answer to the query is contained in the document. As shown in Table 4, the anchor text in Anchor-DR can be the topic of the linked document or take the form of a question. In both examples, the anchor text can serve as a search query and the document provides the information the query seeks, which may explain why Anchor-DR achieves strong performance on the Search/QA datasets.
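The information-overlap measure used in Figure 1(b) can be sketched as token-level Jaccard similarity between a query and a document. The tokenization below (lowercased whitespace split) is an assumption for illustration, not necessarily the paper's exact setup:

```python
def jaccard(query, document):
    """Token-level Jaccard similarity: |Q ∩ D| / |Q ∪ D| over
    lowercased whitespace tokens, as a rough overlap measure."""
    q, d = set(query.lower().split()), set(document.lower().split())
    if not q | d:
        return 0.0  # both texts empty
    return len(q & d) / len(q | d)
```

A paraphrase-style pair yields a high score, while a natural-language question paired with its answer document, as in TREC-COVID, typically yields a low one.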

CONCLUSION
We train an unsupervised dense retrieval model, Anchor-DR, leveraging rich web anchors. In particular, we design a contrastive learning task, anchor-document prediction, to continuously pretrain Anchor-DR. Additionally, we apply predefined rules and train a query classifier to filter out uninformative anchors. Experiments on two public datasets, MSMARCO and BEIR, show that Anchor-DR significantly outperforms state-of-the-art dense retrievers on unsupervised retrieval. Our analyses further compare the patterns of information contained in our contrastive learning pairs and in the query-document pairs of the test datasets.

Table 1 :
The statistics of ClueWeb22 anchor training data.

Table 3 :
nDCG@10 of models trained with different contrastive tasks on the same subset of documents, with 400K documents and 400K contrastive pairs. A t-test shows that Anchor-DR outperforms co-doc on All Avg. with p-value < 0.05.