Constructing Tree-based Index for Efficient and Effective Dense Retrieval

Recent studies have shown that Dense Retrieval (DR) techniques can significantly improve the performance of first-stage retrieval in IR systems. Despite its empirical effectiveness, the application of DR is still limited. In contrast to statistical retrieval models that rely on highly efficient inverted indexes, DR models produce dense embeddings that are difficult to index with most existing search systems. To avoid the expensive cost of brute-force search, Approximate Nearest Neighbor (ANN) algorithms and the corresponding indexes are widely applied to speed up the inference process of DR models. Unfortunately, while ANN can improve the efficiency of DR models, it usually comes at a significant price in retrieval performance. To solve this issue, we propose JTR, which stands for Joint optimization of TRee-based index and query encoding. Specifically, we design a new unified contrastive learning loss to train the tree-based index and the query encoder in an end-to-end manner. A tree-based negative sampling strategy is applied to give the tree the maximum heap property, which supports the effectiveness of beam search well. Moreover, we treat cluster assignment as an optimization problem and update the tree-based index with an overlapped clustering scheme. We evaluate JTR on popular retrieval benchmarks. Experimental results show that JTR achieves better retrieval performance while retaining high system efficiency compared with widely adopted baselines. It provides a potential solution for balancing efficiency and effectiveness in neural retrieval system designs.


INTRODUCTION
Information Retrieval (IR) systems have become one of the most important tools for people to find useful information online. In practice, an industrial IR system usually needs to retrieve a small set of relevant documents from millions or even billions of documents. The computational cost of brute-force search is usually unacceptable due to system latency requirements. Therefore, indexes are often employed to achieve fast response times. One of the most well-known examples is the inverted index [24]. Unfortunately, such traditional indexing techniques cannot be applied to recently proposed Dense Retrieval (DR) models. As DR models have shown promising performance in multi-stage retrieval systems, especially in first-stage retrieval [16,17,20,29,33,39], how to develop effective and efficient indexing techniques for dense retrieval has become an important question for the research community.
Efficient DR solutions such as Approximate Nearest Neighbor (ANN) search algorithms have been widely applied in the first-stage retrieval process. Popular methods include tree-based indexes [2], locality sensitive hashing (LSH) [15], product quantization (PQ) [13,14], hierarchical navigable small-world networks (HNSW) [23], etc. Since tree-based indexes can achieve sub-linear time complexity by pruning low-quality candidates, they have drawn significant attention. Despite their success in improving retrieval efficiency, most existing tree-based indexes come with a degradation of retrieval performance. There are two main reasons: (1) The majority of existing indexes cannot benefit from supervised data because they use task-independent reconstruction error as the loss function. (2) The training objectives of the index structure and the encoder are inconsistent. In general, the optimization goal of the index structure is to minimize the approximation error, while the optimization goal of the encoder is to achieve better retrieval performance. This inconsistency may lead to sub-optimal results.
To balance the effectiveness and efficiency of tree-based indexes, we propose JTR, which stands for Joint optimization of TRee-based index and query encoding. To jointly optimize the index structure and the query encoder in an end-to-end manner, JTR drops the original "encoding-indexing" training paradigm and designs a unified contrastive learning loss. However, training tree-based indexes with a contrastive learning loss is non-trivial due to the problem of differentiability. To overcome this obstacle, the tree-based index is divided into two parts: cluster node embeddings and cluster assignment. The cluster node embeddings are small but critical and are differentiable, so we design tree-based negative sampling to optimize them. For the cluster assignment, an overlapped clustering method is applied to iteratively optimize it.
To verify the effectiveness of our method, we compare JTR with a wide range of ANN methods on two large publicly available datasets. The empirical results show that JTR achieves better performance than the baselines while keeping high efficiency. Through an ablation study, we confirm the effectiveness of the proposed strategies.
In short, our main contributions are as follows: (1) We propose a novel tree-based index for Dense Retrieval. Benefiting from the tree structure, it achieves sub-linear time complexity while still yielding promising results. (2) We propose a new joint optimization framework to learn both the tree-based index and the query encoder. To the best of our knowledge, JTR is the first joint-optimized retrieval approach with a tree-based index. This framework improves end-to-end retrieval performance through a unified contrastive learning loss and tree-based negative sampling. (3) We relax the constraint that documents are mutually exclusive between clusters and propose an efficient clustering solution in which a document can be assigned to multiple clusters to optimize the cluster assignment. We demonstrate that overlapped clustering is beneficial for improving retrieval performance.

RELATED WORK

Dual Encoders
With the development of Pre-trained Language Models (PLMs), Dense Retrieval (DR) has advanced rapidly in recent years. DR typically uses dual encoders to obtain embeddings of queries and documents and uses the inner product as the similarity measure. For a query-document pair <q, d>, the primary goal is to learn a representation function f(·) that maps text to high-dimensional embeddings. The general form of a dual encoder is s(q, d) = <f(q), f(d)>, where <·, ·> denotes the inner product and s(q, d) is the relevance score. To further improve the performance of dual encoders, many researchers have refined the loss function, the sampling method, and other components to make them more suitable for information retrieval tasks [3,8,12,18,19,34,35,37]. For example, Lee et al. [18] suggest random negative sampling in large batches, which increases the difficulty of training. Zhan et al. [37] propose a dynamic hard negative sampling method, which effectively improves the performance of DR models.

Approximate Nearest Neighbor Search
When queries and documents are encoded into embeddings, the retrieval problem can be viewed as a Nearest Neighbor (NN) search problem. The simplest method is brute-force search, which becomes impractical as the corpus size grows. Therefore, most studies use Approximate Nearest Neighbor (ANN) search. In general, there are four common families of methods: tree-based methods [2], hashing [15,28], quantization-based methods [13,14], and graph-based methods [23]. Tree-based and graph-based methods partition the embedding space so that a query embedding is compared only against similar regions. Hashing and quantization-based methods compress the vector representations in different ways. Both families are important and are often used in combination in practice.

Joint Optimization
Existing DR systems follow the "encoding-indexing" paradigm, and the inconsistent optimization objectives of the two steps during training lead to sub-optimal performance. Previous studies have shown that combining index construction with retrieval model training lets the index benefit directly from the annotation information and achieve better retrieval performance [9-11, 36, 38, 41]. Zhan et al. [36] explore joint optimization of query encoding and product quantization, achieving state-of-the-art results. Furthermore, RepCONC [38] treats quantization as a constrained clustering process, requiring vectors to be clustered around quantized centroids. These approaches achieve good vector compression, but their retrieval time complexity is still linear in the corpus size. Tree-based methods have shown promising performance in recommender systems [40-42]. For example, TDM [41] proposes a tree-based deep recommendation model, which can incorporate any advanced model for retrieval. However, these methods are designed for recommender systems and cannot be directly applied to retrieval systems. Recently, Tay et al. [30] regarded transformer memory as a Differentiable Search Index (DSI). DSI maps documents to document IDs during indexing and generates the corresponding docid for a query during retrieval, unifying the training and indexing processes. On this basis, Wang et al. [31] propose the Neural Corpus Indexer (NCI), which supports end-to-end document retrieval through a sequence-to-sequence network. These model-based indexes provide a new paradigm for learning end-to-end retrieval systems. However, they are not yet suitable for large-scale web search, and the index is difficult to update when documents are added or deleted. In contrast, this paper focuses on the joint optimization of the index and the encoders, which preserves the advantages of the original index structure.

THE JTR MODEL
In this section, we first introduce the preliminaries and the tree structure of JTR. Then, the end-to-end joint optimization process is described in detail. Finally, we show how to optimize the cluster assignment of documents. Table 1 lists the important notations used in this paper.

Preliminary
Dense Retrieval can be formalized as follows: given a query q and the corpus D = {d_1, d_2, ..., d_N}, the model needs to retrieve the top k most relevant documents from D for q. The training set is given in the form of {(q_i, d_i), ...}, which means that q_i and d_i are relevant. The tree-based index clusters all documents and prunes irrelevant cluster nodes at retrieval time to improve retrieval efficiency. As depicted in Figure 1, the components of JTR are as follows: • Deep Encoder Φ, which encodes the query q into a D-dimensional dense embedding. Following previous work, BERT [7] is employed as the encoder.
• Cluster Nodes, which consist of the root node, intermediate nodes, and leaf nodes. The leaf nodes correspond to fine-grained clusters, while the intermediate nodes correspond to coarse-grained clusters. All cluster nodes are represented as trainable dense embeddings e_{n_i} ∈ R^D, where n_i denotes the i-th cluster node. More specifically, given the dense embedding of a query Φ(q) ∈ R^D, the relevance score between the query and a cluster node is calculated as s_{n_i} = e_{n_i}^T · Φ(q) at each level of the tree. Based on these scores, the top n leaf nodes are returned and the documents within them are further ranked. The parameter n represents the beam size. Cluster node embeddings are crucial for the tree; we describe how to optimize them in the following section.
• Cluster Assignment, which represents the distribution of documents to leaf nodes. Assuming there are K leaf nodes, we use D_k = {d_i, ...} to denote the documents assigned to leaf node k. The initial cluster assignment is constructed by running the k-means algorithm on the document embeddings e_{d_i} ∈ R^D, where d_i denotes the i-th document. The score s_{d_i} = e_{d_i}^T · Φ(q) indicates the relevance of the query to the document. As mentioned above, we only calculate relevance scores for the documents in the top n leaf nodes. k-means is an unsupervised clustering method that divides documents into mutually exclusive clusters in which the documents are assumed to share the same semantics. However, this does not correspond to reality. Therefore, we relax this constraint and apply an overlapped clustering approach to optimize the cluster assignment. The details are described in Section 3.3.
In this paper, we define b as the branch balance factor, representing the number of child nodes of each non-leaf node, and c as the leaf balance factor, representing the maximum number of documents contained in one leaf node.
JTR builds the tree index by recursively applying the k-means algorithm. Specifically, given the corpus to be indexed, all documents are encoded into embeddings with the trained document encoder. Then, all embeddings are clustered into b clusters by the k-means algorithm. For each node that contains more than c documents, k-means is applied recursively until all nodes contain at most c documents. The embedding of each node is initialized as its cluster centroid embedding. Figure 2 illustrates the initialization of the tree structure. We can observe that the tree index has the following properties: (1) Each non-leaf node of the tree has b child nodes. The depth of the tree is influenced by the branch balance factor b and the leaf balance factor c. (2) The tree may be unbalanced. For a large corpus, this imbalance is usually insignificant and does not noticeably affect retrieval effectiveness.
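The recursive construction above can be sketched as follows. This is an illustrative reimplementation in plain NumPy; the minimal k-means routine, the dict-based node layout, and all names are our own assumptions, not the paper's code.

```python
import numpy as np

def kmeans(X, k, iters=10, seed=0):
    """Minimal k-means returning (centroids, labels); a stand-in for
    whatever clustering implementation is actually used."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

def build_tree(doc_ids, X, b, c, seed=0):
    """Recursively split documents into b clusters until every node holds
    at most c documents; each node embedding is its cluster centroid."""
    node = {"embedding": X[doc_ids].mean(axis=0), "docs": doc_ids, "children": []}
    if len(doc_ids) > c:
        _, labels = kmeans(X[doc_ids], b, seed=seed)
        if len(np.unique(labels)) > 1:  # guard against a degenerate split
            for j in np.unique(labels):
                sub = doc_ids[labels == j]
                node["children"].append(build_tree(sub, X, b, c, seed + 1))
    return node
```

The recursion naturally produces the possibly unbalanced tree described above: a subtree stops splitting as soon as its node holds at most c documents.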

End-to-End Optimization
In this section, we introduce the end-to-end joint optimization process. In Figure 3, we compare the workflow of JTR with that of existing works. As shown in Figure 3(a), existing methods follow a two-step "encoding-indexing" process. They first train the query encoder and document encoder with the ranking loss. After that, the well-trained document embeddings are used to train the tree-based index under the guidance of an MSE loss. The training process of the tree-based index is thus independent and cannot benefit from supervised data. In contrast, Figure 3(b) shows the optimization process of JTR. The joint optimization is mainly implemented by two strategies: a unified contrastive learning loss and tree-based negative sampling. Next, we describe our motivation and the specific design in detail.

Motivation.
For tree-based indexes, retrieval effectiveness and efficiency rely on two main components: pruning low-quality nodes and cluster assignment. To prune low-quality nodes, tree-based indexes usually use beam search [25] to efficiently reach the top n leaf nodes. Pruning a correct node at the upper levels leads to a serious accumulation of errors. From this perspective, we argue that tree-based indexes should have the maximum heap property, which supports the effectiveness of beam search well. More specifically, the maximum heap property can be expressed as follows:

p^(j)(n | q) = max_{n_c ∈ Children(n)} p^(j+1)(n_c | q) / α^(j),

where p^(j)(n | q) represents the relevance probability between query q and cluster node n at level j, and α^(j) is the normalization term of level j. This formula indicates that the relevance probability of a parent node relies on the maximum relevance probability of its children. In other words, given a specific query q, the parent of an optimal top node also belongs to the top nodes of the upper level. The maximum heap property is the basis of beam search [25].
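As a small illustration of why this property matters, the following toy sketch (with hypothetical scores and our own node layout, not the paper's training procedure) shows that when every internal node's score equals the maximum of its children's scores, even a greedy descent with beam size 1 reaches the globally best leaf:

```python
def annotate_max_heap(node, leaf_score):
    """Bottom-up pass giving each internal node the max score of its
    children, so the tree satisfies the maximum heap property."""
    if not node["children"]:
        node["score"] = leaf_score[node["id"]]
    else:
        for child in node["children"]:
            annotate_max_heap(child, leaf_score)
        node["score"] = max(child["score"] for child in node["children"])

def greedy_descent(node):
    """Follow the highest-scoring child at every level (beam size 1)."""
    while node["children"]:
        node = max(node["children"], key=lambda c: c["score"])
    return node["id"]
```

With larger beam sizes the same argument extends to the top-n leaves, which is exactly why the property "supports the effectiveness of beam search well."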
In practice, we do not need to know the exact relevance probabilities of the cluster nodes. The relevance rank order of the nodes at each level is sufficient for accurate beam search. Therefore, we design a new contrastive learning loss to optimize the cluster nodes. Moreover, a tree-based negative sampling technique is applied to improve the performance of JTR.

Unified Contrastive Learning Loss.
Existing DR models usually use a contrastive learning loss. Specifically, given a query q, let d+ be a relevant document and d− the negative documents. The loss function is formulated as follows:

L(q, d+, d−) = − log ( exp(s(q, d+)) / ( exp(s(q, d+)) + Σ_{d−} exp(s(q, d−)) ) ),

where s(q, d) is the relevance score. The purpose of the contrastive learning loss is to pull the query closer to relevant documents in the embedding space than to irrelevant ones. However, this loss function optimizes the embeddings of queries and documents and is not applicable to existing ANN methods. In JTR, we adapt this loss to optimize the tree-based index. Training the tree-based index with the contrastive learning loss is non-trivial because the cluster assignment is not differentiable. To solve this problem, we initialize the cluster assignment as in Figure 3(a) and only train the cluster node embeddings, which can benefit from supervised data directly.
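The loss above can be sketched numerically as follows; this is a NumPy illustration of the standard softmax form, which JTR applies to cluster node scores rather than document scores:

```python
import numpy as np

def contrastive_loss(s_pos, s_negs):
    """-log softmax of the positive score against the negative scores."""
    scores = np.concatenate([[s_pos], np.asarray(s_negs, dtype=float)])
    scores -= scores.max()  # subtract max for numerical stability
    return float(-np.log(np.exp(scores[0]) / np.exp(scores).sum()))
```

Raising the positive score (or lowering the negative scores) strictly decreases the loss, which is exactly the "pull the query closer to relevant items" behavior described above.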

Tree-based Negative Sampling.
To further improve the performance of JTR, we design a tree-based negative sampling strategy. As shown in Figure 4, given a specific query, leaf node number three is a positive sample since it contains the relevant document. As mentioned above, node number one is a positive sample at level 1 because it is the father node of the positive sample. Since the tree is constructed with the k-means algorithm, the brother nodes of each positive sample are considered to be closer to the positive sample in the embedding space. Hence, we select them as negative samples for training. As shown in Figure 3(b), the query embedding is fed into the "Search Negatives" module and the brother nodes of the positive sample are returned under the current parameters.
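A minimal sketch of this sampling rule (the dict-based node layout and the function name are our own assumptions; the real implementation works on the trained tree):

```python
def tree_negatives(path):
    """Given the root-to-positive-leaf path, return per level the brother
    nodes of the positive node, which serve as that level's negatives."""
    per_level = []
    for parent, positive in zip(path, path[1:]):
        per_level.append([c for c in parent["children"] if c is not positive])
    return per_level
```

Each entry of the returned list pairs with one level of the unified contrastive loss: the node on the path is that level's positive, and its brothers are the hard negatives.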

Optimized Overlapped Cluster
As mentioned above, cluster assignment plays an essential role in the tree-based index. However, k-means is an unsupervised clustering method that divides documents into mutually exclusive clusters, where the documents in each cluster are assumed to share the same semantics. In practice, we observe that the semantics of a document are complex and multi-topic. Restricting each document to a single cluster can limit the performance of the tree-based index. To solve this problem, we propose overlapped clusters to further optimize the cluster assignment.
Overlapped clustering has been studied in the field of unsupervised learning [4,21,32]. Inspired by Liu et al. [21], we formulate the cluster assignment problem in information retrieval as an optimization problem. Suppose there are L queries and N documents in the training corpus, and all documents have been clustered into the K leaf nodes of the tree index.
First, we define the ground truth matrix Y ∈ {0, 1}^{L×N}, where Y_{ij} = 1 if query i is relevant to document j and Y_{ij} = 0 otherwise. Y describes the relevance of the training queries and the documents, and it is the best result the model can hope to achieve. Then, we feed the training queries into the tree-based index to obtain the leaf nodes associated with them. We define the relationship between queries and leaf nodes as the matrix M ∈ {0, 1}^{L×K} and the original cluster assignment as the matrix C ∈ {0, 1}^{N×K}. It is worth noting that the initialized cluster assignment is obtained as in Figure 3(a).
Then we can represent the query-document relationship predicted by JTR as Ŷ = Binary(M × C^T), where Binary(·) is the element-wise indicator function (Binary(x)_{ij} = 1 if x_{ij} > 0, and 0 otherwise), which ensures Ŷ ∈ {0, 1}^{L×N}, and × stands for matrix multiplication. The performance of JTR can then be expressed through the intersection of Y and Ŷ, i.e., |Ŷ ∧ Y| / |Y|, where | · | returns the number of non-zero elements in a matrix. Since Ŷ and Y are binary matrices, we can obtain that this quantity is proportional to tr(Y^T × M × C^T), where tr(·) returns the trace of a matrix. That is, the performance of JTR is proportional to this trace. To maximize the recall rate, we therefore formulate the cluster assignment problem as: maximize tr(Y^T × M × C^T) over C ∈ {0, 1}^{N×K}, subject to each document being assigned to m leaf nodes, where m is the overlap number. This optimization problem is NP-complete. Following Liu et al. [21], we approximate the objective function with a continuous, ReLU-like function.
In the optimization function, the cluster information is hidden in the matrix C. Since tr(Y^T × M × C^T) is linear in C, the optimal solution of C is the projection of Y^T × M onto the constraint set. Therefore, the optimization problem has a closed-form solution: C* = Proj(Y^T × M), where Proj(·) projects each row (i.e., each document) of Y^T × M onto the constraint set by setting its m largest entries to 1 and the rest to 0. The proposed solution is based on the intuition that two documents are more likely to belong to the same cluster if they are relevant to the same query. It is worth noting that only the cluster assignment is changed in this process; the structure of the tree and the cluster node embeddings do not change.
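The closed-form update can be sketched with NumPy as below. Ties are broken by index order in this sketch; as noted in the Figure 5 caption, the actual Proj(·) prefers a document's original leaf on ties:

```python
import numpy as np

def overlapped_assignment(Y, M, m):
    """C* = Proj(Y^T @ M): for each document, keep the m leaf nodes with
    the largest affinity scores (row-wise top-m projection)."""
    S = Y.T @ M                              # (N, K) document-to-leaf affinity
    C = np.zeros_like(S, dtype=int)
    top = np.argsort(-S, axis=1)[:, :m]      # indices of the m best leaves
    np.put_along_axis(C, top, 1, axis=1)     # set those entries to 1
    return C
```

The intuition is visible in the affinity matrix S = Y^T M: a document scores highly for a leaf exactly when many of its relevant queries reach that leaf.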
After C* is determined, we re-optimize JTR as described in Section 3.2 to accommodate the new cluster assignment. Figure 5 shows the process of the optimized overlapped cluster. We acknowledge that there exist a few documents that are not relevant to any training query; these documents are retained in their original clusters.

Tree Retrieval
For a specific query, the tree retrieval process is described in Algorithm 1. At each level of the tree, we select the top nodes and their child nodes as candidates for the next level, saving retrieval time by pruning less relevant nodes. In the tree structure of JTR, the leaf nodes are not always at the same level. Therefore, a result set R is used to preserve the encountered leaf nodes. In line 5 of Algorithm 1, we select the top n − |R| nodes from the candidate set each time to create the next candidate set. Since leaf nodes are removed from the candidates and inserted into R in line 3, there are no leaf nodes among the expanded candidates.
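The retrieval loop can be sketched as follows, using a dict-based node layout with an `embedding` field. This is an illustrative reimplementation of the idea, not Algorithm 1 verbatim:

```python
import numpy as np

def retrieve(root, query_emb, n):
    """Beam search over the tree: leaves met along the way go into the
    result set R; only the top n - |R| inner nodes are expanded further."""
    R, frontier = [], [root]
    while frontier:
        children = [c for node in frontier for c in node["children"]]
        R += [c for c in children if not c["children"]]           # collect leaves
        inner = [c for c in children if c["children"]]
        inner.sort(key=lambda c: float(c["embedding"] @ query_emb), reverse=True)
        frontier = inner[: max(n - len(R), 0)]                    # top n - |R| inner nodes
    R.sort(key=lambda c: float(c["embedding"] @ query_emb), reverse=True)
    return R[:n]
```

Because leaves are moved into R as soon as they are encountered, the sketch handles the unbalanced trees described in Section 3.1, where leaves sit at different depths.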
In JTR, the retrieval process is hierarchical and top-down. If the tree has K leaf nodes and N documents, the beam size is n, and the branch balance factor is b, the time complexity of retrieving the leaf nodes is O(n · b · log K). For the retrieval within the set R, the maximum number of documents in R is n · c, where c is the leaf balance factor. Since c can be approximated as N/K, the time complexity of this part is O(n · N/K). In summary, the retrieval time complexity of JTR is O(n · b · log K + n · N/K). This is below linear time complexity, which greatly improves retrieval efficiency.
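To make the comparison concrete, here is a back-of-the-envelope calculation. The values of N, K, b, and n below are purely hypothetical, chosen only to illustrate the asymptotics, and are not the paper's settings:

```python
import math

# Hypothetical sizes, for illustration only.
N = 3_200_000   # documents in the corpus
K = 3_200       # leaf nodes
b = 10          # branch balance factor
n = 10          # beam size
c = N / K       # docs per leaf, approximated as N / K

tree_cost = n * b * math.log(K, b)   # node scoring along ~log_b K levels
leaf_cost = n * c                    # scoring documents in the top-n leaves
total = tree_cost + leaf_cost
print(f"~{total:.0f} score computations vs {N} for brute force")
```

Under these illustrative numbers, the tree traversal costs a few hundred score computations and the leaf scan about ten thousand, versus millions for brute-force search.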
Table 2: Results on the MS MARCO dataset. AQT stands for Average Query processing Time, measured by averaging over the queries of the MS MARCO Dev set with a single thread and a single batch on the CPU. */** denotes that JTR performs significantly better than the baseline at the p < 0.05 / 0.01 level using the two-tailed pairwise t-test. The best method in each column is marked in bold.

EXPERIMENT SETTINGS
In this section, we introduce our experimental settings, including datasets and metrics, baselines, and implementation details.

Datasets and Metrics
The experiments are conducted on MS MARCO [27], a large-scale ad-hoc retrieval benchmark. We evaluate on three query sets: MS MARCO Dev, TREC2019 DL [5], and TREC2020 DL [6]. MS MARCO Dev is extracted from Bing's search logs, and each query is marked as relevant to a few documents. TREC2019 DL and TREC2020 DL contain extensively annotated documents for each query, using four-level relevance annotations. We use R@100 to evaluate the recall performance of different methods; MRR@100 and NDCG@10 are applied to measure ranking performance.
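For reference, the reciprocal-rank component of MRR can be computed as below; this is a generic sketch (function name ours), and MRR@100 averages this value over all evaluation queries:

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=100):
    """1/rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0
```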

Baselines
We select several state-of-the-art ANN indexes as our baselines.
Here are more details: IVFFlat [24]: IVFFlat is the classic inverted file index. When applied to dense retrieval, IVF partitions the vector space into clusters, and only the top clusters closest to the query embedding are searched. We set the number of probed clusters to be the same as the beam size n.
PQ [14]: PQ is an ANN index based on product quantization. We set the number of embedding segments to 32 and the number of coding bits to 8.
IVFPQ [14]: IVFPQ is one of the fastest indexes available, combining an inverted index with product quantization. Its parameter settings are the same as those of IVFFlat and PQ.
JPQ [36]: JPQ jointly optimizes the query encoder and product quantization. We set the number of segments into which each embedding is divided to 24.
Annoy [2]: Annoy improves retrieval efficiency by building binary trees. We set the number of trees to 100.
FALCONN [28]: FALCONN is an optimized LSH method that supports hyperplane LSH and cross-polytope LSH. We use the parameter settings recommended in the corresponding paper.
FLANN [26]: FLANN is one of the most complete ANN open-source libraries, including linear, kd-tree, k-means tree, and other index methods. We use automatic tuning to obtain the best parameters.
IMI [1]: IMI is a multi-level inverted index that combines product quantization with an inverted index to deliver strong search performance. The number of bits is set to 12.
HNSW [23]: HNSW is a typical and widely used graph-based index. We set the number of links to 8 and ef-construction to 100.
For a fair comparison, all ANN indexes operate on the document embeddings produced by STAR [37]. IVFFlat, PQ, IVFPQ, HNSW, and IMI are implemented with Faiss. For Annoy, FALCONN, and FLANN, we use the official libraries.

Implementation Details
We implemented JTR using PyTorch and Transformers. Initial embeddings were obtained with STAR [37] for all documents. We also load the STAR checkpoint to warm up the query encoder. In our experiments, the default c is 1000 and the default n is 10. The dimension of the embeddings is 768. For the training setup, we use the AdamW [22] optimizer with a batch size of 32. The learning rate is set to 5 × 10^-6. For the overlapped cluster step, we use the top 100 documents retrieved by ADORE-STAR [37] to build the matrix Ȳ. All experiments are evaluated on a workstation with Xeon Gold 5218 CPUs and RTX 3090 GPUs.

Comparison with ANN Methods
The performance comparisons between JTR and the baselines are shown in Table 2. We derive the following observations from the experimental results:
• PQ and JPQ are ANN methods based on product quantization, whose time complexity is linear in the corpus size. As expected, their average query processing time is the longest among all baselines. IVFPQ is an inverted index based on product quantization; it improves retrieval efficiency but damages retrieval effectiveness.
• Benefiting from different index structures, the average query processing time of IVFPQ, FALCONN, FLANN, and HNSW is lower than that of JTR. However, they suffer a loss in effectiveness.
• Annoy and FLANN are existing tree-based indexes. Compared with them, JTR achieves the best results. To the best of our knowledge, JTR is the best tree-based index available, which shows the great potential of tree-based indexes for dense retrieval.
• Overall, JTR performs significantly better than the baseline methods on most measures. In terms of effectiveness, JTR achieves the best MRR/NDCG and the second-best recall (i.e., R@100) on all datasets. Compared to the baseline with the best recall (i.e., JPQ), JTR achieves a 3 to 5 times latency speedup. We conclude that JTR is highly competitive in effectiveness compared with other ANN indexes.
To further analyze the ability of different indexes to balance efficiency and effectiveness, we plot AQT-MRR curves with varying parameters. As shown in Figure 6, we have the following findings:
• IVFPQ and FALCONN are limited by resources, which makes it difficult for them to achieve high retrieval performance.
• When the requirement on recall is extremely high (e.g., larger than 0.8), brute-force search can be more efficient than any indexing technique, which makes brute-force-based algorithms like JPQ more efficient than JTR. After all, any indexing process adds overhead if we eventually need to check all documents in the corpus. However, such high recall requirements are not common in web search or similar retrieval tasks.
• To sum up, JTR has an outstanding effectiveness-efficiency trade-off and consistently achieves better results with short retrieval latency. JTR provides a potential solution for balancing efficiency and effectiveness in dense retrieval.

Hyperparameter Sensitivity
5.2.1 Hyperparameters for Tree Retrieval. In this section, we investigate the impact of the overlap number m and the beam size n. All experiments were conducted on the MS MARCO Doc Dev dataset.
We set m = 1, 2, 3, 4, 5 and n = 10, 20, 30, 40, 50 to observe the retrieval performance of JTR. Figures 7, 8, and 9 show how retrieval quality and retrieval efficiency change with these hyperparameters. We can make the following observations:
• Both retrieval quality and latency increase with m. When m increases, each leaf node contains more documents, so the number of candidate documents goes up, which leads to better results but higher latency.
• From Figure 9, we note that the gap between curves increases when n is larger, which shows that a large m could lead to larger computation costs on each leaf node.
• Overall, increasing m and n leads to better performance but longer latency. Therefore, we recommend moderate values of m and n to maximize the trade-off between effectiveness and efficiency.

5.2.2
Hyperparameter for Overlapped Cluster. In this section, we fix the model as ADORE-STAR and vary from 25 to 150 the number of documents recalled by the dense retrieval model to build Ȳ. Figure 10 presents the retrieval performance for each setting. Obviously, the more documents recalled, the more information contained in Ȳ, and the better the final performance. When the number exceeds 100, the gain becomes smaller. Therefore, we use the top 100 documents in our experiments, which provides sufficient information.

Ablation Study
In this section, we conduct ablation studies on JTR to explore the importance of its different components. We use IVFFlat as the baseline.
IVFFlat forms as many clusters as the tree has leaf nodes. The number of probed clusters of IVFFlat and the beam size of JTR are both set to 10. We use the following four model variants:
• Tree: The STAR model is used to obtain the document embeddings, and then we build the tree index with the MSE loss.
• +Joint Optimization: Joint optimization of the query encoder and the tree-based index using the unified contrastive learning loss.
• +Reorganize Clusters: Update the clustering of documents, i.e., the number of overlapped clusters m = 1.
• +Overlapped Clustering: Documents can appear in multiple cluster nodes. For convenience, the number of overlapped clusters m = 2.
Table 3 shows MRR@100 and R@100 on the development set of the MS MARCO Document Ranking task. As the results show, the tree structure significantly reduces retrieval latency. Joint Optimization, Reorganize Clusters, and Overlapped Clustering all improve the performance of JTR. Tree-based indexes benefit directly from supervised data through Joint Optimization. Reorganizing clusters makes the distribution of embeddings more reasonable. Overlapped clustering mines the different semantic facets of documents. These results demonstrate the effectiveness of our method.

CONCLUSION
To improve the efficiency of DR models while preserving their effectiveness, we propose JTR, which jointly optimizes the tree-based index and the query encoder in an end-to-end manner. To achieve this goal, we carefully design a unified contrastive learning loss and a tree-based negative sampling strategy. Through these strategies, the constructed index tree possesses the maximum heap property, which supports beam search well. Moreover, for the cluster assignment, which is not differentiable w.r.t. the contrastive learning loss, we introduce overlapped cluster optimization. We further conducted extensive experiments on popular retrieval benchmarks. Experimental results show that our approach achieves competitive results compared with widely adopted baselines, and ablation studies demonstrate the effectiveness of our strategies. Unfortunately, since different indexes have varying degrees of code optimization, we do not report memory cost in this paper. In the future, we will try to jointly optimize PQ and the tree-based index to achieve an "effectiveness-efficiency-memory" trade-off.

Figure 1 :
Figure 1: Illustration of the JTR tree structure. The integer represents the sequence number of the node. In this case, the tree has a depth of 3 and 4 leaf clusters, the branch balance factor b = 2, and the leaf balance factor c = 4. The beam size n is set to 2.

Figure 2 :
Figure 2: Initialization of the tree structure. The integer in a node indicates the number of documents the node contains. In this case, the total number of documents is 10, the branch balance factor b = 2, and the leaf balance factor c = 5. If a node contains more than c documents, k-means is performed on the document embeddings in the node until all nodes contain at most c documents. The embedding of each node is initialized as its cluster centroid embedding.

Figure 3 :
Figure 3: Comparison of the workflow of JTR and existing methods. Solid arrows indicate that the gradient propagates backward, while dashed arrows indicate that it does not.

Figure 4 :
Figure 4: The sampling process of JTR. The integer in each node is its sequence number. We select the sibling nodes of positive samples as negative samples.

We use this loss to optimize the tree-based index. Specifically, given a training pair (q, d⁺), the leaf node that contains d⁺ is the positive sample at the leaf level, and the leaf nodes not containing d⁺ are negative samples. On this basis, the ancestor nodes of the positive leaf are also treated as positive samples at the levels in which they are located. To make the tree-based index satisfy the maximum heap property, we optimize it with negative sampling at each level. Let e⁺ denote the node embedding of the positive sample at a level and e⁻ the node embeddings of its negative samples. The unified contrastive learning loss sums a contrastive term over all levels l:

ℒ = −∑_l log [ exp(e⁺_l · Φ(q)) / ( exp(e⁺_l · Φ(q)) + ∑_{e⁻_l} exp(e⁻_l · Φ(q)) ) ],

where Φ(·) is the query encoder. Binary(·) is the element-wise indicator function: an entry of its output is 1 when the corresponding input element is greater than 0, which ensures Ŷ ∈ {0, 1}^{L×N}; × stands for the matrix cross product.
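The per-level sampling and loss described above can be sketched as follows. The function names are illustrative, and the softmax-cross-entropy form (one positive node per level, its siblings as negatives) is an assumption consistent with the description, not the paper's exact code.

```python
import numpy as np

def level_contrastive_loss(q, pos_emb, neg_embs):
    """Contrastive loss at one tree level: the positive node competes
    against its sibling (negative) nodes for the query's probability mass."""
    logits = np.array([q @ pos_emb] + [q @ n for n in neg_embs])
    # softmax cross-entropy with the positive at index 0
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

def unified_loss(q, path, siblings_per_level):
    """Sum the per-level losses along the root-to-leaf path of the positive
    document. `path[i]` is the positive node embedding at level i and
    `siblings_per_level[i]` holds the embeddings of its negative siblings."""
    return sum(level_contrastive_loss(q, p, negs)
               for p, negs in zip(path, siblings_per_level))
```

Because every level contributes its own term, descending the tree greedily by node score agrees with the loss being minimized, which is what gives the trained tree its maximum-heap-like behavior under beam search.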

Figure 5 :
Figure 5: Illustration of the optimized overlapped cluster. In this case, there are 3 queries, 4 leaf nodes, and 5 documents, indexed by i for the i-th query, leaf node, and document respectively. We set the number of overlapped clusters to 2. The values in the red boxes are identified by the Proj(·) function. In practice, if a document has the same value for two leaves in C*, the Proj(·) function prefers to keep the document in its original leaf.

The performance of JTR can then be expressed as the intersection of Y and Ŷ, i.e., the number of entries on which the two binary matrices agree: ∑_{i,j} Y_{i,j} · Ŷ_{i,j}.
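The projection step can be sketched as a per-document top-c selection with the original-leaf tie-break described in the caption. The score matrix layout, the `overlapped_assign` name, and the epsilon-based tie-break are illustrative assumptions.

```python
import numpy as np

def overlapped_assign(scores, original_leaf, c=2):
    """Assign each document (a column of `scores`, shape [n_leaves, n_docs])
    to its top-c leaves; ties are broken in favour of the original leaf."""
    assign = np.zeros(scores.shape, dtype=int)
    for d in range(scores.shape[1]):
        col = scores[:, d].astype(float).copy()
        col[original_leaf[d]] += 1e-9       # tie-break: prefer the current leaf
        top = np.argsort(-col)[:c]          # the c best-scoring leaves
        assign[top, d] = 1
    return assign
```

Every column of the returned matrix has exactly c ones, so each document appears in c leaves, matching the overlapped-clustering setting of Figure 5.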

Figure 6 :
Figure 6: Trade-off curves for different ANN methods. AQT stands for Average Query processing Time. Bottom right is better.

Figure 9 :
Figure 9: AQT versus two hyperparameters, one ranging from 1 to 5 and the other from 10 to 50. AQT stands for Average Query processing Time.

Table 1 :
Important notations used in this paper.

Algorithm 1 :
Tree Retrieval algorithm. Input: the trained tree T, query q, beam size b, query encoder Φ(·). Output: k approximate nearest candidates.
1. Initialize the result set R = ∅ and the candidate set C = {the root node}.
2. Remove all leaf nodes from C and insert them into R.
3. Compute the score e_n · Φ(q) for each remaining node n ∈ C, where e_n is the embedding of node n.
4. According to the scores, select the top (b − |R|) nodes in C to form the set C′; C′ contains no leaf nodes.
5. Replace C with the children of the nodes in C′ and repeat from step 2 until C is empty.
6. Compute e_d · Φ(q) for every document d contained in the nodes of R to obtain the top-k candidates.
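The retrieval procedure above can be sketched as the following beam search. The dict-based node layout and the `beam_search` name are illustrative assumptions, not the paper's code; leaves hold (doc_id, embedding) pairs.

```python
import numpy as np

def beam_search(tree, q_emb, beam=2, k=3):
    """Beam search over a tree index whose nodes are dicts with keys
    "emb" (node embedding), "docs" (None for interior nodes, a list of
    (doc_id, doc_emb) pairs for leaves), and "children"."""
    result, frontier = [], [tree]
    while frontier:
        # leaves reached during the descent are set aside in the result set
        result.extend(n for n in frontier if n["docs"] is not None)
        interior = [n for n in frontier if n["docs"] is None]
        if not interior:
            break
        # interior nodes compete for the remaining (beam - |result|) slots
        scores = [n["emb"] @ q_emb for n in interior]
        order = np.argsort(scores)[::-1][:max(beam - len(result), 0)]
        frontier = [c for i in order for c in interior[i]["children"]]
    # score every document held by the collected leaves; return top-k ids
    docs = [(d_id, d_emb @ q_emb) for n in result for d_id, d_emb in n["docs"]]
    docs.sort(key=lambda t: -t[1])
    return [d_id for d_id, _ in docs[:k]]
```

Only the documents inside the surviving leaves are scored exactly, which is what lets the tree trade a small amount of recall for a large reduction in query time relative to brute-force search.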
The Top(·) operator selects the top entries of each row of the matrix. In Information Retrieval there is a data sparsity problem: only a small set of documents have corresponding training queries, so many rows of Yᵀ have all entries equal to zero. To solve this problem, we use the trained DR model to retrieve the top-k documents for each training query and construct a matrix Ȳ ∈ {0, 1}^{L×N}: if a document is among the top-k documents recalled for a query by the DR model, the corresponding entry of Ȳ is 1. Hence, every row of Ȳ has k nonzero entries. In short, the final solution is obtained by substituting Ȳ for Y in the objective above.
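The construction of Ȳ can be sketched as follows; the `build_Ybar` name is illustrative, and for simplicity rows index queries and columns index documents in this sketch.

```python
import numpy as np

def build_Ybar(query_embs, doc_embs, k=2):
    """Densify sparse supervision: each row marks the top-k documents that the
    trained DR model retrieves for one training query, so every row of the
    returned {0,1} matrix has exactly k nonzero entries."""
    scores = query_embs @ doc_embs.T          # brute-force DR relevance scores
    Ybar = np.zeros(scores.shape, dtype=int)
    for i, row in enumerate(scores):
        Ybar[i, np.argsort(-row)[:k]] = 1     # mark the top-k documents
    return Ybar
```

Because every query now contributes exactly k positive entries, no row of the supervision signal is all-zero, which is the property the cluster-assignment optimization relies on.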

Table 3 :
Ablation study on the MS MARCO Dev Doc dataset.
• The slopes of the MRR and R curves decrease as the number of overlapped clusters increases, while the slope of the AQT curve remains unchanged; that is, the gain from overlapped clustering gets smaller.
• When we fix one hyperparameter, the gains in MRR and R decrease as the other increases. The reason may be that the most relevant documents have already been clustered into the top leaf nodes.