FINGER: Fast Inference for Graph-based Approximate Nearest Neighbor Search

Approximate K-Nearest Neighbor Search (AKNNS) has become ubiquitous in modern applications, for example as a fast search procedure with two-tower deep learning models. Graph-based methods for AKNNS in particular have received great attention due to their superior performance. These methods rely on greedy graph search to traverse the data points, stored as embedding vectors in a database. Under this greedy search scheme, we make a key observation: many distance computations do not influence search updates, so these computations can be approximated without hurting performance. As a result, we propose FINGER, a fast inference method for efficient graph search in AKNNS. FINGER approximates the distance function by estimating angles between neighboring residual vectors. The approximated distance can be used to bypass unnecessary computations for faster searches. Empirically, when it comes to speeding up the inference of HNSW, which is one of the most popular graph-based AKNNS methods, FINGER significantly outperforms existing acceleration approaches and conventional libraries by 20% to 60% across different benchmark datasets.


Introduction
K-Nearest Neighbor Search (KNNS) is a fundamental problem in machine learning [5], and is used in various applications in computer vision, natural language processing and data mining [8,37,35]. Further, most neural embedding-based retrieval and recommendation algorithms require KNNS in the inference phase to find the items nearest to a given query [44]. Formally, consider a dataset $D$ with $n$ data points $\{d_1, d_2, \dots, d_n\}$, where each data point has $m$-dimensional features. Given a query $q \in \mathbb{R}^m$, KNNS algorithms return the $K$ closest points in $D$ under a certain distance measure (e.g., L2 distance $\|\cdot\|_2$). Despite its simplicity, the cost of finding exact nearest neighbors is linear in the size of the dataset, which can be prohibitive for massive datasets in real-time applications. It is almost impossible to obtain exact K-nearest neighbors without a linear scan of the whole dataset due to a well-known phenomenon called the curse of dimensionality [23]. Thus, in practice, exact KNNS becomes time-consuming or even infeasible for large-scale data. To overcome this problem, researchers resort to Approximate K-Nearest Neighbor Search (AKNNS). An AKNNS method proposes a set of $K$ candidate neighbors $T = \{t_1, \dots, t_K\}$ to approximate the exact answer. The performance of AKNNS is usually measured by recall@K, defined as $\frac{|T \cap A|}{K}$, where $A$ is the set of ground-truth K-nearest neighbors of $q$ in the dataset $D$.
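As a small illustration of the recall@K definition above, the following sketch computes $|T \cap A| / K$ for one query. The function name `recall_at_k` and the toy identifiers are assumptions made for this example only.

```python
def recall_at_k(candidates, ground_truth, k):
    """Fraction of the true K nearest neighbors present among the K returned candidates."""
    returned = set(candidates[:k])      # T: the K candidates proposed by the AKNNS method
    truth = set(ground_truth[:k])       # A: the exact K nearest neighbors
    return len(returned & truth) / k

# Toy example with K = 4: three of the four true neighbors were retrieved.
example_recall = recall_at_k([3, 7, 1, 9], [7, 3, 4, 1], k=4)
```

In a benchmark, this value would typically be averaged over all queries.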
Most AKNNS methods try to minimize the search time by leveraging pre-computed data structures while maintaining high recall [24]. There is a large body of AKNNS literature [6,40,35]; most of the efficient AKNNS methods fall into three categories: quantization methods, space partitioning methods and graph-based methods. In particular, graph-based methods receive extensive attention from researchers due to their competitive performance. Many papers have reported that graph-based methods are among the most competitive AKNNS methods on various benchmark datasets [6,40,2,15].
Figure 1: Comparison of state-of-the-art graph-based libraries on three benchmark datasets. Throughput versus recall@10 curve is used as the metric, where a larger area under the curve corresponds to a better method. We can observe that no single method outperforms the rest on all datasets.
Graph-based methods work by constructing an underlying search graph where each node corresponds to a data point in $D$. Given a query $q$ and a current search node $c$, at each step an algorithm only calculates distances between $q$ and the neighboring nodes of $c$. Once the local search of $c$ is completed, the current search node is replaced with the unexplored node closest to $q$ among all unexplored nodes. Thus, the edge selection for each data point plays an important role in graph-based methods, as it controls the complexity of the search space. Consequently, most recent research focuses on how to construct different search graphs or design heuristics to prune edges in a graph to achieve efficient searches [24,15,39,32]. Although different methods have their own advantages, there is no clear winner among these graph construction approaches on all datasets. Following a recent systematic evaluation protocol [2], we evaluate performance by comparing throughput versus recall@10 curves, where a larger area under the curve corresponds to a better method. As shown in Figure 1, many graph-based methods achieve similar performance on three benchmark datasets. One method (e.g., PyNNDescent [11]) can be competitive on one dataset (e.g., GIST-1M-960) while another method (e.g., HNSW [32]) performs better on another (e.g., DEEP-10M-96). These results suggest there might not be a single graph construction method that works best, which motivates the research question: other than improving the underlying search graph, is there any other strategy to improve the search efficiency of all graph-based methods?
In this paper, instead of proposing yet another graph construction method, we show that for a given graph, part of the computation in the inference phase can be substantially reduced. Specifically, we observe that after a few node updates, most of the distance computations do not influence the search update. This suggests the complexity of distance calculation during the intermediate stage can be reduced without hurting performance. Based on this observation, we propose FINGER, Fast INference for Graph-based approximated nearest neighbor sEaRch, which reduces the computational cost of a graph search while maintaining high recall. Our contributions are summarized as follows:
• We provide an empirical observation that most of the distance computations in the prevalent best-first-search graph search scheme do not affect final search results. Thus, we can reduce the computational complexity of many distance computations.
• Leveraging this characteristic, we propose an approximated distance based on modeling angles between neighboring vectors using low-rank bases. In addition, angles of neighboring vectors in a graph tend to follow a Gaussian distribution, and we propose a distribution matching scheme to achieve a better distance approximation.
• We provide an open-source, efficient C++ implementation of the proposed algorithm FINGER on top of the popular HNSW graph-based method. HNSW-FINGER outperforms many popular graph-based AKNNS algorithms in wall-clock time across various benchmark datasets by 20%-60%.

Related Work
There are three major directions in developing efficient Approximate K-Nearest Neighbor Search (AKNNS) methods. The first direction still traverses all elements in a database but reduces the complexity of each distance calculation; quantization methods represent this direction. The second direction partitions the search space into regions and only searches data points falling into matched regions. This includes tree-based methods [38] and hashing-based methods [7]. The third direction is graph-based methods, which construct a search graph and convert the search into a graph traversal.
Quantization Methods compress data points and represent them as short codes. Compressed representations consume less storage and thus achieve more efficient memory bandwidth usage [19]. In addition, the complexity of distance computations can be reduced by computing approximate distances with pre-computed lookup tables. Quantization can be done by random projections [31], or learned by exploiting structure in the data distribution [33,36]. In particular, the seminal Product Quantization method [25] splits the data feature space into chunks and constructs a quantization codebook for each chunk. Product Quantization has become the cornerstone of most recent quantization methods [19,34,42,12]. There is also work focusing on learning transformations in accordance with product quantization [17]. Most recent quantization methods achieve competitive results on various benchmarks [19,27].
Space Partition Methods include hashing-based and tree-based methods.
Hashing-based Methods generate low-bit codes for high-dimensional data and try to preserve similarity under the original distance measure. Locality sensitive hashing [18] is a representative framework that enables users to design a set of hashing functions. Some data-dependent hashing functions have also been designed [41,22]. Nevertheless, a recent review [6] reported that the simplest random-projection hashing [7] actually achieves the best performance. According to this review, the advantage of hashing-based methods is simplicity and low memory usage; however, they are significantly outperformed by graph-based methods.
Tree-based Methods learn a recursive space partition function as a tree following some criteria. When a new query arrives, the learned partition tree is applied to the query and the distance computation is performed only on relevant elements falling in the same sub-tree. Representative methods are KD-tree [38] and R*-tree [4]. Previous studies observed that tree-based methods only work for low-dimensional data and that their performance drops significantly for high-dimensional problems [6].
Graph-based Methods date back to theoretical work in graph theory [3,29,10]. However, these theoretical guarantees only hold for low-dimensional data [3,29] or require expensive ($O(n^2)$ or higher) index building complexity [10], which does not scale to large datasets. Recent works are mostly geared toward approximations of different proximity graph structures to improve nearest neighbor search. There is a series of works on approximating K-nearest-neighbor graphs [21,20,14,26]. Most recent works approximate the monotonic graph [16] or the relative neighbor graph [1,32]. In essence, these methods first construct an approximated K-nearest-neighbor graph and prune redundant edges by different criteria inspired by different proximity graph structures. Some other works mix the above criteria with other heuristics to prune the graph [15,24]. Some pruning strategies can even work on randomly initialized dense graphs [24]. According to various empirical studies [21,6,2], graph-based methods achieve very competitive performance among all AKNNS methods. Despite concerns about the scalability of graph-based methods due to their larger memory usage [12], it has been shown that graph-based methods can be deployed in billion-scale commercial usage [16]. In addition, recent studies also demonstrated that graph-based AKNNS can scale quite well on billion-scale benchmarks when implemented on SSDs [24,9]. In this work, we aim at demonstrating a generic method to accelerate the inference speed of graph-based methods, so we mainly focus on in-memory scenarios.

Observation: Most distance computations do not contribute to better search results
Once a search graph is built, graph-based methods use a greedy-search strategy (Algorithm 1) to find relevant elements of a query in a database. It maintains two priority queues: a candidate queue that stores potential candidates to expand, and a top results queue that stores the current most similar candidates (line 1). At each iteration, it finds the current nearest point in the candidate queue and explores its neighboring points. An upper-bound variable records the distance from the query $q$ to the furthest element in the current top results queue (line 4). The search stops when the current nearest distance from the candidate queue is larger than the upper-bound (line 5), or when there is no element left in the candidate queue (line 2). The upper-bound not only controls termination of the search but also determines whether a point will be present in the candidate queue (line 11). An explored point is not added to the candidate queue if its distance to the query is larger than the upper-bound. Thus, the upper-bound plays an important role: we spend computational resources on a distance calculation (the dist function in line 11), but the result might not influence the search if the distance is larger than the upper-bound. Empirically, as shown in Figure 2, we observe on two benchmark datasets that most of the explorations end up having a larger distance than the upper-bound. In particular, starting from the mid-phase of a search, over 80% of the computed distances are larger than the upper-bound. Greedy graph search will therefore inevitably waste a significant amount of computing time on non-influential operations. [30] also observed this phenomenon and proposed to learn an early termination criterion with an ML model. Instead of only focusing on the near-termination phase, we propose a more general framework that incorporates the idea of reducing the complexity of distance calculations into a graph search. The fact that most distance computations do not influence search results suggests that we don't need exact distance computations. A faster distance approximation can be applied in the search.
Algorithm 1 (excerpt): ub ← distance of the furthest element from T to q (i.e., update ub); 17 return T
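The greedy search described in this section can be sketched as follows. This is a minimal illustration written from the prose description, not the paper's Algorithm 1 verbatim: the adjacency-list `graph`, the queue size `ef`, and the `wasted` counter (distance calls that exceed the upper-bound and therefore change nothing) are assumptions introduced for the example.

```python
import heapq
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_search(graph, points, query, entry, ef):
    """Best-first graph search; returns (sorted results, count of non-influential distance calls)."""
    d0 = l2(points[entry], query)
    candidates = [(d0, entry)]   # min-heap: nearest candidate is expanded first
    top = [(-d0, entry)]         # max-heap via negated distance: furthest result on top
    visited = {entry}
    wasted = 0
    while candidates:
        dist_c, c = heapq.heappop(candidates)
        ub = -top[0][0]          # upper-bound: distance of the furthest current result
        if dist_c > ub:
            break                # nearest candidate cannot improve the results
        for nb in graph[c]:
            if nb in visited:
                continue
            visited.add(nb)
            d = l2(points[nb], query)   # full distance computation
            if len(top) < ef or d < ub:
                heapq.heappush(candidates, (d, nb))
                heapq.heappush(top, (-d, nb))
                if len(top) > ef:
                    heapq.heappop(top)  # drop the furthest result
                ub = -top[0][0]
            else:
                wasted += 1             # computed, but did not influence the search
    return sorted((-nd, i) for nd, i in top), wasted

# Toy example: five points on a line connected as a chain.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
results, wasted = greedy_search(graph, points, (3.2,), entry=0, ef=2)
```

On realistic graphs the `wasted` counter is the quantity of interest here: it counts exactly the distance evaluations that an approximation could replace.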

Modeling Distribution of Neighboring Residual Angles
Given a query $q$ and the current nearest point $c$ to $q$ in the candidate queue, line 7 of Algorithm 1 expands the search by exploring the neighbors of $c$. Consider a specific neighbor $d$ of $c$: we have to compute the distance between $q$ and $d$ in order to update the search results. Here, we focus on the L2 distance (i.e., $\mathrm{Dist} = \|q - d\|_2$). The derivations for inner-product and angle distance are provided in Supplementary A. Since, as shown in the previous section, most distance computations do not contribute to the search in later stages, we aim at finding a fast approximation of the L2 distance. A key idea is that we can leverage $c$ to represent $q$ (and $d$) as a vector along $c$ (i.e., projection) and a vector orthogonal to $c$ (i.e., residual):
$$q_{proj} = \frac{c^T q}{c^T c}\, c, \qquad q_{res} = q - q_{proj}. \tag{1}$$
In other words, we treat each center node as a basis and project the query and its neighboring points onto the center vector, so query and data can be written as $q = q_{proj} + q_{res}$ and $d = d_{proj} + d_{res}$ respectively. With this formulation, the squared L2 distance can be written as:
$$\mathrm{Dist}^2 = \|q - d\|_2^2 = \|(q_{proj} - d_{proj}) + (q_{res} - d_{res})\|_2^2 \overset{(a)}{=} \|q_{proj} - d_{proj}\|_2^2 + \|q_{res} - d_{res}\|_2^2 = \|q_{proj} - d_{proj}\|_2^2 + \|q_{res}\|_2^2 + \|d_{res}\|_2^2 - 2\, q_{res}^T d_{res}, \tag{2}$$
where (a) comes from the fact that projection vectors are orthogonal to residual vectors, so the inner product vanishes. For $d_{proj}$ and $d_{res}$, we can pre-calculate these values after the search graph is constructed. For $q_{proj}$, notice that the center node $c$ is extracted from the candidate queue (line 3 of Algorithm 1). That means we must have already visited $c$ before. Thus, $\|q - c\|_2^2$ has been calculated and we can get $q^T c$ by a simple algebraic manipulation:
$$q^T c = \frac{\|q\|_2^2 + \|c\|_2^2 - \|q - c\|_2^2}{2}.$$
Since the calculation of $\|q\|_2^2$ is a one-time task for a query, it is not too costly when a dataset is moderately large. $\|c\|_2^2$ can again be pre-computed, so $q^T c$ and thus $q_{proj}$ can be obtained in just a few arithmetic operations. Also notice that $\|q\|_2^2 = \|q_{proj}\|_2^2 + \|q_{res}\|_2^2$ as $q_{proj}$ and $q_{res}$ are orthogonal, so we can get $\|q_{res}\|_2^2$ by calculating $\|q\|_2^2 - \|q_{proj}\|_2^2$ in a few operations too. Therefore, the only uncertain term in Eq. (2) is $q_{res}^T d_{res}$. If we can estimate this term with fewer computational resources, we can obtain a fast yet accurate approximation of the L2 distance.
Since we don't have direct access to the distribution of $q$ and thus of $q_{res}$, we hypothesize that we can instead use the distribution of residual vectors between neighbors of $c$ to approximate the distribution of the $q_{res}^T d_{res}$ term. The rationale is that we only approximate $q_{res}^T d_{res}$ when $q$ and $c$ are close enough (i.e., $c$ is selected in line 3 of Algorithm 1), so both $q$ and $d$ can be treated as nearby points in our search graph, and the interaction between $q_{res}$ and $d_{res}$ might be well approximated by $d'^T_{res} d_{res}$, where $d'$ is another neighboring point of $c$ and $d'_{res}$ is its residual vector. Empirically, as shown in the left column of Figure 3, angles between residual vectors of sampled neighbors (i.e., $d, d' \in \mathrm{neighbor}(c)$) are distributed approximately as a Gaussian. In particular, compared to the distribution of the direct inner product $d'^T_{res} d_{res}$ (right column of Figure 3), the distribution of $\cos(d_{res}, d'_{res})$ is less skewed and thus closer to a Gaussian. This motivates us to design an efficient approximator of $\cos(q_{res}, d_{res})$ and obtain $q_{res}^T d_{res}$ as $\|q_{res}\|_2 \|d_{res}\|_2 \cos(q_{res}, d_{res})$.
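The projection/residual decomposition used in this section can be checked numerically. The sketch below assumes the standard orthogonal projection onto $c$, i.e. $q_{proj} = \frac{c^T q}{c^T c} c$ and $q_{res} = q - q_{proj}$; the helper names and the toy vectors are made up for illustration. It verifies that the squared L2 distance splits into a projection part, two residual norms, and the single cross term $q_{res}^T d_{res}$, and that $q^T c$ can be recovered from the already-known $\|q - c\|_2^2$.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def decompose(v, c):
    """Split v into its projection onto c and the residual orthogonal to c."""
    scale = dot(v, c) / dot(c, c)
    proj = [scale * x for x in c]
    res = [x - p for x, p in zip(v, proj)]
    return proj, res

# Toy query q, center node c, and neighbor d (hypothetical values).
q = [1.0, 2.0, 0.5]
c = [0.8, 1.5, 1.0]
d = [0.6, 1.9, 1.4]

q_proj, q_res = decompose(q, c)
d_proj, d_res = decompose(d, c)

# Exact squared L2 distance versus the decomposed form.
exact = sum((x - y) ** 2 for x, y in zip(q, d))
diff_proj = [a - b for a, b in zip(q_proj, d_proj)]
decomposed = (dot(diff_proj, diff_proj) + dot(q_res, q_res)
              + dot(d_res, d_res) - 2.0 * dot(q_res, d_res))

# q^T c recovered from the squared distance computed when c was visited.
dist_qc_sq = sum((x - y) ** 2 for x, y in zip(q, c))
q_dot_c = (dot(q, q) + dot(c, c) - dist_qc_sq) / 2.0
```

Only `dot(q_res, d_res)` requires a full $m$-dimensional operation at query time under this scheme; every other term is either precomputed per edge or derived from scalars already available.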

FINGER: Fast Inference by Low-rank Angle Estimation and Distribution Matching
Low-rank Estimation Motivated by the above derivations, we aim at finding an efficient estimation of angles between all pairs of neighboring residual vectors. In the AKNNS literature, a popular method for estimating this is Locality Sensitive Hashing (LSH) and its variants. In particular, Random Projection-based LSH (RPLSH) [7] is reported to achieve good average performance on various benchmark datasets [6]. RPLSH samples $r$ random vectors from a normal distribution to form a projection matrix $P \in \mathbb{R}^{r \times m}$, where $m$ is the dimension of data and query. The L2 distance between two vectors $x, y \in \mathbb{R}^m$ can be approximated by the distance in the projected space, and the error is bounded probabilistically [28]. We can further binarize the projection results to form a compact representation, and the angle between $x$ and $y$ can be approximated by the Hamming distance of the signed results: $\frac{\pi}{r}\,\mathrm{hamm}(\mathrm{sgn}(Px), \mathrm{sgn}(Py))$. However, there is an immediate disadvantage with this approach: random projection only guarantees worst-case performance [13] and is oblivious to the data distribution. Since we can sample abundant neighboring residual vectors from the training database, we can leverage the data information to obtain a better approximation. Formally, given an existing search graph $G = (D, E)$, where $D$ are nodes in the graph corresponding to data points and $E$ are edges connecting data points, we collect all residual vectors into $D_{res} \in \mathbb{R}^{m \times N}$, where $N$ is the total number of edges in $G$ (i.e., $|E|$), and we assume $D_{res}$ spans the whole space in which residual vectors lie. The approximation problem can be formulated as an optimization problem (Eq. 3) in which we aim at finding an optimal $P$ that minimizes the approximation error over the residual pairs in $D_{res}$ from training data. It is not hard to see that the Singular Value Decomposition (SVD) of $D_{res}$ provides an answer to this optimization problem, and thus we can use SVD to find a better $r$-dimensional subspace for estimating the angle of neighboring residual vectors.
Proposition 3.1. Given a residual vector matrix $D_{res} \in \mathbb{R}^{m \times N}$, denote by $D_{res} = U S V^T$ the Singular Value Decomposition of $D_{res}$. Then $U_{1:r}$, the first $r$ columns of $U$, is an optimal solution of the optimization problem in Eq. (3).
Proof. The proof is provided in Supplementary B.
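As a point of reference for the RPLSH baseline discussed in this section, the sign-based angle estimate $\frac{\pi}{r}\,\mathrm{hamm}(\mathrm{sgn}(Px), \mathrm{sgn}(Py))$ can be sketched with the standard library alone. The function name `rplsh_angle`, the fixed seed, and the toy vectors are assumptions for illustration; a practical implementation would sample $P$ once and store the sign bits compactly rather than redrawing rows per call.

```python
import math
import random

def rplsh_angle(x, y, r, seed=0):
    """Estimate the angle between x and y from r random Gaussian sign projections."""
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(r):
        # One row of the random projection matrix P.
        p = [rng.gauss(0.0, 1.0) for _ in x]
        sign_x = sum(pi * xi for pi, xi in zip(p, x)) >= 0
        sign_y = sum(pi * yi for pi, yi in zip(p, y)) >= 0
        mismatches += sign_x != sign_y   # contributes to the Hamming distance
    return math.pi * mismatches / r

x = [1.0, 0.0, 0.0]
y = [1.0, 1.0, 0.0]          # true angle between x and y is pi / 4
estimate = rplsh_angle(x, y, r=4000)
```

The estimate is unbiased because a random hyperplane separates two vectors with probability equal to their angle divided by $\pi$, but the rows of $P$ ignore the data distribution, which is the limitation the SVD-based projection targets.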
Distribution Matching In addition to efficient low-rank estimation of angles, we further propose a distribution matching method to improve the performance. Although, as discussed in Section 3.2, angles between neighboring residual vectors tend to follow a Gaussian distribution, this property only partially transfers to the distribution of angles approximated by low-rank computations, as shown in Figure 4. Although the approximated distribution still looks Gaussian, it is slightly skewed. Furthermore, its mean is shifted and its variance is larger than that of the real data distribution. To mitigate this, we propose to transform the approximated distribution into the real data distribution by matching their mean and variance. Formally, assume angles of neighboring residual vectors follow a Gaussian distribution $\mathcal{N}(\mu, \sigma)$, and the approximated angles are distributed as $\mathcal{N}(\hat{\mu}, \hat{\sigma})$. Given a residual pair $x$ and $y$ and a low-rank projection matrix $P$, we can calculate the approximated angle $\hat{t} = \cos(Px, Py)$. Under our assumption that it is a draw from $\mathcal{N}(\hat{\mu}, \hat{\sigma})$, the value can be transformed by $t = (\hat{t} - \hat{\mu})\frac{\sigma}{\hat{\sigma}} + \mu$. The transformed angle estimate $t$ then follows $\mathcal{N}(\mu, \sigma)$ as desired. The parameters $\mu, \sigma, \hat{\mu}, \hat{\sigma}$ can be estimated from training data.
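The mean-and-variance matching step can be illustrated with a few lines of Python. This sketch assumes the parameters are estimated as plain sample means and (population) standard deviations over paired true and approximated cosine values; the function names and the toy numbers are hypothetical.

```python
import statistics

def fit_matching(true_vals, approx_vals):
    """Estimate (mu, sigma) of the true angles and (mu_hat, sigma_hat) of the approximations."""
    mu = statistics.mean(true_vals)
    sigma = statistics.pstdev(true_vals)
    mu_hat = statistics.mean(approx_vals)
    sigma_hat = statistics.pstdev(approx_vals)
    return mu, sigma, mu_hat, sigma_hat

def match(t_hat, mu, sigma, mu_hat, sigma_hat):
    """Map an approximated value onto the true distribution: standardize, then rescale."""
    return (t_hat - mu_hat) * sigma / sigma_hat + mu

# Toy data: the approximations are a shifted and widened copy of the true values.
true_cos = [0.10, 0.20, 0.30, 0.40]
approx_cos = [0.30, 0.50, 0.70, 0.90]
params = fit_matching(true_cos, approx_cos)
matched = [match(t, *params) for t in approx_cos]
```

In this toy case the distortion is exactly affine, so matching recovers the true values; on real data it only aligns the first two moments, and a residual error remains.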
Overall Algorithm Construction of FINGER is summarized in Algorithm 2. Our aim is to provide a generic acceleration for all graph-based search methods. Thus, we can build the search index from any existing graph $G$. FINGER first iterates through all nodes in the graph. For each node, FINGER samples a pair of distinct nodes from its neighbors. In addition, we calculate the residual vector of one sampled point and store it for later use. We hypothesize that the collected residual vectors $D_{res}$ span the residual space, and we find their optimal low-rank approximation by SVD. Once the low-rank projection $P$ is ready, we can estimate the mean and variance of the angle distribution and of the approximated distribution respectively (i.e., lines 9 and 10 in Algorithm 2). Certainly, this distribution matching scheme would still produce error. We therefore also compute the average L1 error between real and approximated angles to serve as an error correction term. With this information saved in a search index, Algorithm 3 approximates the distance between a query $q$ and a data point $d$. Notice that we explicitly write out the projection matrix $P$ and center node $c$ in Algorithm 3 to make the full approximation workflow easier to understand. In practice, the projected residual vector $P d_{res}$ can be pre-computed and stored. The detailed computation is illustrated in the Supplementary.
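Putting the pieces together, the approximate distance workflow can be sketched as below. This is an illustrative reading of the description, not the released C++ implementation: it recomputes projections and residuals on the fly instead of using pre-computed values, represents $P$ as a list of rows, folds the error correction term `eps` into the matched cosine, and assumes non-zero residuals. All names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def decompose(v, c):
    """Projection of v onto c and the orthogonal residual."""
    scale = dot(v, c) / dot(c, c)
    proj = [scale * x for x in c]
    return proj, [x - p for x, p in zip(v, proj)]

def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

def approx_sq_l2(q, d, c, P, mu, sigma, mu_hat, sigma_hat, eps=0.0):
    """Approximate ||q - d||^2 using a low-rank estimate of cos(q_res, d_res)."""
    q_proj, q_res = decompose(q, c)
    d_proj, d_res = decompose(d, c)
    # Low-rank angle estimate in the r-dimensional projected space.
    t_hat = cosine([dot(row, q_res) for row in P], [dot(row, d_res) for row in P])
    # Distribution matching followed by the error correction term.
    t = (t_hat - mu_hat) * sigma / sigma_hat + mu + eps
    diff = [a - b for a, b in zip(q_proj, d_proj)]
    q_res_sq, d_res_sq = dot(q_res, q_res), dot(d_res, d_res)
    # Replace q_res^T d_res by ||q_res|| * ||d_res|| * t in the decomposed distance.
    return dot(diff, diff) + q_res_sq + d_res_sq - 2.0 * math.sqrt(q_res_sq * d_res_sq) * t

# Sanity check: with a full-rank identity projection and neutral matching
# parameters, the approximation coincides with the exact squared distance.
q, c, d = [1.0, 2.0, 0.5], [0.8, 1.5, 1.0], [0.6, 1.9, 1.4]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
approx = approx_sq_l2(q, d, c, identity, mu=0.0, sigma=1.0, mu_hat=0.0, sigma_hat=1.0)
exact = sum((x - y) ** 2 for x, y in zip(q, d))
```

With $r < m$ rows in `P`, the two inner loops over `P` are the only per-neighbor cost that depends on the vector dimension, which is where the saving over a full $m$-dimensional distance comes from.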

Experimental Setups
Baseline Methods We compare FINGER to the most competitive graph-based and quantization methods. We include different implementations of the popular HNSW method, such as NMSLIB [32], n2, PECOS [43] and HNSWLIB [32]. Other graph construction methods include NGT-PANNG [39], VAMANA (DiskANN) [24] and PyNNDescent [11]. Since our goal is to demonstrate that FINGER can improve the search efficiency of an underlying graph, we mainly include competitive methods with a good Python interface and documentation. For quantization methods, we compare to the best performing ScaNN [19] and Faiss-IVFPQFS [27]. In experiments, we combine FINGER with HNSW as it is a simple and prevalent method. The implementation of HNSW-FINGER is based on a modification of PECOS, as its codebase is easy to read and extend. Pre-processing time and memory footprint are discussed in Supplementary F.
Algorithm 2 (excerpt): 8 N = size of X
Algorithm 3: Approximate Distance Function
Input: query $q$, projection matrix $P$, center node $c$, data point $d \in$ neighbors of $c$, distribution parameters $\mu, \sigma, \hat{\mu}, \hat{\sigma}$, error correction term $\epsilon$
Output: $t$, the approximated distance between $q$ and $d$
compute $q_{res}$ and $d_{res}$ with $c$ and Eq. 1
compute $\hat{t} = \cos(P q_{res}, P d_{res})$
$t = (\hat{t} - \hat{\mu})\frac{\sigma}{\hat{\sigma}} + \mu$, $t = t + \epsilon$, return $t$
Evaluation Protocol and Dataset We follow the ANN-benchmark protocol [2] to run all experiments. Instead of using a single set of hyper-parameters, the protocol searches over a pre-defined set of hyper-parameters for each method, and reports the best performance in each recall regime. In other words, it allows each method to compete with the others using its own best hyper-parameters within each recall regime. We follow this protocol to measure recall@10 values and report the best performance over 10 runs. Results are presented as throughput versus recall@10 charts. A method is better if the area under its curve is larger. All experiments are run on an AWS r5dn.24xlarge instance with an Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz. We evaluate results on both L2-based and angular-based metrics. We denote a dataset with the following format: (dataset name)-(training data size)-(dimensionality of dataset). For the L2 distance measure, we evaluate on FashionMNIST-60K-784, SIFT-1M-128, and GIST-1M-960. For the cosine distance measure, we evaluate on NYTIMES-290K-256, GLOVE-1.2M-100 and DEEP-10M-96. More details of each dataset can be found in [2].

Improvements of FINGER over HNSW
In Figure 5, we demonstrate how FINGER accelerates the competitive HNSW algorithm on all datasets. Since FINGER is implemented on top of PECOS, it is important to check whether PECOS provides any advantage over other HNSW libraries. Results verify that across all 6 datasets, PECOS does not have an edge over other HNSW implementations, so the performance difference between FINGER and other HNSW implementations can be mostly attributed to the proposed approximate distance search scheme. We observe that FINGER greatly boosts performance on all datasets and outperforms existing graph-based algorithms. FINGER works well not only on datasets with large dimensionality, such as FashionMNIST-60K-784 and GIST-1M-960, but also on datasets with dimensionality between 96 and 128. This shows that FINGER can accelerate the distance computation across different dimensionalities. Results of the comparison to the most competitive graph-based methods are shown in Figure 8 of Supplementary D. Briefly, HNSW-FINGER outperforms most state-of-the-art graph-based methods, except on FashionMNIST-60K-784, where PyNNDescent performs best and HNSW-FINGER is the runner-up. Notice that FINGER could also be implemented over other graph structures, including PyNNDescent. We chose to build on top of the HNSW algorithm only due to its simplicity and popularity. Studying which graph-based method benefits most from FINGER is an interesting future direction.
Here, we aim at empirically demonstrating that an approximated distance function can be integrated into the greedy search of graph-based methods to achieve better performance.
Figure caption: Throughput versus recall@10 is plotted for three datasets. Each method has its pros and cons, and no single method performs best on all datasets.

Ablation Study
We conduct an ablation study to assess the effectiveness of each component of FINGER. First, we compare FINGER to the popular random projection locality sensitive hashing (RPLSH) for angle estimation. Since our C++ implementation of FINGER is heavily optimized, a direct comparison on wall-clock time would not be fair. Instead, we compare the two schemes by counting the effective number of distance function calls. We collect the number of full distance calls and approximate distance calls separately, and combine them into an effective number of distance calls. For example, if we call the full $m$-dimensional distance $a$ times and the $r$-dimensional approximate computation $b$ times, the effective number of distance calls is $a + b\frac{r}{m}$. We first analyze estimation quality by the approximation error, defined as $\frac{|t - \hat{t}|}{|t|}$, where $t$ is the true cosine value and $\hat{t}$ is the approximated value. Ideally, a better approximation scheme results in a smaller approximation error. However, a smaller approximation error does not necessarily yield better recall. Thus, we also analyze the performance based on recall. Results of the trade-off between approximation error (%) and effective number of distance calls are shown in Figure 6(a) for FashionMNIST-60K-784 and 6(b) for GLOVE-1.2M-100. Corresponding results of recall versus effective distance calls are shown in Figure 6(c) and 6(d). FINGER achieves smaller approximation errors than RPLSH on both datasets, which shows that FINGER is indeed a better low-rank approximation given the data distribution. We also observe that smaller approximation error transfers to higher recall on both datasets. In addition, we apply distribution matching to RPLSH and find that it greatly improves RPLSH. This shows that distribution matching is a generic method that improves the performance of different angle estimation methods. But even with the aid of distribution matching, RPLSH cannot achieve performance similar to FINGER, which shows the superiority of the SVD results.
Since FINGER consists of a low-rank approximation module plus a distribution matching module, we are interested in studying their individual effectiveness. We conduct a similar analysis on the full method and on a low-rank-only version of FINGER, shown in Figure 6. Low-rank approximation alone still provides a much better angle estimation than RPLSH. Even without the distribution matching scheme, low-rank angle estimation outperforms RPLSH with distribution matching. We observe limited difference between FINGER and FINGER without distribution matching on FashionMNIST-60K-784. However, the difference is more significant on GLOVE-1.2M-100, which shows that distribution matching is still effective.
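The two ablation metrics are simple enough to state in code. The sketch below uses hypothetical function names and made-up counts purely to make the accounting concrete: an $r$-dimensional approximate call is charged as $r/m$ of a full $m$-dimensional call.

```python
def effective_distance_calls(full_calls, approx_calls, r, m):
    """a full m-dimensional calls plus b r-dimensional calls, counted as a + b * r / m."""
    return full_calls + approx_calls * r / m

def approximation_error(t, t_hat):
    """Relative error |t - t_hat| / |t| between true and approximated cosine values."""
    return abs(t - t_hat) / abs(t)

# Toy numbers: 100 full calls and 400 rank-16 approximate calls on 128-dimensional data.
calls = effective_distance_calls(full_calls=100, approx_calls=400, r=16, m=128)
error = approximation_error(t=0.5, t_hat=0.4)
```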

Comparison to Quantization Results
In addition to graph-based methods, we are also interested in how HNSW-FINGER performs compared to state-of-the-art quantization methods. Results of comparisons to quantization methods are shown in Figure 7. As we can observe, no single method achieves the best performance on all tasks. Faiss-IVFPQFS performs well on NYTIMES-290K-256 but fails on DEEP-10M-96. ScaNN performs consistently well on all datasets but does not achieve top performance on any of them. HNSW-FINGER performs competitively on GIST-1M-960 and DEEP-10M-96 but worse on NYTIMES-290K-256. These results show that quantization provides some advantages over graph-based methods, but the advantage is not consistent across datasets. Studying how to combine the advantages of quantization methods with FINGER and graph-based methods is an interesting future direction.

Conclusions and Social Impact
In this work, we propose FINGER, a fast inference method for graph-based AKNNS. FINGER approximates the distance function in graph-based methods by estimating angles between neighboring residual vectors. FINGER constructs low-rank bases to estimate residual angles and uses distribution matching to achieve better precision. The approximated distance can be used to bypass unnecessary distance evaluations, which translates into faster search. Empirically, FINGER on top of HNSW is shown to outperform existing graph-based methods on most benchmark datasets. This work mainly focuses on accelerating existing models with approximate computations. It does not directly touch any controversial part of the data, and thus it is unlikely to have a negative social impact. When used correctly to spread positive information, it can help accelerate that propagation, as the work accelerates inference speed.

A Formulation of Inner-product
In the main text, we presented the derivation for L2 distance; in this section we derive the approximation for the inner-product distance measure. Notice that the angle measure can be obtained by first normalizing data vectors and then applying the inner-product distance, so the derivation is the same. For a query $q$ and a data point $d$, the inner-product distance measure is $\mathrm{Dist} = q^T d$. Similar to L2 distance, we can apply the same decomposition to write $q = q_{proj} + q_{res}$ and $d = d_{proj} + d_{res}$. Substituting the decomposition into the distance definition, we have $\mathrm{Dist} = q_{proj}^T d_{proj} + q_{res}^T d_{res}$. As in the L2 case, $q_{proj}$ and $d_{proj}$ can be obtained by simple operations, and the remaining uncertain term is again $q_{res}^T d_{res}$. Therefore, in the inner-product case, the angle between neighboring residual vectors is still the target to approximate.
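For completeness, the substitution step can be written out term by term. The expansion below only assumes that projections lie along the center node $c$ and residuals are orthogonal to $c$, so both cross terms vanish; the closed form for the projection term follows from writing each projection as a scalar multiple of $c$.

```latex
\begin{aligned}
q^T d &= (q_{proj} + q_{res})^T (d_{proj} + d_{res}) \\
      &= q_{proj}^T d_{proj} + \underbrace{q_{proj}^T d_{res}}_{=\,0}
         + \underbrace{q_{res}^T d_{proj}}_{=\,0} + q_{res}^T d_{res} \\
      &= q_{proj}^T d_{proj} + q_{res}^T d_{res},
\qquad
q_{proj}^T d_{proj} = \frac{(q^T c)\,(d^T c)}{\|c\|_2^2}.
\end{aligned}
```

The projection term therefore needs only the scalars $q^T c$ and $d^T c$, the latter of which can be stored per edge.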

B Proof of proposition 3.1
Proof.We can firstly construct all possible pairs of N (N −1) 2 combinations of sample of vectors x, y from D res and compile all N (N −1) 2 pairs into two matrices X and Y .With this notation, we can rewrite the original optimization into matrix form: = arg min where • F denotes matrix frobenius norm.By introducing the matrix notation, we then explicitly write out the overall objective function without the sampling.We can further denote Z = X − Y .Z matrix then denotes all possible pairs of vector difference from our original distribution.The objective function can then further be written into: where U z S z V T z denotes the SVD decomposition of Z.By the basic properties of SVD decomposition, we know that F as U z and V z are unitary matrices.S z 2 F equals sum of square of singular values of Z. Similarly, P U z S z V T z 2 F .Thus it's not hard to see that the objective function is to find a projection direction which will result the minimal difference between the projected S z and full sum of squared eigenvalues of S z .Thus, the optimal answer is the top r directions as of columns of matrix U z as it will cancel out the top r square of eigenvalues of S z which happens to be the largest ones.
It remains to show that the SVD of $Z$ is essentially the same as the SVD of $D_{res}$. Notice that both $X$ and $Y$ are obtained by duplicating and re-ordering the columns of $D_{res}$, so $X$ and $Y$ share the same basis as $D_{res}$. Denote the SVD of $D_{res}$ by $D_{res} = U S V^\top$. We can then represent $X = U S V_x^\top$ and $Y = U S V_y^\top$. Consequently, we can also write $Z = X - Y = U S V_x^\top - U S V_y^\top = U S (V_x - V_y)^\top$, so $Z$ shares the same basis as $D_{res}$ and the proof is complete.
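
The argument can be illustrated numerically. The sketch below is a toy check rather than part of the proof: it assumes Gaussian residual vectors that are centered (so that $ZZ^\top$ is an exact multiple of $D_{res} D_{res}^\top$) and a projection $P$ with orthonormal rows. All names and sizes are illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
m, N, r = 6, 40, 2

# Toy residual vectors as columns, anisotropic and centered (assumption of this sketch).
D_res = rng.standard_normal((m, N)) * np.arange(1, m + 1)[:, None]
D_res -= D_res.mean(axis=1, keepdims=True)

# Z holds x - y for all N(N-1)/2 unordered pairs of columns.
i, j = np.array(list(itertools.combinations(range(N), 2))).T
Z = D_res[:, i] - D_res[:, j]

def top_r_projection(M, r):
    """Projection whose rows are the top-r left singular vectors of M."""
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :r].T

def objective(P, Z):
    """||Z||_F^2 - ||P Z||_F^2: the energy lost by projecting Z with P."""
    return np.linalg.norm(Z) ** 2 - np.linalg.norm(P @ Z) ** 2

loss_from_Z = objective(top_r_projection(Z, r), Z)
loss_from_D = objective(top_r_projection(D_res, r), Z)

# Random orthonormal projections for comparison.
random_losses = []
for _ in range(20):
    Q, _ = np.linalg.qr(rng.standard_normal((m, r)))
    random_losses.append(objective(Q.T, Z))

print(loss_from_Z, loss_from_D, min(random_losses))
```

In this setting the projection built from $D_{res}$ attains the same loss as the one built from $Z$, and neither is beaten by a random orthonormal projection, which avoids forming the much larger pair matrix in practice.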

D Complete Comparison of Graph-based Methods
Complete results of all graph-based methods are shown in Figure 8. HNSW-FINGER outperforms all existing graph-based methods except on FashionMNIST-60K-784, where PyNNDescent performs extremely well. In principle, FINGER could also be applied to PyNNDescent to further improve the result. The results show that currently no graph-based method completely exploits the training data distribution. This reflects the importance of inference acceleration methods such as FINGER, which can deliver consistently faster inference on any underlying search graph. Making a search graph maximally suitable for applying FINGER is also an interesting future direction.

E Selection of Rank Parameter r
Under the ANN-benchmark protocol, we could have treated the rank r in FINGER as a hyper-parameter and searched over it to achieve the best performance, but this might be time-consuming in real applications. Instead, we provide a practical rule of thumb for choosing r based on the correlation coefficient of X and Y in Algorithm 2, where X stores the true angles between neighboring pairs and Y stores the approximated angles. We start with r = 8 in order to maximally leverage SIMD: AVX2 allows a single instruction to perform 8 single-precision floating-point computations in parallel, so increasing the rank in multiples of 8 makes full use of the SIMD instructions. If the correlation is smaller than 0.7, we enlarge r by 8 and rerun Algorithm 2 with the increased r, repeating until the correlation between X and Y is larger than 0.7. In this work, to show the
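
The rule of thumb for choosing r can be sketched as a simple loop. This is an illustrative sketch only: `angle_pairs` is a hypothetical callback standing in for Algorithm 2 (it returns the true and approximated angles for a given r), `toy_angle_pairs` is synthetic data whose noise shrinks as r grows, and the `max_rank` cap is an added safeguard that the text does not mention.

```python
import numpy as np

def select_rank(angle_pairs, max_rank, step=8, threshold=0.7):
    """Grow r in multiples of `step` until the true and approximated angles
    correlate above `threshold` (or the hypothetical cap `max_rank` is hit)."""
    r = step
    while True:
        true_angles, approx_angles = angle_pairs(r)
        corr = float(np.corrcoef(true_angles, approx_angles)[0, 1])
        if corr > threshold or r + step > max_rank:
            return r, corr
        r += step

def toy_angle_pairs(r, n=5000, seed=0):
    """Hypothetical stand-in for Algorithm 2: approximation noise shrinks with r."""
    rng = np.random.default_rng(seed)
    true_angles = rng.standard_normal(n)
    approx_angles = true_angles + (20.0 / r) * rng.standard_normal(n)
    return true_angles, approx_angles

rank, corr = select_rank(toy_angle_pairs, max_rank=128)
print(rank, round(corr, 3))
```

Because each candidate r requires only one pass of the correlation computation, this search is much cheaper than tuning r against end-to-end recall and throughput.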

F Pre-processing Time and Memory footprint of HNSW-FINGER and HNSW
Examples of the pre-processing time and memory footprint of HNSW-FINGER and HNSW are shown in Table 1. FINGER requires an additional linear scan of the training data, which adds some processing time to the base method. The difference is around 90 seconds, which is not significant compared to the pre-processing time of the base HNSW method. The memory usage of HNSW is approximately the memory of the data plus the number of edges |E| × sizeof(int). For a selected low-rank dimension r, FINGER requires an additional (r + 2) × |E| × sizeof(float) to store the pre-computed values.
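
The two memory formulas can be turned into a quick estimate. The sketch below assumes 4-byte floats and ints and data stored as float32; the dataset size, dimension, edge count, and rank in the example are hypothetical and are not taken from Table 1.

```python
def hnsw_memory_bytes(n_points, dim, n_edges, sizeof_float=4, sizeof_int=4):
    """Approximate HNSW footprint: raw data plus one int per edge."""
    return n_points * dim * sizeof_float + n_edges * sizeof_int

def finger_extra_bytes(n_edges, r, sizeof_float=4):
    """Additional FINGER storage: (r + 2) floats per edge."""
    return (r + 2) * n_edges * sizeof_float

# Hypothetical index: 1M points, 128 dimensions, 32 edges per point, rank r = 16.
n_points, dim, r = 1_000_000, 128, 16
n_edges = 32 * n_points

base = hnsw_memory_bytes(n_points, dim, n_edges)
extra = finger_extra_bytes(n_edges, r)
print(f"HNSW: {base / 1e9:.2f} GB, FINGER extra: {extra / 1e9:.2f} GB")
```

Since the extra storage scales with r × |E|, both the rank and the graph degree directly affect the additional footprint.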

G Detailed Computation of Approximate Distance
As mentioned in the main text, we explicitly write out the projection matrix P and center node c in Algorithm 3 to make it easier to understand the full approximation workflow. In practice, the projected residual vector P d_res can be pre-computed and stored in the search index to save inference

Figure 2: Illustration of the empirical observation that most points in a database have a distance to the query larger than the upper bound. (a) FashionMNIST-60K-784 dataset. (b) Glove-1.2M-100 dataset. Starting from the 5th step of greedy graph search (i.e., running line 2 in Algorithm 1 five times), both experiments show that more than 80% of data points have distances larger than the current upper bound.

Figure 3: Illustration of the empirical observation that normalized cosine values of neighboring residual vectors follow an approximately Gaussian distribution on FashionMNIST-60K-784 and SIFT-1M-128. Left column, (a) and (c): angles of neighboring residual pairs are approximately Gaussian-distributed. Right column, (b) and (d): un-normalized inner-product values between neighboring residual pairs are more skewed.

Figure 4: Illustration of Distribution Matching. In the left column, we show the correct angle distributions of FashionMNIST-60K-784 and SIFT-1M-128. In the right column, we show the angles of neighboring residual pairs calculated by low-rank approximation (r = 16). Our goal is to transform the approximated results (in red) into the real ones (in green).

Figure 5: Experimental results of graph-based methods. Throughput versus Recall@10 is plotted for all datasets. The top row presents datasets with the L2 distance measure and the bottom row presents datasets with the angular distance measure. We can observe a significant performance gain of FINGER over all existing graph-based methods.

Figure 6: Results of ablation studies on FashionMNIST-60K-784 and GLOVE-1.2M-100. (a) and (b) show approximation error (%) versus the effective number of full distance calls; FINGER achieves smaller error than RPLSH. (c) and (d) show Recall@10 versus the effective number of full distance calls; FINGER achieves higher recall.

Figure 7: Comparisons to competitive quantization methods. Throughput versus Recall@10 is plotted for three datasets. Each method has its pros and cons, and no single method performs best on all datasets.

Figure 8: Experimental results of graph-based methods. Throughput versus Recall@10 is plotted for all datasets. The top row presents datasets with the L2 distance measure and the bottom row presents datasets with the angular distance measure. We can observe a significant performance gain of HNSW-FINGER over existing graph-based methods.

Algorithm 4: Approximate Greedy Graph Search
Input: graph G, query q, starting point p, distance function dist(), approximate distance function appx(), number of nearest points to return efs
Output: top candidate set T
1  candidate set C = {p}
2  dynamic list of currently best candidates T = {p}
3  visited V = {p}
4  while C is not empty do
5      cur ← nearest element from C to q
6      ub ← distance of the furthest element from T to q (i.e., upper bound of the candidate search)
       ...
17         if e ≤ ub or |T| ≤ efs then
18             update distance to be dist(n, q)
19             C.add(n)
20             T.add(n)
21             if |T| > efs then
22                 remove furthest point to q from T
23             ub ← distance of the furthest element from T to q (i.e., update ub)
24 return T
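
The search loop of Algorithm 4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: steps 7 to 16 are not available in the listing, so the early-exit test, the neighbor loop, and the `warmup` rule for when `appx` replaces `dist` are assumptions modeled on standard greedy graph search. The sketch also uses the strict test `len(T) < efs`, as in standard HNSW search, so that the upper bound actually prunes once T is full. The chain graph and the underestimating `appx` in the example are toy choices.

```python
import heapq

def approx_greedy_search(graph, q, p, dist, appx, efs, warmup=5):
    """graph maps each node to a list of neighbours. Returns up to efs nodes, nearest first."""
    d0 = dist(p, q)
    C = [(d0, p)]     # min-heap of candidates
    T = [(-d0, p)]    # max-heap (negated distances) of current best points
    V = {p}           # visited set
    steps = 0
    while C:
        d_cur, cur = heapq.heappop(C)
        ub = -T[0][0]
        if d_cur > ub:            # assumed early exit (missing from the listing)
            break
        steps += 1
        for n in graph[cur]:
            if n in V:
                continue
            V.add(n)
            exact = steps <= warmup          # assumed warm-up rule
            e = dist(n, q) if exact else appx(n, q)
            if e <= ub or len(T) < efs:
                d = e if exact else dist(n, q)   # pay for the full distance only now
                heapq.heappush(C, (d, n))
                heapq.heappush(T, (-d, n))
                if len(T) > efs:
                    heapq.heappop(T)             # drop the furthest point
                ub = -T[0][0]
    return [n for _, n in sorted((-neg, n) for neg, n in T)]

# Toy example: 10 points on a line, connected as a chain.
pos = [float(i) for i in range(10)]
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
dist = lambda n, q: abs(pos[n] - q)
appx = lambda n, q: 0.9 * dist(n, q)   # toy approximation that underestimates

result = approx_greedy_search(graph, 6.2, 0, dist, appx, efs=3, warmup=2)
print(result)
```

The key design point is that a neighbor whose approximate distance already exceeds the upper bound is skipped without ever computing its full distance.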