DotHash: Estimating Set Similarity Metrics for Link Prediction and Document Deduplication

Metrics for set similarity are a core aspect of several data mining tasks. To remove duplicate results in a Web search, for example, a common approach looks at the Jaccard index between all pairs of pages. In social network analysis, a much-celebrated metric is the Adamic-Adar index, widely used to compare node neighborhood sets in the important problem of predicting links. However, with the increasing amount of data to be processed, calculating the exact similarity between all pairs can be intractable. The challenge of working at this scale has motivated research into efficient estimators for set similarity metrics. The two most popular estimators, MinHash and SimHash, are indeed used in applications such as document deduplication and recommender systems where large volumes of data need to be processed. Given the importance of these tasks, the demand for advancing estimators is evident. We propose DotHash, an unbiased estimator for the intersection size of two sets. DotHash can be used to estimate the Jaccard index and, to the best of our knowledge, is the first method that can also estimate the Adamic-Adar index and a family of related metrics. We formally define this family of metrics, provide theoretical bounds on the probability of estimate errors, and analyze its empirical performance. Our experimental results indicate that DotHash is more accurate than the other estimators in link prediction and detecting duplicate documents with the same complexity and similar comparison time.


INTRODUCTION
Many current challenges, and opportunities, in computer science stem from the sheer scale of the data to be processed [19,26]. Among these challenges, one of the most outstanding is comparing collections of objects, or simply sets. This demand arises, for example, in important problems such as comparing text documents or social media profiles. The challenge is often not the size of each set, but the number of pairwise comparisons over a large dataset of sets. This has motivated research on estimating existing set similarity measures, the main subject of this paper.
The search for methods to compare sets of elements is longstanding: more than a century ago, Gilbert [23] and Jaccard [35] independently proposed a measure that is still widely used, known as the Jaccard index. The metric is defined as the ratio between the sizes of the intersection and the union of two sets. With the explosion of available data, brought about mainly by the advent of the Web, the Jaccard index has become prevalent as an essential tool in data mining and machine learning. Important applications include information retrieval [55,63], natural language processing [66,72], and image processing [48,57], among several others [9].
Another academic field in which set similarity has become crucial is network science. This field studies network representations of physical, biological and social phenomena and is used to understand complex systems in various disciplines [3]. Such networks are modeled using graphs, a mathematical abstraction tool where sets are ubiquitous. Marked by its growing relevance, this area also gave rise to one of the most famous set similarity metrics: the Adamic-Adar index [1]. The index was proposed as an alternative to Jaccard for the problem of predicting links, such as friendship or co-authorship, in social networks.
Link prediction is a widely studied problem, with applications in several Web-related tasks, including hyperlink prediction [74], recommender systems in e-commerce [45], entity resolution [51], and friend recommendation in social networks [15,21], among others [27]. In this problem, each node is characterized by its set of adjacent nodes, or neighbors. The intuition is that nodes with similar neighborhoods tend to become neighbors. The Adamic-Adar index is used to compare these sets of neighbors, but unlike Jaccard, it assigns different weights to neighbors (see Section 2.2). Adamic-Adar is known to be superior to Jaccard for modeling the phenomenon of link emergence in networks in various real-world applications [47,53]. The success of Adamic-Adar also motivated the emergence of several metrics with different ways of weighting neighbors [7,16,49], which will be discussed in Section 2.3.
A second prime example of an application that demands an enormous number of set comparisons is the removal of (near-)duplicate pages in Web search. Eliminating such pages saves network bandwidth, reduces storage costs and improves the quality of search results [52]. In this domain, each document is commonly treated as a set of word sequences. To find duplicates in a corpus of 10 million pages, a relatively small scale for Web applications [52], it would already be necessary to compute the set similarity metric about 50 trillion times. It was precisely in the face of this challenge that the problem of estimating set similarity metrics became highly relevant and triggered numerous scientific endeavors.
MinHash [4] and SimHash [8], the two best-known estimators, were initially developed for the above-mentioned problem and used respectively in the AltaVista and Google Web search engines. Other current applications taking advantage of set similarity estimation include genomic and metagenomic analysis [41,56], graph comparison [68], collaborative filtering [14], natural language dataset preparation [42], and duplicate detection of other types of data such as images [12]. SimHash is also used in locality sensitive hashing (LSH), a technique applied to detect if the similarity of sets exceeds a given threshold, used in problems such as dimensionality reduction, nearest neighbor search, entity resolution and fingerprint matching [44]. An important remark is that, despite being related, LSH and the problem addressed in this paper of estimating metrics directly are distinct and both individually relevant, as we will discuss in Section 3. This wide range of relevant applications illustrates the importance and potential of set similarity estimators to push the boundaries of the problems where they are applied.
Despite the importance of the estimators mentioned above, previous works reveal limitations of these techniques. As is common with estimators, their accuracy is a function of the input as well as the value to be estimated. In the original MinHash paper, Broder [4] indicates that the estimator's accuracy is at its worst when the Jaccard index is around 0.5. Koslicki and Zabeti [41] show that the probability of the estimator deviating from the true value increases exponentially with the difference in size between the sets to be compared. Nonetheless, Shrivastava and Li [65] provide theoretical arguments for the superiority of MinHash over the other baseline, SimHash. More importantly, these techniques are not able to estimate indices like Adamic-Adar. In Section 5, we will show experimentally that this limitation makes them less appropriate for the link prediction and near-duplicate detection problems.
Commendable effort has been devoted to reshaping these techniques to overcome these limitations for specific applications, as we discuss in Section 3.3. Nevertheless, given its relevance, we argue that it is also important to pursue alternative techniques that explore new perspectives on the problem. With this in mind, we propose DotHash, a novel set similarity estimator. Like its predecessors, DotHash is based on creating a fixed-size representation of sets (often called a sketch, signature or fingerprint in the literature) that can be compared to estimate the similarity between the sets. Creating these compressed representations introduces preprocessing time, but dramatically reduces the time for each comparison. The central idea of DotHash is to exploit valuable features of high-dimensional random vector spaces, in particular their tendency towards orthogonality [37], to create fixed-dimension sketches while retaining as much information from the original space as possible. In Theorem 2, we show that the cardinality of the intersection of two sets can be estimated, without bias, by a simple dot product between their DotHash sketches. As the dot product is such a fundamental operation, we argue that DotHash can take advantage of recent progress in modern hardware platforms for enhanced performance and energy efficiency [54,71].
In addition to the theoretical contribution of a new baseline framework for the problem of comparing sets in its general formulation, we also show that DotHash has immediate practical relevance. We conducted experiments with several popular datasets for the problems of link prediction and near-duplicate detection, which show that the accuracy obtained by DotHash is higher than that obtained with SimHash and MinHash. For this, we exploit the fact that DotHash is able to estimate a more general family of metrics, which are better adapted to applications, as is the case of Adamic-Adar for link prediction. It is worth noting that the time complexity for each comparison is linear in the size of the DotHash sketches regardless of the metric, since it consists of computing the dot product between them. As previously mentioned, these pairwise comparisons between sketches dictate the overall time complexity.

BACKGROUND
In this section, we introduce the most relevant set similarity metrics for the purpose of our work. We start with the Jaccard and Adamic-Adar indices, for which we show theoretical and empirical results. Then, we provide an overview of other metrics that DotHash is able to estimate directly. It is worth noting that many alternative set similarity metrics have been proposed, tailored to specific scenarios by adapting the basic metrics. For an extensive overview of set similarity metrics we recommend Martínez et al. [53] and Lü and Zhou [50].

Jaccard
Consider the sets $A$ and $B$, and let $A \cap B$ denote their intersection and $A \cup B$ their union. The Jaccard index [35] between $A$ and $B$ is defined as:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
where the vertical bars denote the cardinality of the enclosed set. This is one of the oldest and most established ways of comparing sets. Over time, numerous adaptations of this simple metric have emerged, specializing it for particularly interesting applications. Next, we describe one of these adaptations which is widely used, especially in network science [3].
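As a concrete reference point for the estimators discussed later, the exact Jaccard index can be computed directly with set operations. The following is a minimal sketch; the function name, the example shingles and the convention for two empty sets are illustrative assumptions rather than part of the original text.

```python
def jaccard(a: set, b: set) -> float:
    """Exact Jaccard index: size of the intersection over size of the union."""
    union = a | b
    if not union:
        # Convention chosen for this sketch only: two empty sets score 0.0.
        return 0.0
    return len(a & b) / len(union)


# Two tiny "documents" represented as sets of word 2-grams (illustrative data).
doc_a = {"the quick", "quick brown", "brown fox"}
doc_b = {"quick brown", "brown fox", "fox jumps"}

# 2 shared shingles out of 4 distinct shingles overall.
print(jaccard(doc_a, doc_b))
```

Each exact comparison touches every element of both sets, which is what makes all-pairs computation expensive at scale.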

Adamic-Adar
The Adamic-Adar index was created for the problem of link prediction in social graphs [1]. Let $G = (V, E)$ be a graph, composed of a set of nodes $V$ and a set of edges $E$. Each $e = (u, v) \in E$ represents an edge (or link) between nodes $u, v \in V$. With $\Gamma(u) \subseteq V$ we denote the subset of nodes that are adjacent to $u$, i.e., the neighbors of $u$. The cardinality of a neighborhood $|\Gamma(u)|$ is referred to as the degree of $u$.
In the context of graphs, Jaccard can be used to compare pairs of nodes by looking at how many connections they have in common, normalized by the size of the union of their neighborhoods. Adamic-Adar seeks to improve this comparison based on the intuition that the more popular a node in the intersection is (i.e., the higher its degree), the less informative it is about the similarity of the nodes being compared. In the case of social networks, for example, a common connection with a celebrity says little about the chance of two people connecting with each other compared to a less popular mutual friend. To account for this, Adamic-Adar penalizes the number of connections that each shared connection has by taking the logarithm of its degree. Formally, the Adamic-Adar index between two nodes $u$ and $v$ is defined as:
$$A(u, v) = \sum_{x \in \Gamma(u) \cap \Gamma(v)} \frac{1}{\log |\Gamma(x)|}$$
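To make the weighting concrete, the sketch below computes the exact Adamic-Adar index on a small undirected graph stored as adjacency sets. The graph, the node names and the function name are made up for illustration. Note that a common neighbor of two distinct nodes always has degree at least two, so the logarithm in the denominator is positive.

```python
import math


def adamic_adar(graph: dict, u, v) -> float:
    """Sum of 1 / log(degree) over the common neighbors of u and v."""
    common = graph[u] & graph[v]
    return sum(1.0 / math.log(len(graph[x])) for x in common)


# Toy undirected graph as adjacency sets (hypothetical data).
graph = {
    "ann": {"bob", "star"},
    "cat": {"bob", "star"},
    "bob": {"ann", "cat"},                        # degree 2: a close mutual friend
    "star": {"ann", "cat", "dan", "eve", "fay"},  # degree 5: a popular node
    "dan": {"star"},
    "eve": {"star"},
    "fay": {"star"},
}

score = adamic_adar(graph, "ann", "cat")
print(score)  # contribution of "bob" (1/log 2) outweighs that of "star" (1/log 5)
```

The low-degree mutual neighbor contributes more than twice as much to the score as the popular one, which is the behavior the index was designed for.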

Link prediction metrics
Given the importance of link prediction, several other metrics have emerged for the purpose of comparing neighborhoods. Similar to Adamic-Adar, these metrics take the local properties of the nodes in the intersection into account. Many of these metrics can also be directly estimated using DotHash. One such example is the Resource Allocation index, used to evaluate the resource transmission between two unconnected nodes through their neighbors, which is defined as:
$$RA(u, v) = \sum_{x \in \Gamma(u) \cap \Gamma(v)} \frac{1}{|\Gamma(x)|}$$
In Section 4.5, we provide a formal description of the family of set similarity metrics that DotHash can directly estimate.

RELATED WORK
In this section we describe the two most popular estimators, MinHash and SimHash, which will then be used as baselines for the evaluation of DotHash. It is important to highlight that we compare DotHash with these two methods because they constitute the state of the art for the problem in its most general formulation and are used in current applications, as already mentioned.

MinHash
MinHash [4] is a probabilistic method for estimating the Jaccard index. The technique is based on a simple intuition: if we uniformly sample one element $x$ from the set $A \cup B$, we have that:
$$\Pr(x \in A \cap B) = \frac{|A \cap B|}{|A \cup B|} = J(A, B)$$
which makes the result of this experiment an estimator for Jaccard.
However, an important problem remains: how to uniformly sample from $A \cup B$? Explicitly computing the union is at least as expensive as computing the intersection, that is, it would be as expensive as calculating the Jaccard index exactly. The main merit of MinHash is an efficient way of circumventing this problem. Let $h : A \cup B \to \mathbb{N}$ denote a min-wise independent hash function, i.e., for any subset of the domain, the output of any of its elements is equally likely to be the minimum (see Broder et al. [5] for a detailed discussion). Then, we have:
$$\Pr\left(\min_{x \in A} h(x) = \min_{x \in B} h(x)\right) = J(A, B)$$
Given the above, the problem of uniformly sampling an element of the union and checking if it belongs to the intersection can be emulated as follows: hash the set elements and check if the smallest value obtained in both sets is the same. Although the result of this random variable is an unbiased estimator of the Jaccard index, its variance is high when the Jaccard is around 0.5. The idea of MinHash is therefore to do $k$ such experiments with independent hash functions and return the sample mean to reduce the variance.
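The procedure described above can be sketched as follows. This is an illustrative implementation, not the one used in the experiments: a salted BLAKE2b digest stands in for a min-wise independent hash family (an assumption made for simplicity), and the function names and example sets are hypothetical.

```python
import hashlib


def seeded_hash(x, seed: int) -> int:
    """64-bit hash of x under a given seed; a stand-in for a min-wise independent hash."""
    digest = hashlib.blake2b(
        repr(x).encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items: set, k: int) -> list:
    """Sketch of a set: the minimum hash value under each of k hash functions."""
    return [min(seeded_hash(x, seed) for x in items) for seed in range(k)]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of hash functions whose minimum agrees in both sketches."""
    matches = sum(1 for p, q in zip(sig_a, sig_b) if p == q)
    return matches / len(sig_a)


set_a = set(range(0, 150))
set_b = set(range(50, 200))  # exact Jaccard index: 100 / 200 = 0.5
sig_a = minhash_signature(set_a, 256)
sig_b = minhash_signature(set_b, 256)
estimate = estimate_jaccard(sig_a, sig_b)
print(estimate)  # an estimate near 0.5; the exact value depends on the hash family
```

Once the signatures are stored, each comparison costs $k$ equality checks regardless of the sizes of the original sets.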

SimHash
SimHash [8], sometimes also called angular LSH, is another popular estimator of set similarity. The sketches of sets are fixed-length binary vectors and are generated as follows: 1) all elements of the superset $S$ are mapped uniformly to vectors in $\{-1, 1\}^d$; 2) for each set $A \subseteq S$ a $d$-dimensional vector is created by adding the vectors of its elements; 3) the SimHash sketch of the set is a bit string obtained by transforming each positive entry to one and the non-positive entries to zero. The similarity between pairs of sets is then measured by the Hamming distance between these sketches.
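The three steps above can be sketched as follows. The element-to-vector mapping uses Python's `random.Random` seeded with the element's `repr`, which is an assumption made so that the mapping is deterministic without storing a table; function names and example sets are illustrative.

```python
import random


def element_vector(x, d: int) -> list:
    """Step 1: deterministic pseudo-random vector in {-1, +1}^d for element x."""
    bits = random.Random(repr(x)).getrandbits(d)
    return [1 if (bits >> i) & 1 else -1 for i in range(d)]


def simhash(items: set, d: int) -> int:
    """Steps 2 and 3: add the element vectors, then keep one sign bit per dimension."""
    totals = [0] * d
    for x in items:
        for i, value in enumerate(element_vector(x, d)):
            totals[i] += value
    sketch = 0
    for i, total in enumerate(totals):
        if total > 0:  # positive entry -> 1, non-positive entry -> 0
            sketch |= 1 << i
    return sketch


def hamming(s1: int, s2: int) -> int:
    """Number of bit positions in which two sketches differ."""
    return bin(s1 ^ s2).count("1")


d = 256
base = set(range(0, 100))
near = set(range(5, 105))      # shares 95 of its 100 elements with base
far = set(range(1000, 1100))   # disjoint from base
sk_base, sk_near, sk_far = (simhash(s, d) for s in (base, near, far))
print(hamming(sk_base, sk_near), hamming(sk_base, sk_far))
```

Storing the sketch as a single integer makes the comparison a XOR followed by a bit count, which is the main reason the Hamming formulation is cheap.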
Despite being a general estimator for set similarity metrics, SimHash owes its popularity largely to a specific use. Manku et al. [52] showed an efficient way to solve the following problem: in a collection of SimHash sketches, quickly find all sketches that differ in at most $k$ bits from a given sketch, where $k$ is a small integer. This particular formulation is very useful in the context of duplicate text detection and its efficient solution led to SimHash being used by the Google crawler.
The problem described above is an instance of the problem known as locality sensitive hashing (LSH) which, in general, tries to detect pairs of objects whose similarity exceeds a threshold by maximizing the collision probability of the hashes of these objects [32,33]. Note that LSH is used to group similar objects and the output is binary, i.e., two objects are either similar or not [44]. Therefore, we emphasize that the problem of estimating the metrics directly, addressed in this paper, is different from LSH. Directly estimating similarity has other possible outcomes, such as ordering pairs by similarity, which is crucial in some applications like query optimization [13] and link prediction, as we will discuss in Section 5.
Although SimHash is much more popular in the context of LSH, for the sake of completeness we consider SimHash as a baseline for the general problem, as the method was originally proposed by Charikar [8]. We note, however, that the other baseline, MinHash, is much more common in the literature for the problem of estimating the actual value of metrics, as we discuss in the next section.

Adjacent research and developments
Our primary focus in this study is to address the task of estimating set similarity metrics in its broadest sense. However, it is important to acknowledge other advancements in the field that are not directly aligned with our specific contribution. These advancements, discussed below, primarily involve enhancements tailored to specific contexts and applications. It is important to note that our method is not intended to directly compete with these notable developments in each individual application, but rather aims to serve as a new baseline for the general problem. We emphasize, however, that DotHash also brings practical contributions to the state of the art. This is achieved by supporting a wider range of metrics, which is formally defined in Section 4.5. As we will show in Section 5, this allows for greater accuracy in important problems such as link prediction and document deduplication.
While MinHash remains the standard framework for estimating the actual value of set similarity metrics, several techniques have been proposed to enhance its accuracy and efficiency in specific application contexts. For instance, Chum and Matas [11] propose an efficient method to compute MinHash sketches for image collections using inverted indexing. Another technique, introduced by Koslicki and Zabeti [41], employs Bloom filters for fast membership queries and is known as containment MinHash. They demonstrate the superiority of this technique in metagenomic analysis by more accurately estimating the Jaccard index when dealing with sets that significantly differ in size.
Several other works have focused on a variant of the problem that involves estimating the weighted Jaccard index [70]. For datasets with sparse data, Ertl [18] and Christiani [10] have explored the concept of consistent weighted sampling (CWS) [34] with their respective BagMinHash and DartMinHash techniques. Conversely, when dealing with dense data, methods based on rejection sampling have been demonstrated to be more efficient [46,64].
Another important recent endeavor has been to develop LSH techniques based on deep learning, known as "learn to hash" methods. These include Deep Hashing [73] and various others [17,28,40]. In general, these techniques do not estimate any particular metric directly, but seek to create sketches that allow for approximate nearest neighbor search. Another key distinction lies in the methodology employed by these approaches. They rely on constructing trained models through annotated data, where the concept of similarity is derived from the mapping of training examples to a specific target. Consequently, this similarity may not necessarily extend to other datasets, making it potentially non-generalizable. In contrast, the methods discussed in this paper estimate the similarity between two sets solely based on the sets themselves.
Although "learn to hash" methods have demonstrated promising accuracy in situations where supervised learning is viable, their broader adoption has also been hindered by other challenges. These limitations encompass high costs associated with training and inference, the inherent unpredictability due to unknown bounds in estimation error, and their high sensitivity to data distribution, often concealed by the reliance on purely empirical assessments [39,69]. As a result, traditional methods such as MinHash and SimHash continue to be utilized in important applications such as the ones mentioned earlier. Despite the inherent differences and the difficulty of setting up an accurate and unbiased study that delves deep into both approaches, we believe that a comparative study between these traditional methods and learning-based approaches would yield significant value for the scientific community.

DOTHASH
We begin by describing a simple method to compute the cardinality of the intersection between two sets. This provides the basis from which we describe the intuition for DotHash. The intuition builds on a generalization of the simple method and a subtle feature of high-dimensional vector spaces. From it, we show how we can create an estimator for the intersection cardinality. We emphasize that virtually all set similarity metrics are direct functions of the intersection cardinality, combined with other easily obtained quantities such as the size of the sets [53]. The fact that DotHash estimates the intersection cardinality directly makes it naturally extendable to all these metrics. One of the few exceptions is a family of metrics that assign different weights to the intersection items, such as the Adamic-Adar index. We conclude this section by showing how DotHash can be adapted to estimate this larger family of metrics as well, being the first estimator to enable this.

Computing the intersection size
A common way of representing sets is by using bit strings [20,61]. In this representation, an arbitrary order $[x_1, x_2, \ldots, x_{|S|}]$ of the elements of the superset $S$ is established. Then, each set $A \subseteq S$ is represented by an $|S|$-dimensional binary vector whose $i$-th bit is one if $x_i \in A$, and zero otherwise. Table 1 illustrates this representation for two example sets $A$ and $B$. This representation is especially common for graphs, where the bit strings form the rows of the adjacency matrix and each set consists of the neighborhood of a node.

Table 1: Bit string representation of sets $A$ and $B$
It is easy to observe that the size of the intersection between $A$ and $B$ is given by the number of columns where both elements are one. This provides a straightforward way to get the size of the intersection of sets: calculate the dot product between their bit strings [6,20].
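A minimal sketch of this observation is shown below. The superset ordering and the two sets are hypothetical and are not meant to reproduce the contents of Table 1.

```python
def to_bit_string(items: set, ordering: list) -> list:
    """Binary vector whose i-th entry is 1 when the i-th element of the ordering is in the set."""
    return [1 if x in items else 0 for x in ordering]


def dot(u: list, v: list) -> int:
    """Plain dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(u, v))


ordering = ["a", "b", "c", "d", "e", "f"]  # arbitrary fixed order of the superset
set_a = {"a", "b", "c", "d"}
set_b = {"c", "d", "e", "f"}

bits_a = to_bit_string(set_a, ordering)  # [1, 1, 1, 1, 0, 0]
bits_b = to_bit_string(set_b, ordering)  # [0, 0, 1, 1, 1, 1]
print(dot(bits_a, bits_b))               # 2, the size of {"c", "d"}
```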

Generalization to orthogonal vectors
An alternative way of visualizing the set bit strings, important for the generalization we propose, is to consider that each element $x_i \in S$ is encoded by an $|S|$-dimensional vector with value one in position $i$, and zero elsewhere. This representation is usually called one-hot encoding. From this, we can define the bit string of a set as the sum of its one-hot encoded elements, as illustrated in Figure 1.
In Theorem 1, we show that the dot product results in the size of the intersection of sets not only when they are the sum of one-hot encoded elements, but more generally when we encode the elements using any orthonormal basis of $\mathbb{R}^{|S|}$, of which one-hot is a particular case (the standard basis). Although this arguably trivial generalization alone may not seem advantageous at this point, in the next section we show how it is fundamental in the transition from an exact computation to an estimate with our method.
Theorem 1. Let $\psi : S \to \mathbb{R}^{|S|}$ be an injective map from the elements of $S$ to the vectors of an orthonormal basis of $\mathbb{R}^{|S|}$, and let $\mathbf{a} = \sum_{x \in A} \psi(x)$ and $\mathbf{b} = \sum_{x \in B} \psi(x)$ for sets $A, B \subseteq S$. Then, $\mathbf{a} \cdot \mathbf{b} = |A \cap B|$.
While the above yields a way of representing sets so that we can compute intersection sizes by simple dot products, notice that each set is represented using $|S|$ bits, resulting in a time and space complexity of $O(|S|)$. Although this can be useful in certain scenarios, clearly this method becomes prohibitively expensive for very large supersets $S$, for example, in the large-scale applications described in the previous sections. Another problem is that in many real applications, such as those related to social networks, the sets change over time, so there is no way to establish the superset size a priori.
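The orthonormal-basis generalization can be checked on a small case. The sketch below encodes a four-element superset with the rows of a $4 \times 4$ Hadamard matrix scaled by $1/2$, which form an orthonormal basis of $\mathbb{R}^4$ different from the standard one; the choice of basis, the element names and the helper names are illustrative assumptions.

```python
# Rows of a 4x4 Hadamard matrix scaled by 1/2: an orthonormal basis of R^4.
BASIS = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
    [0.5, 0.5, -0.5, -0.5],
    [0.5, -0.5, -0.5, 0.5],
]
ENCODING = dict(zip(["a", "b", "c", "d"], BASIS))  # one basis vector per element


def dot(u: list, v: list) -> float:
    return sum(p * q for p, q in zip(u, v))


def sketch(items: set) -> list:
    """Sum of the encodings of the elements of a set."""
    total = [0.0] * 4
    for x in items:
        total = [t + e for t, e in zip(total, ENCODING[x])]
    return total


set_a = {"a", "b", "c"}
set_b = {"b", "c", "d"}
print(dot(sketch(set_a), sketch(set_b)))  # 2.0, the size of {"b", "c"}
```

Even though neither sketch looks like a bit string any more, the dot product still recovers the exact intersection size, which is the property the estimator relaxes in the next section.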
In some cases the time and space complexity can be improved to $O(|A| + |B|)$ by restricting $\psi(\cdot)$ to the standard basis encoding, represented using a sparse vector format so that only the non-zero elements are stored. With this modification the dot product can be computed by iterating over both vectors at once, similar to merging lists. However, because of the overhead of sparse vector representations, this is mainly useful when $|A| + |B| \ll |S|$.
There are still limitations to the sparse vector improvement, especially when even $|A| + |B|$ is very large. This happens both in social networks, where nodes can have more than a million connections, and in document deduplication, where large documents can hold many word sequences. In the next section we present a method that makes the cost of each comparison independent of the set sizes by giving up the exact intersection size in favour of an estimate.
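The merge-style iteration over two sparse vectors can be sketched as follows, assuming each set is stored as a sorted list of the indices of its non-zero entries (a representation chosen for this example).

```python
def sparse_dot(a_indices: list, b_indices: list) -> int:
    """Dot product of two 0/1 vectors given as sorted lists of non-zero positions."""
    i = j = count = 0
    while i < len(a_indices) and j < len(b_indices):
        if a_indices[i] == b_indices[j]:
            count += 1  # both vectors are one at this position
            i += 1
            j += 1
        elif a_indices[i] < b_indices[j]:
            i += 1      # advance whichever list is behind
        else:
            j += 1
    return count


print(sparse_dot([1, 4, 7, 9], [2, 4, 9, 12]))  # 2 (positions 4 and 9)
```

Each index is visited at most once, so the cost is bounded by the sum of the two set sizes rather than by the size of the superset.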

Exploiting quasi-orthogonality
The method described above seems unsuitable for large-scale applications as it requires a number of dimensions equal to the size of the superset. This constraint is imposed by the fact that the smallest real vector space with $|S|$ orthogonal vectors is $\mathbb{R}^{|S|}$. Intuitively, we need orthogonality so that $\psi(x) \cdot \mathbf{a} = 1$ if, and only if, $x \in A$, and zero otherwise (here $\psi(x)$ is the encoding of element $x$, and $\mathbf{a}$ and $\mathbf{b}$ are the sums of the encoded elements of $A$ and $B$). This ensures that $\mathbf{a} \cdot \mathbf{b} = |A \cap B|$ (for a detailed discussion see the proof of Theorem 1 in Appendix A.1). From an information theory perspective, this guarantees a lossless representation of the sets by the sum of the encoded elements, since by the above operation we can verify exactly which elements make up the set.
Our proposed estimator, DotHash, relies on a very interesting property of high-dimensional vector spaces: uniformly sampled vectors are nearly orthogonal, or quasi-orthogonal, to each other with high probability [38]. This valuable feature has been explored in several other domains, especially to model human cognition and memory [22,37]. It is based on this insight that DotHash turns the above method into an estimator for the size of the intersection of sets.
Instead of using an exactly orthonormal basis of vectors of $\mathbb{R}^{|S|}$ to encode the elements of $S$, DotHash uses unit vectors sampled from $\mathbb{R}^d$, with $d < |S|$. The set sketches (fixed-length representations) are then built in the same way, by adding the encodings of their elements, as depicted in Figure 2. Intuitively, each encoded element is then quasi-orthogonal to all others, which allows us to approximate the dot product relations mentioned above. In Theorem 2 we formalize this idea, showing that the dot product between the DotHash sketches of sets is an unbiased estimator for their intersection cardinality.
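The construction just described can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the implementation used in the experiments: each element is mapped to a pseudo-random unit vector with entries $\pm 1/\sqrt{d}$ (one possible sampling scheme, assumed here), generated deterministically from the element's `repr`; all function names are hypothetical.

```python
import math
import random


def encode(x, d: int) -> list:
    """Pseudo-random unit vector for element x with entries +-1/sqrt(d) (assumed scheme)."""
    bits = random.Random(repr(x)).getrandbits(d)
    scale = 1.0 / math.sqrt(d)
    return [scale if (bits >> i) & 1 else -scale for i in range(d)]


def dothash_sketch(items: set, d: int) -> list:
    """Fixed-length sketch: the sum of the encodings of the set's elements."""
    sketch = [0.0] * d
    for x in items:
        for i, value in enumerate(encode(x, d)):
            sketch[i] += value
    return sketch


def dot(u: list, v: list) -> float:
    return sum(p * q for p, q in zip(u, v))


d = 4096
set_a = set(range(0, 50))
set_b = set(range(25, 75))  # exact intersection size: 25
estimate = dot(dothash_sketch(set_a, d), dothash_sketch(set_b, d))
print(estimate)  # an estimate close to 25; it varies with the random encodings
```

The sketches have length $d$ no matter how large the sets are, so the comparison cost is fixed once $d$ is chosen; a larger $d$ trades memory and time for a tighter estimate.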
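Theorem 2 considers sketches $\mathbf{a} = \sum_{x \in A} \psi(x)$ and $\mathbf{b} = \sum_{x \in B} \psi(x)$ built from such random unit vectors $\psi(x) \in \mathbb{R}^d$ and states that $\mathbb{E}[\mathbf{a} \cdot \mathbf{b}] = |A \cap B|$, together with an expression for the variance $\mathrm{Var}(\mathbf{a} \cdot \mathbf{b})$.
Using the variance provided in Theorem 2 and the Chebyshev inequality we can bound the probability of a relative error of at least $\epsilon$ by:
$$\Pr\left(\left|\mathbf{a} \cdot \mathbf{b} - |A \cap B|\right| \ge \epsilon |A \cap B|\right) \le \frac{\mathrm{Var}(\mathbf{a} \cdot \mathbf{b})}{\epsilon^2 |A \cap B|^2}$$
If we use the observation that each dimension can be interpreted as an independent sample, we can use the Central Limit Theorem (CLT) to approximate the probability of error as follows:
$$\Pr\left(\left|\mathbf{a} \cdot \mathbf{b} - |A \cap B|\right| \ge \epsilon |A \cap B|\right) \approx 2\,\Phi\left(-\frac{\epsilon |A \cap B|}{\sqrt{\mathrm{Var}(\mathbf{a} \cdot \mathbf{b})}}\right)$$
where $\Phi(\cdot)$ denotes the standard normal cumulative distribution function. In Figure 3 we provide the CLT estimate (solid line) and the empirical probability (dashed line). We can rewrite the CLT approximation (or the Chebyshev inequality) to get the required number of dimensions $d$ to obtain an error greater than or equal to $\epsilon |A \cap B|$ with a given probability $p$: it is the smallest $d$ for which
$$\mathrm{Var}(\mathbf{a} \cdot \mathbf{b}) \le \left(\frac{\epsilon |A \cap B|}{\Phi^{-1}(p/2)}\right)^2$$
where $\Phi^{-1}(\cdot)$ denotes the standard normal percent point function and the variance depends on $d$ through Theorem 2.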
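As a worked illustration of how such a variance can arise, the following derivation assumes (this is an assumption made for the example, not a statement taken from the theorem) that every coordinate of $\psi(x)$ is drawn independently and uniformly from $\{-1/\sqrt{d}, +1/\sqrt{d}\}$, so that each $\psi(x)$ is a unit vector. Under that assumption, the cross terms have zero mean, and the only correlated pairs are $(x, y)$ and $(y, x)$ with both elements in the intersection. Other sampling schemes would give different constants.

```latex
% Assumption: coordinates of \psi(x) are i.i.d. uniform on \{-1/\sqrt{d}, +1/\sqrt{d}\}.
\begin{align*}
\mathbf{a}\cdot\mathbf{b}
  &= |A \cap B| + \sum_{\substack{x \in A,\; y \in B \\ x \neq y}} \psi(x)\cdot\psi(y),
  \qquad \mathbb{E}[\psi(x)\cdot\psi(y)] = 0,
  \quad \mathrm{Var}(\psi(x)\cdot\psi(y)) = \tfrac{1}{d} \quad (x \neq y) \\
\mathrm{Var}(\mathbf{a}\cdot\mathbf{b})
  &= \frac{\bigl(|A||B| - |A \cap B|\bigr) + |A \cap B|\bigl(|A \cap B| - 1\bigr)}{d}
   = \frac{|A||B| + |A \cap B|^2 - 2|A \cap B|}{d} \\
\text{Chebyshev:}\quad
d &\ge \frac{|A||B| + |A \cap B|^2 - 2|A \cap B|}{p\,\epsilon^2\,|A \cap B|^2} \\
\text{CLT:}\quad
d &\approx \left(\frac{\Phi^{-1}(p/2)}{\epsilon\,|A \cap B|}\right)^{2}
   \bigl(|A||B| + |A \cap B|^2 - 2|A \cap B|\bigr)
\end{align*}
```

For example, with $|A| = |B| = 50$, $|A \cap B| = 25$ and $d = 4096$, this gives a variance of $3075 / 4096 \approx 0.75$, i.e., a standard deviation of less than one element.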

Estimating Adamic-Adar
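Now that we have established a method for estimating the size of the intersection, we describe how to adapt DotHash to estimate the Adamic-Adar index. The idea starts from the fact that $\mathbf{a}$ and $\mathbf{b}$ are sums of the vectors that encode the elements of their respective sets. Given this and the distributive property of the dot product over addition, we have:
$$\mathbf{a} \cdot \mathbf{b} = \sum_{x \in A} \sum_{y \in B} \psi(x) \cdot \psi(y)$$
Observing that $\mathbb{E}[\psi(x) \cdot \psi(y)] = 1$ if $x = y$, and zero otherwise (see the proof of Theorem 2):
$$\mathbb{E}[\mathbf{a} \cdot \mathbf{b}] = \sum_{x \in A \cap B} 1$$
The right-hand side of this equation is similar to Adamic-Adar in that both sum values over the intersection items. The key missing part is that the value to be summed must be a function of the size of the neighborhoods, not a constant. In the above case, the summation parameter is one because every element is encoded to a unit vector. However, the construction of DotHash allows us to adjust the summation parameter by modifying the magnitude of the vectors used to represent each element. To obtain the Adamic-Adar index, we want each element, in this case each node $x$, to be encoded in such a way that:
$$\mathbb{E}[\psi(x) \cdot \psi(x)] = \frac{1}{\log |\Gamma(x)|}$$
Theorem 3 shows how to adapt the vector magnitudes to obtain the Adamic-Adar index.
Theorem 3. Let each node $x$ be encoded as a random unit vector, as in Theorem 2, scaled by $1 / \sqrt{\log |\Gamma(x)|}$, and let $\mathbf{a}$ and $\mathbf{b}$ be the sums of the encodings of the nodes in $\Gamma(u)$ and $\Gamma(v)$, respectively. Then, $\mathbb{E}[\mathbf{a} \cdot \mathbf{b}] = A(u, v)$.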
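The magnitude adjustment can be illustrated with the following sketch, which scales each node's unit vector by the square root of its Adamic-Adar weight and compares the resulting dot product with the exact index on a small synthetic graph. The $\pm 1/\sqrt{d}$ sampling scheme, the graph and all names are assumptions made for this example.

```python
import math
import random
from collections import defaultdict


def unit_vector(x, d: int) -> list:
    """Pseudo-random unit vector with entries +-1/sqrt(d) (assumed sampling scheme)."""
    bits = random.Random(repr(x)).getrandbits(d)
    scale = 1.0 / math.sqrt(d)
    return [scale if (bits >> i) & 1 else -scale for i in range(d)]


def weighted_sketch(neighbors: set, weight, d: int) -> list:
    """Sum of unit vectors, each scaled by sqrt(weight(x)), so x.x contributes weight(x)."""
    sketch = [0.0] * d
    for x in neighbors:
        s = math.sqrt(weight(x))
        for i, value in enumerate(unit_vector(x, d)):
            sketch[i] += s * value
    return sketch


def dot(u: list, v: list) -> float:
    return sum(p * q for p, q in zip(u, v))


# Synthetic undirected graph (hypothetical): "u" and "v" share the nodes 7..11.
graph = defaultdict(set)


def add_edge(p, q):
    graph[p].add(q)
    graph[q].add(p)


for x in range(2, 12):
    add_edge("u", x)
for x in range(7, 17):
    add_edge("v", x)
for x in range(2, 17):
    for j in range(x - 1):          # extra leaf nodes so that degrees differ
        add_edge(x, ("leaf", x, j))


def aa_weight(x) -> float:
    return 1.0 / math.log(len(graph[x]))  # Adamic-Adar weight of node x


d = 4096
exact = sum(aa_weight(x) for x in graph["u"] & graph["v"])
estimate = dot(weighted_sketch(graph["u"], aa_weight, d),
               weighted_sketch(graph["v"], aa_weight, d))
print(exact, estimate)
```

Swapping `aa_weight` for another non-negative weight function, such as one over the degree, would estimate the corresponding metric with the same sketching code.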

General family of supported metrics
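Building upon the result presented in the previous section, we naturally extend it to encompass a general formulation of all set similarity metrics that can be directly estimated using DotHash. This family includes all metrics of the form:
$$\sum_{x \in A \cap B} f(x)$$
where $f$ assigns a weight to each element; for example, $f(x) = 1$ gives the intersection size, $f(x) = 1 / \log |\Gamma(x)|$ gives Adamic-Adar and $f(x) = 1 / |\Gamma(x)|$ gives Resource Allocation.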

EXPERIMENTS
In this section we present experiments comparing DotHash to the baselines presented in Section 3. The main goal here is to provide empirical evidence on the advantages of DotHash in the link prediction and duplicate detection tasks. All the methods were implemented using PyTorch [58] and the Torchhd library [29], and ran on a machine with 20 Intel Xeon Silver 4114 CPUs, 93 GB of RAM and 4 Nvidia TITAN Xp GPUs. The experiments, however, only used a single CPU or GPU. We repeated each experiment 5 times on the CPU and 5 times on the GPU. The code is available at: https://github.com/mikeheddes/dothash.

Datasets
An overview of the datasets used is shown in Table 2. To compare the methods under different circumstances, we consider a range of common benchmarks used in the literature that have different characteristics and are associated with different applications. For the link prediction task we evaluate each method on the following datasets:
• Drugs [24]: This dataset represents the interaction between drugs, where the joint effect of using both drugs is significantly different from their effects separately.
• Wikipedia [62]: This dataset represents a webpage network where each node represents a web page and the edges represent hyperlinks between them.
• Facebook [62]: A network of verified Facebook pages where nodes correspond to Facebook pages and edges are the mutual likes between pages.
• Proteins [67]: A protein network where nodes represent proteins from different species and edges indicate biologically meaningful associations between proteins.
• Epinions [59]: Represents the who-trusts-whom social network of the general consumer review site Epinions.com, where each node represents a user, and each edge is a directed trust relation.
• Slashdot [43]: Represents the Slashdot social network as of February 2009, where each node is a user and each edge is a directed friend/foe link.
For the document deduplication task we use the following datasets:
• CORE Deduplication Dataset 2020 [25]: This dataset consists of more than 1.3M scholarly documents labeled as duplicates or non-duplicates.
• Harvard Dataverse Duplicate Detection Restaurant dataset [2]: This dataset consists of a collection of 864 restaurant records containing 112 duplicates.
• Harvard Dataverse Duplicate Detection cddb dataset [2]: This dataset contains a set of 9,763 records with 299 duplicated entries, with each row representing information about a particular audio recording.

Link prediction
This experiment aims to demonstrate a practical advantage of DotHash. While the baseline estimators, MinHash and SimHash, are limited to estimating the Jaccard index, DotHash offers the ability to estimate more complex metrics, known to be more effective for certain applications, including link prediction. We evaluate the accuracy of each estimator in solving the link prediction problem. The problem consists of inferring which links will appear in the future, called the inference time interval, given a snapshot of a graph [47]. In practice, the task is seen as a ranking problem for pairs of nodes, i.e., the approaches compare pairs of nodes and predict that the most similar pairs are those that are likely to connect in the future. The quality of the methods is evaluated based on how well they rank pairs that actually form in the inference time interval, against random pairs that do not connect. These edges are respectively called the positive and negative samples. The most popular metric, Hits@$k$, counts the ratio of positive edges that are ranked at the $k$-th place or above relative to the negative edges [31]. The Jaccard, Adamic-Adar, Common Neighbors, and Resource Allocation indices are all used in the literature for establishing this ranking.
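The ranking metric can be sketched as below. The tie handling (a positive must score strictly above the $k$-th best negative) and the behavior when there are fewer than $k$ negatives are conventions assumed for this example; the scores are made up.

```python
def hits_at_k(pos_scores: list, neg_scores: list, k: int) -> float:
    """Fraction of positive pairs scoring strictly above the k-th highest negative score."""
    if len(neg_scores) < k:
        return 1.0  # assumed convention: every positive is within the top k
    threshold = sorted(neg_scores, reverse=True)[k - 1]
    hits = sum(1 for s in pos_scores if s > threshold)
    return hits / len(pos_scores)


# Similarity scores for pairs that did (positive) and did not (negative) connect.
positives = [0.9, 0.8, 0.3, 0.1]
negatives = [0.85, 0.5, 0.2, 0.05]

print(hits_at_k(positives, negatives, k=1))  # 0.25: only 0.9 beats the best negative
print(hits_at_k(positives, negatives, k=2))  # 0.5: 0.9 and 0.8 beat the 2nd best negative
```

Because only the ordering of the scores matters, an estimator is useful for this task as long as it preserves the ranking induced by the exact metric.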
The results, presented in Figure 4, provide evidence to substantiate the claim that estimating more appropriate metrics makes DotHash a better estimator for the link prediction problem. By employing sufficient dimensions and selecting suitable metrics, DotHash consistently outperforms the baselines across all datasets and approaches the exact indices, shown in dashed lines. Each solid line represents the mean of the values observed in the five repetitions, and the corresponding colored shades show the 95% confidence interval. Importantly, as explained in the previous section, the adoption of DotHash does not impose a significant additional computational burden compared to the baseline methods. Figure 5 shows the normalized execution time of each method on the same datasets. For more details on execution times, please refer to Appendix B.

Document deduplication
The detection of duplicate documents was the first major application to motivate the development of set similarity estimators. Both MinHash and SimHash were developed and became popular for their use in deduplicating web pages in the Google and AltaVista search engines. With the experiments in this section we seek to show how DotHash compares to these methods on this important problem.
Once again, we took advantage of the broader capability of DotHash to estimate a metric richer than the Jaccard coefficient between documents. While MinHash and SimHash give equal weight to all shingles (sequences of words) when comparing documents, with DotHash we can assign different weights to reflect how important each shingle is to each document in the corpus. Our intuition is that the more important a term is for identifying a text, the more its presence in different texts indicates their similarity.
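One of the most popular ways of evaluating how important a term is to a document is the inverse document frequency, or IDF [36]. The measure is widely used in the information retrieval literature for text mining [30,60], and is defined as:

$$\mathrm{idf}(x) = \log \frac{|D|}{|\{d \in D : x \in d\}|}$$

where $|D|$ is the number of documents in the entire corpus and $|\{d \in D : x \in d\}|$ is the number of documents that contain the term $x$. Given this, we can compare documents $A$ and $B$ not only by the number, but also by the importance of common shingles, as follows:

$$\mathrm{sim}(A, B) = \sum_{x \in A \cap B} \mathrm{idf}(x)$$

which is in the family of functions that DotHash can estimate.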
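As an illustration of the IDF-weighted comparison, the sketch below computes the exact (not estimated) similarity between toy documents: the sum of IDF values over the shingles two documents share. The helper names (`shingles`, `idf_table`, `idf_similarity`), the use of two-word shingles, and the toy corpus are all assumptions made for this example.

```python
import math


def shingles(text, n=2):
    """Set of n-word shingles of a text (n=2 is an arbitrary choice here)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def idf_table(docs):
    """Map each shingle x to log(|D| / number of documents containing x)."""
    doc_freq = {}
    for doc in docs:
        for x in doc:
            doc_freq[x] = doc_freq.get(x, 0) + 1
    return {x: math.log(len(docs) / count) for x, count in doc_freq.items()}


def idf_similarity(a, b, idf):
    """Sum of idf(x) over the shingles shared by documents a and b."""
    return sum(idf[x] for x in a & b)


# Hypothetical toy corpus.
corpus = [
    "the quick brown fox",
    "the quick brown dog",
    "a lazy dog sleeps",
    "the quick red fox",
]
docs = [shingles(text) for text in corpus]
idf = idf_table(docs)

# Documents 0 and 1 share "the quick" (common, low weight) and
# "quick brown" (rarer, higher weight); documents 0 and 3 share only "the quick".
print(idf_similarity(docs[0], docs[1], idf))
print(idf_similarity(docs[0], docs[3], idf))
```

A shingle that occurs in every document gets weight log(1) = 0, so it contributes nothing to the similarity, which matches the intuition that ubiquitous shingles say little about duplication.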
In Table 3 we show the comparative results of the three estimators on the near-duplicate detection problem for the three datasets described in Section 5.1. For each estimator we present the near-duplicate detection accuracy in terms of Hits@25 and the execution time in seconds. The number of hash values and dimensions for MinHash, SimHash, and DotHash were set to 128, 500, and 10,000, respectively. These values were chosen to ensure comparable accuracy and to enable the observation of differences in computational efficiency between the algorithms.
The numerical results indicate that DotHash is able to surpass the accuracy of MinHash on all datasets, even with a number of dimensions at which its execution time is between 0.5 and 3× faster. The same is observed in the comparison with SimHash, which obtains the lowest accuracy in all cases. These empirical results reinforce the findings of the link prediction experiments, highlighting the advantage of DotHash. By efficiently estimating richer metrics through a single dot product between set sketches, DotHash consistently delivers a superior cost-benefit trade-off compared to other estimators.

CONCLUSION
We propose DotHash, a new baseline method for estimating the similarity between sets. The method takes advantage of the tendency of sets of random high-dimensional vectors to be nearly orthogonal in order to create fixed-size representations of sets. We show that a simple dot product of these sketches serves as an unbiased estimator for the size of the intersection of sets. DotHash allows estimating a larger set of metrics than existing estimators. Our experiments show that this makes it more appropriate for link prediction and duplicate detection tasks. Taking the theoretical and practical contributions together, we see DotHash as a new framework for a problem of increasing relevance in data mining and related areas.

Theorem 1. Consider arbitrary sets $A, B \subseteq U$. Let $\phi : U \to \mathbb{R}^{|U|}$ be any injective map of their elements to vectors of one of the orthonormal bases of $\mathbb{R}^{|U|}$, and let

Theorem 3. Consider an arbitrary graph $G = (V, E)$ with nodes $V$ and edges $E$. Take any constant $d \in \mathbb{N}^+$ and let $\Gamma(v) \subseteq V$ denote the neighbors of node $v \in V$. Let $\phi : V \to \mathbb{R}^d$ be a uniform random mapping of nodes in $V$ to unit vectors which are the vertices of a $d$-dimensional hypercube, and let

$$\mathrm{sim}(A, B) = \sum_{x \in A \cap B} f(x)$$

where $f : U \to \mathbb{R}$ is any function on intersection elements. Besides the item $x$, the function can use any global parameters such as the cardinalities of $A$, $B$ or $U$. The DotHash sketches for this general set similarity metric are given by:

$$a = \sum_{x \in A} \sqrt{f(x)}\,\phi(x), \qquad b = \sum_{x \in B} \sqrt{f(x)}\,\phi(x),$$

and the estimate is obtained by $a \cdot b$. Observe that the intersection size, the Adamic-Adar index, and the Resource Allocation index all fit into this general framework: the intersection size corresponds to $f(x) = 1$, Adamic-Adar to $f(x) = \frac{1}{\log |\Gamma(x)|}$, and Resource Allocation to $f(x) = \frac{1}{|\Gamma(x)|}$. This group of metrics directly supported by DotHash includes the majority of metrics listed in Martínez et al. [53] and Lü and Zhou [50].
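The following sketch illustrates how such weighted sketches could be built and compared for the Adamic-Adar index on a toy graph. It rests on assumptions that should be read as such: each element's random vector is a hypercube vertex with entries ±1/√d, it is scaled by √f(x) so that matching terms in the dot product contribute f(x), and degree-one nodes receive weight zero because they can never be a common neighbor of two distinct nodes. All function names, the toy edge list, the seed, and the dimension d = 4096 are hypothetical choices for this example.

```python
import math
import random


def random_hypercube_unit_vector(d, rng):
    """Random vertex of a d-dimensional hypercube, scaled to unit length."""
    scale = 1.0 / math.sqrt(d)
    return [scale if rng.random() < 0.5 else -scale for _ in range(d)]


def weighted_sketch(items, weight, phi, d):
    """Sum of sqrt(weight(x)) * phi[x] over the items of a set."""
    acc = [0.0] * d
    for x in items:
        w = math.sqrt(weight(x))
        for i, value in enumerate(phi[x]):
            acc[i] += w * value
    return acc


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy undirected graph; nodes 0 and 1 share the neighbors 3, 4 and 5.
edges = [(0, 2), (0, 3), (0, 4), (0, 5), (1, 3), (1, 4), (1, 5), (1, 6),
         (2, 6), (3, 4), (5, 6)]
neighbors = {}
for u, v in edges:
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)


def adamic_adar_weight(x):
    """f(x) = 1 / log|Gamma(x)|, with weight 0 for degree-one nodes (assumption)."""
    degree = len(neighbors[x])
    return 1.0 / math.log(degree) if degree > 1 else 0.0


d = 4096
rng = random.Random(0)
phi = {x: random_hypercube_unit_vector(d, rng) for x in neighbors}

a = weighted_sketch(neighbors[0], adamic_adar_weight, phi, d)
b = weighted_sketch(neighbors[1], adamic_adar_weight, phi, d)
estimate = dot(a, b)
exact = sum(adamic_adar_weight(x) for x in neighbors[0] & neighbors[1])
print(f"exact={exact:.3f} estimate={estimate:.3f}")
```

Passing a constant weight of 1 to `weighted_sketch` yields an estimate of the plain intersection size, and 1/|Γ(x)| yields the Resource Allocation index; only the weight function changes. The estimate is noisy, with the noise shrinking as d grows.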

Figure 4: Link prediction accuracy results while varying the number of dimensions and hashes.

Figure 5: Normalized average execution time of different methods, relative to MinHash with 8 hashes running on CPU. The average is calculated over all link prediction datasets.
$$\mathrm{idf}(x) = \log \frac{|D|}{|\{d \in D : x \in d\}|}$$

where $|D|$ is the number of documents in the entire corpus and $|\{d \in D : x \in d\}|$ is the number of documents that contain the term $x$.

Table 2: Statistics of the graph datasets.

Table 3: Accuracy and computation time (in seconds) for the detection of duplicate documents.