ABSTRACT
(MATH) A locality sensitive hashing scheme is a distribution on a family $\F$ of hash functions operating on a collection of objects, such that for two objects x,y, PrhεF[h(x) = h(y)] = sim(x,y), where sim(x,y) ε [0,1] is some similarity function defined on the collection of objects. Such a scheme leads to a compact representation of objects so that similarity of objects can be estimated from their compact sketches, and also leads to efficient algorithms for approximate nearest neighbor search and clustering. Min-wise independent permutations provide an elegant construction of such a locality sensitive hashing scheme for a collection of subsets with the set similarity measure sim(A,B) = \frac{|A ∩ B|}{|A ∪ B|}.(MATH) We show that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects. Based on this insight, we construct new locality sensitive hashing schemes for:
A collection of vectors with the distance between → \over u and → \over v measured by Ø(→ \over u, → \over v)/π, where Ø(→ \over u, → \over v) is the angle between → \over u) and → \over v). This yields a sketching scheme for estimating the cosine similarity measure between two vectors, as well as a simple alternative to minwise independent permutations for estimating set similarity.
A collection of distributions on n points in a metric space, with distance between distributions measured by the Earth Mover Distance (EMD), (a popular distance measure in graphics and vision). Our hash functions map distributions to points in the metric space such that, for distributions P and Q, EMD(P,Q) ≤ Ehε\F [d(h(P),h(Q))] ≤ O(log n log log n). EMD(P, Q).
- N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy. Tracking Join and Self-Join Sizes in Limited Storage. Proc. 18th PODS pp. 10-20, 1999. Google Scholar
Digital Library
- N. Alon, Y. Matias, and M. Szegedy. The Space Complexity of Approximating the Frequency Moments. JCSS 58(1): 137-147, 1999. Google Scholar
Digital Library
- Y. Bartal. Probabilistic approximation of metric spaces and its algorithmic application. Proc. 37th FOCS, pages 184--193, 1996. Google Scholar
Digital Library
- Y. Bartal. On approximating arbitrary metrics by tree metrics. In Proc. 30th STOC, pages 161--168, 1998. Google Scholar
Digital Library
- A. Z. Broder. On the resemblance and containment of documents. Proc. Compression and Complexity of SEQUENCES, pp. 21--29. IEEE Computer Society, 1997. Google Scholar
Digital Library
- A. Z. Broder. Filtering near-duplicate documents. Proc. FUN 98, 1998.Google Scholar
- A. Z. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Min-wise independent permutations. Proc. 30th STOC, pp. 327--336, 1998. Google Scholar
Digital Library
- A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the Web. Proc. 6th Int'l World Wide Web Conference, pp. 391--404, 1997. Google Scholar
Digital Library
- A. Z. Broder, R. Krauthgamer, and M. Mitzenmacher. Improved classification via connectivity information. Proc. 11th SODA, pp. 576--585, 2000. Google Scholar
Digital Library
- G. Calinescu, H. J. Karloff, and Y. Rabani. Approximation algorithms for the 0-extension problem. Proc. 11th SODA, pp. 8--16, 2000. Google Scholar
Digital Library
- C. Chekuri, S. Khanna, J. Naor, and L. Zosin. Approximation algorithms for the metric labeling problem via a new linear programming formulation. Proc. 12th SODA, pp. 109--118, 2001. Google Scholar
Digital Library
- Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S. Muthukrishnan, R. T. Ng, and D. Srivastava. Counting Twig Matches in a Tree. Proc. 17th ICDE pp. 595--604. 2001. Google Scholar
Digital Library
- Z. Chen, F. Korn, N. Koudas, and S. Muthukrishnan. Selectivity Estimation for Boolean Queries. Proc. 19th PODS, pp. 216--225, 2000. Google Scholar
Digital Library
- E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. D. Ullman, and C. Yang. Finding Interesting Associations without Support Pruning. Proc. 16th ICDE pp. 489--499, 2000. Google Scholar
Digital Library
- S. Cohen and L. Guibas. The Earth Mover's Distance under Transformation Sets. Proc. 7th IEEE Intnl. Conf. Computer Vision, 1999. Google Scholar
Digital Library
- S. Cohen and L. Guibas. The Earth Mover's Distance: Lower Bounds and Invariance under Translation. Tech. report STAN-CS-TR-97-1597, Dept. of Computer Science, Stanford University, 1997. Google Scholar
Digital Library
- L. Engebretsen, P. Indyk and R. O'Donnell. Derandomized dimensionality reduction with applications. To appear in Proc. 13th SODA, 2002. Google Scholar
Digital Library
- P. B. Gibbons and Y. Matias. Synopsis Data Structures for Massive Data Sets. Proc. 10th SODA pp. 909--910, 1999. Google Scholar
Digital Library
- A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries. Proc. 27th VLDB pp. 79--88, 2001. Google Scholar
Digital Library
- A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. QuickSAND: Quick Summary and Analysis of Network Data. DIMACS Technical Report 2001-43, November 2001.Google Scholar
- A. C. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. Fast, Small-Space Algorithms for Approximate Histogram Maintenance. these proceedings. Google Scholar
Digital Library
- A. Gionis, D. Gunopulos, and N. Koudas. Efficient and Tunable Similar Set Retrieval. Proc. SIGMOD Conference 2001. Google Scholar
Digital Library
- A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. Proc. 25th VLDB} pp. 518--529, 1999. Google Scholar
Digital Library
- M. X. Goemans and D. P. Williamson. Improved Approximation Algorithms for Maximum Cut and Satisfiability Problems Using Semidefinite Programming. JACM 42(6): 1115--1145, 1995. Google Scholar
Digital Library
- S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. Proc. 41st FOCS, pp. 359--366, 2000. Google Scholar
Digital Library
- A. Gupta and Éva Tardos. A constant factor approximation algorithm for a class of classification problems. Proc. 32nd STOC, pp. 652--658, 2000. Google Scholar
Digital Library
- T. H. Haveliwala, A. Gionis, and P. Indyk. Scalable Techniques for Clustering the Web. Proc. 3rd WebDB, pp. 129--134, 2000.Google Scholar
- P. Indyk. A small approximately min-wise independent family of hash functions. Proc. 10th SODA, pp. 454--456, 1999. Google Scholar
Digital Library
- P. Indyk. On approximate nearest neighbors in non-Euclidean spaces. Proc. 40th FOCS, pp. 148--155, 1999. Google Scholar
Digital Library
- P. Indyk. Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation. Proc. 41st FOCS, 189-197, 2000. Google Scholar
Digital Library
- Indyk, P., Motwani, R. Approximate nearest neighbors: towards removing the curse of dimensionality. Proc. 30th STOC pp. 604--613, 1998. Google Scholar
Digital Library
- P. Indyk, R. Motwani, P. Raghavan, and S. Vempala. Locality-Preserving Hashing in Multidimensional Spaces. Proc. 29th STOC, pp. 618--625, 1997. Google Scholar
Digital Library
- J. M. Kleinberg and Éva Tardos Approximation Algorithms for Classification Problems with Pairwise Relationships: Metric Labeling and Markov Random Fields. Proc. 40th FOCS, pp. 14--23, 1999. Google Scholar
Digital Library
- N. Linial and O. Sasson. Non-Expansive Hashing. Combinatorica 18(1): 121--132, 1998.Google Scholar
Cross Ref
- N. Nisan. Pseudorandom sequences for space bounded computations. Combinatorica, 12:449--461, 1992.Google Scholar
Cross Ref
- Y. Rubner. Perceptual Metrics for Image Database Navigation. Phd Thesis, Stanford University, May 1999. Google Scholar
Digital Library
- Y. Rubner, L. J. Guibas, and C. Tomasi. The Earth Mover's Distance, Multi-Dimensional Scaling, and Color-Based Image Retrieval. Proc. of the ARPA Image Understanding Workshop, pp. 661--668, 1997.Google Scholar
- Y. Rubner, C. Tomasi. Texture Metrics. Proc. IEEE International Conference on Systems, Man, and Cybernetics, 1998, pp. 4601--4607.Google Scholar
- Y. Rubner, C. Tomasi, and L. J. Guibas. A Metric for Distributions with Applications to Image Databases. Proc. IEEE Int. Conf. on Computer Vision, pp. 59--66, 1998. Google Scholar
Digital Library
- Y. Rubner, C. Tomasi, and L. J. Guibas. The Earth Mover's Distance as a Metric for Image Retrieval. Tech. Report STAN-CS-TN-98-86, Dept. of Computer Science, Stanford University, 1998. Google Scholar
Digital Library
- M. Ruzon and C. Tomasi. Color edge detection with the compass operator. Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2:160--166, 1999.Google Scholar
Cross Ref
- M. Ruzon and C. Tomasi. Corner detection in textured color images. Proc. IEEE Int. Conf. Computer Vision, 2:1039--1045, 1999. Google Scholar
Digital Library
Index Terms
Similarity estimation techniques from rounding algorithms
Recommendations
A New Normalized Similarity for Discriminating Similar Documents
To find out similar document pairs from a set of documents, computing normalization similarities is inevitable because the sizes of documents are different from documents to documents. However, the normalized similarities proposed up to now are still ...
Problems and Techniques
The Problems and Techniques section in this issue contains three papers on very different topics: achieving reliability through optimization, improving the solution of systems of linear equations by preconditioning, and determining asymptotic ...
Distance Learning for Similarity Estimation
In this paper, we present a general guideline to find a better distance measure for similarity estimation based on statistical analysis of distribution models and distance functions. A new set of distance measures are derived from the harmonic distance, ...





Comments