ABSTRACT
We present a novel Locality-Sensitive Hashing scheme for the Approximate Nearest Neighbor Problem under lp norm, based on p-stable distributions.Our scheme improves the running time of the earlier algorithm for the case of the lp norm. It also yields the first known provably efficient approximate NN algorithm for the case p<1. We also show that the algorithm finds the exact near neigbhor in O(log n) time for data satisfying certain "bounded growth" condition.Unlike earlier schemes, our LSH scheme works directly on points in the Euclidean space without embeddings. Consequently, the resulting query time bound is free of large factors and is simple and easy to implement. Our experiments (on synthetic data sets) show that the our data structure is up to 40 times faster than kd-tree.
References
- C. Aggarwal and D. Keim A. Hinneburg. On the surprising behavior of distance metrics in high dimensional spaces. Proceedings of the International Conference on Database Theory, pages 420--434, 2001.]] Google Scholar
Digital Library
- S. Arya and D. Mount. Ann: Library for approximate nearest neighbor searching. available at http://www.cs.umd.edu/~mount/ANN/.]]Google Scholar
- S. Arya, D.M. Mount, N.S. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for approximate nearest neighbor searching. Proceedings of the Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 573--582, 1994.]] Google Scholar
Digital Library
- K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is nearest neighbor meaningful? Proceedings of the International Conference on Database Theory, pages 217--235, 1999.]] Google Scholar
Digital Library
- J. Buhler. Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics, 17:419--428, 2001.]]Google Scholar
Cross Ref
- J. Buhler. Provably sensitive indexing strategies for biosequence similarity search. Proceedings of the Annual International Conference on Computational Molecular Biology (RECOMB02), 2002.]] Google Scholar
Digital Library
- J. Buhler and M. Tompa. Finding motifs using random projections. Proceedings of the Annual International Conference on Computational Molecular Biology (RECOMB01), 2001.]] Google Scholar
Digital Library
- J. M. Chambers, C. L. Mallows, and B. W. Stuck. A method for simulating stable random variables. J. Amer. Statist. Assoc., 71:340--344, 1976.]]Google Scholar
Cross Ref
- K. Clarkson. Nearest neighbor queries in metric spaces. Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, pages 609--617, 1997.]] Google Scholar
Digital Library
- E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. Ullman, and C. Yang. Finding interesting associations without support prunning. Proceedings of the 16th International Conference on Data Engineering (ICDE), 2000.]] Google Scholar
Digital Library
- G. Cormode, P. Indyk, N. Koudas, and S. Muthukrishnan. Fast mining of massive tabular data via approximate distance computations. Proc. 18th International Conference on Data Engineering (ICDE), 2002.]] Google Scholar
Digital Library
- T. Darrell, P. Indyk, G. Shakhnarovich, and P. Viola. Approximate nearest neighbors methods for learning and vision. NIPS Workshop at http://www.ai.mit.edu/projects/vip/nips03ann, 2003.]]Google Scholar
- B. Georgescu, I. Shimshoni, and P. Meer. Mean shift based clustering in high dimensions: A texture classification example. Proceedings of the 9th International Conference on Computer Vision, 2003.]] Google Scholar
Digital Library
- A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. Proceedings of the 25th International Conference on Very Large Data Bases (VLDB), 1999.]] Google Scholar
Digital Library
- S. Har-Peled. A replacement for voronoi diagrams of near linear size. Proceedings of the Symposium on Foundations of Computer Science, 2001.]] Google Scholar
Digital Library
- T. Haveliwala, A. Gionis, and P. Indyk. Scalable techniques for clustering the web. WebDB Workshop, 2000.]]Google Scholar
- A. Hinneburg, C. C. Aggarwal, and D. A. Keim. What is the nearest neighbor in high dimensional spaces? Proceedings of the International Conference on Very Large Databases (VLDB), pages 506--515, 2000.]] Google Scholar
Digital Library
- P. Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. Proceedings of the Symposium on Foundations of Computer Science, 2000.]] Google Scholar
Digital Library
- P. Indyk and R. Motwani. Approximate nearest neighbor: towards removing the curse of dimensionality. Proceedings of the Symposium on Theory of Computing, 1998.]] Google Scholar
Digital Library
- P. Indyk and N. Thaper. Fast color image retrieval via embeddings. Workshop on Statistical and Computational Theories of Vision (at ICCV), 2003.]]Google Scholar
- D. Karger and M Ruhl. Finding nearest neighbors in growth-restricted metrics. Proceedings of the Symposium on Theory of Computing, 2002.]] Google Scholar
Digital Library
- J. Kleinberg. Two algorithms for nearest-neighbor search in high dimensions. Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, 1997.]] Google Scholar
Digital Library
- R. Krauthgamer and J. R. Lee. Navigating nets: Simple algorithms for proximity search. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2004.]] Google Scholar
Digital Library
- E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search for approximate nearest neighbor in high dimensional spaces. Proceedings of the Thirtieth ACM Symposium on Theory of Computing, pages 614--623, 1998.]] Google Scholar
Digital Library
- J. P. Nolan. An introduction to stable distributions. available at http://www.cas.american.edu/~jpnolan/chap1.ps.]]Google Scholar
- Z. Ouyang, N. Memon, T. Suel, and D. Trendafilov. Cluster-based delta compression of collections of files. Proceedings of the International Conference on Web Information Systems Engineering (WISE), 2002.]] Google Scholar
Digital Library
- N. Shivakumar. Detecting digital copyright violations on the Internet (Ph.D. thesis). Department of Computer Science, Stanford University, 2000.]] Google Scholar
Digital Library
- Roger Weber, Hans J. Schek, and Stephen Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. Proceedings of the 24th Int. Conf. Very Large Data Bases (VLDB), 1998.]] Google Scholar
Digital Library
- C. Yang. Macs: Music audio characteristic sequence indexing for similarity retrieval. Proceedings of the Workshop on Applications of Signal Processing to Audio and Acoustics, 2001.]]Google Scholar
- P.N. Yiannilos. Locally lifting the curse of dimensionality for nearest neighbor search. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2000.]] Google Scholar
Digital Library
- V.M. Zolotarev. One-Dimensional Stable Distributions. Vol. 65 of Translations of Mathematical Monographs, American Mathematical Society, 1986.]]Google Scholar
Index Terms
Locality-sensitive hashing scheme based on p-stable distributions

Piotr Indyk


Comments