DOI: 10.1145/1265530.1265545
Article

Finding near neighbors through cluster pruning

Published: 11 June 2007

ABSTRACT

Finding near(est) neighbors is a classic, difficult problem in data management and retrieval, with applications in text and image search, in finding similar objects, and in matching patterns. Here we study cluster pruning, an extremely simple randomized technique. During preprocessing we randomly choose a subset of data points to be leaders; the remaining data points are partitioned according to which leader is closest. For query processing, we find the leader(s) closest to the query point. We then seek the nearest neighbors for the query point among only the points in the clusters of the closest leader(s). Recursion may be used in both preprocessing and in search. Such schemes seek approximate nearest neighbors that are "almost as good" as the nearest neighbors. How good are these approximations, and how much do they save in computation?

Our contributions are: (1) we quantify metrics that allow us to study the tradeoff between processing and the quality of the approximate nearest neighbors; (2) we give rigorous theoretical analysis of our schemes, under natural generative processes (generalizing Gaussian mixtures) for the data points; (3) we report experiments on both synthetic data from such generative processes and data from a document corpus, confirming that we save orders of magnitude in query processing cost at modest compromises in the quality of retrieved points. In particular, we show that p-spheres, a state-of-the-art solution, is outperformed by our simple scheme whether the data points are stored in main or in external memory.
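The preprocessing and query steps described in the abstract can be sketched in a few lines of Python. This is a minimal, non-recursive illustration and not the authors' implementation: the function names (`build_index`, `query`, `exact`), the Euclidean distance, the `probes` parameter for the number of closest leaders to inspect, and the choice to keep each leader inside its own cluster are all assumptions made for the example.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length point tuples (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_index(points, num_leaders, rng):
    """Preprocessing: sample leaders at random, then attach every point
    to its closest leader. clusters[i] holds the points of leaders[i]."""
    leaders = rng.sample(points, num_leaders)
    clusters = [[] for _ in leaders]
    for p in points:
        nearest = min(range(len(leaders)), key=lambda i: dist(p, leaders[i]))
        clusters[nearest].append(p)
    return leaders, clusters


def query(q, leaders, clusters, k=1, probes=1):
    """Query processing: rank leaders by distance to q, then scan only the
    clusters of the `probes` closest leaders for the k nearest points."""
    order = sorted(range(len(leaders)), key=lambda i: dist(q, leaders[i]))
    candidates = [p for i in order[:probes] for p in clusters[i]]
    return sorted(candidates, key=lambda p: dist(q, p))[:k]


def exact(q, points, k=1):
    """Brute-force baseline, used only to judge the approximation."""
    return sorted(points, key=lambda p: dist(q, p))[:k]


# Hypothetical usage on random 2-D points.
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(1000)]
leaders, clusters = build_index(points, num_leaders=32, rng=rng)
approx = query((0.5, 0.5), leaders, clusters, k=5, probes=2)
```

Raising `probes` scans more clusters, which costs more distance computations but brings the result closer to what `exact` returns; with `probes` equal to the number of leaders the two coincide. That knob is one simple way to observe the cost versus quality tradeoff the abstract refers to.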


References

  1. S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for nearest neighbor searching. In SODA '94.
  2. S. Berchtold, K. Keim, and H.-P. Kriegel. The X-Tree: An index structure for high dimensional data. In VLDB '96.
  3. E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search for approximate nearest neighbor in high dimensional spaces. SIAM Journal on Computing, 30(2):451--474, 2000.
  4. M. Bern. Approximate closest point queries in high dimensions. Information Processing Letters, 45, 1993.
  5. T. Bozkaya and M. Ozsoyoglu. Distance-based indexing for high-dimensional metric spaces. In PODS '97.
  6. K. Clarkson. Nearest neighbor queries in metric spaces. In STOC '97.
  7. R. Motwani and P. Indyk. Approximate nearest neighbor - towards removing the curse of dimensionality. In STOC '98.
  8. V. Pestov. On the geometry of similarity search: dimensionality curse and concentration of measure. Information Processing Letters, to appear.
  9. K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is "nearest neighbor" meaningful? In ICDT '99.
  10. K. P. Bennett, U. Fayyad, and D. Geiger. Density-based indexing for approximate nearest-neighbor queries. In KDD '99.
  11. S. Brin. Near neighbor search in large metric spaces. In The VLDB Journal, 574--584, 1995.
  12. R. Fagin, R. Kumar, and D. Sivakumar. Efficient similarity search and classification via rank aggregation. In SIGMOD '03.
  13. A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In The VLDB Journal, 1999.
  14. H. Edelsbrunner. Algorithms in Combinatorial Geometry. Springer-Verlag, New York, NY, 1987.
  15. L. Ertöz, M. Steinbach, and V. Kumar. Finding topics in collections of documents: A shared nearest neighbor approach. In Text Mine '01.
  16. C. Buckley and A. F. Lewit. Optimization of inverted vector searches. In SIGIR '85, 97--110.
  17. M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000.
  18. P. Willet. Recent trends in hierarchical document clustering: a critical review. In Information Processing and Management, 24(5):577--597, 1988.
  19. D. Comer. The ubiquitous B-tree. In ACM Computing Surveys, 11(2):121--137, 1979.
  20. A. Guttman. R-trees: A dynamic index structure for spatial searching. In SIGMOD '84.
  21. G. Karypis, E.-H. Han, and V. Kumar. Chameleon: Hierarchical clustering using dynamic modeling. IEEE Computer, 32(8):68--75, August 1999.
  22. N. Katayama and S. Satoh. The SR-tree: An index structure for high-dimensional nearest neighbor queries. In SIGMOD '97.
  23. A. N. Papadopoulos, Y. Manolopoulos, A. Nanopoulos, and Y. Theodoridis. R-trees have grown everywhere. Technical report available at http://www.rtreeportal.org/, 2003.
  24. F. J. MacWilliams and N. J. A. Sloane. The Theory of Error Correcting Codes. North-Holland, Amsterdam, 1977.
  25. J. Goldstein and R. Ramakrishnan. Contrast plots and P-Sphere trees: Space vs. time in nearest neighbour searches. In VLDB '00.

Published in

PODS '07: Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
June 2007
328 pages
ISBN: 9781595936851
DOI: 10.1145/1265530

Copyright © 2007 ACM

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 476 of 1,835 submissions, 26%
