ABSTRACT
Finding near(est) neighbors is a classic, difficult problem in data management and retrieval, with applications in text and image search, in finding similar objects, and in matching patterns. Here we study cluster pruning, an extremely simple randomized technique. During preprocessing we randomly choose a subset of data points to be leaders; the remaining data points are partitioned by which leader is the closest. For query processing, we find the leader(s) closest to the query point. We then seek the nearest neighbors for the query point among only the points in the clusters of the closest leader(s). Recursion may be used in both preprocessing and in search. Such schemes seek approximate nearest neighbors that are "almost as good" as the nearest neighbors. How good are these approximations, and how much do they save in computation?
Our contributions are: (1) we quantify metrics that allow us to study the tradeoff between processing and the quality of the approximate nearest neighbors; (2) we give rigorous theoretical analysis of our schemes, under natural generative processes (generalizing Gaussian mixtures) for the data points; (3) we report experiments on synthetic data from such generative processes, as well as on data from a document corpus, confirming that we save orders of magnitude in query processing cost at modest compromises in the quality of retrieved points. In particular, we show that p-spheres, a state-of-the-art solution, is outperformed by our simple scheme whether the data points are stored in main or in external memory.
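The preprocessing and query steps described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`build_index`, `query`), the parameters (`num_leaders`, `num_probes`, `k`), and the use of Euclidean distance are all hypothetical choices made for the example, and the recursive variant is not shown.

```python
import math
import random


def euclidean(a, b):
    # Assumed distance function; the abstract does not fix a particular metric.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_index(points, num_leaders, rng):
    """Preprocessing: pick random leaders, then assign each point to its closest leader.

    Returns a dict mapping leader index -> list of member point indices.
    """
    leader_ids = rng.sample(range(len(points)), num_leaders)
    clusters = {lid: [] for lid in leader_ids}
    for i, p in enumerate(points):
        closest = min(leader_ids, key=lambda lid: euclidean(p, points[lid]))
        clusters[closest].append(i)
    return clusters


def query(points, clusters, q, k=1, num_probes=1):
    """Query: scan only the clusters of the num_probes leaders closest to q.

    Returns the indices of the k closest candidates, nearest first.
    """
    nearest_leaders = sorted(
        clusters, key=lambda lid: euclidean(q, points[lid])
    )[:num_probes]
    candidates = [i for lid in nearest_leaders for i in clusters[lid]]
    return sorted(candidates, key=lambda i: euclidean(q, points[i]))[:k]


if __name__ == "__main__":
    rng = random.Random(42)
    data = [(rng.random(), rng.random()) for _ in range(1000)]
    # The number of leaders (30) is an arbitrary choice for this demo.
    index = build_index(data, num_leaders=30, rng=rng)
    print(query(data, index, (0.5, 0.5), k=3, num_probes=2))
```

Raising `num_probes` scans more clusters per query, which costs more distance computations but makes it more likely that the true nearest neighbors are among the candidates; with `num_probes` equal to the number of leaders the search degenerates to an exact linear scan.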
Index Terms
Finding near neighbors through cluster pruning