skip to main content
10.1145/1376916.1376946acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

The power of two min-hashes for similarity search among hierarchical data objects

Published:09 June 2008Publication History

ABSTRACT

In this study we propose sketching algorithms for computing similarities between hierarchical data. Specifically, we look at data objects that are represented using leaf-labeled trees denoting a set of elements at the leaves organized in a hierarchy. Such representations are richer alternatives to a set. For example, a document can be represented as a hierarchy of sets wherein chapters, sections, and paragraphs represent different levels in the hierarchy. Such a representation is richer than viewing the document simply as a set of words. We measure distance between trees using the best possible super-imposition that minimizes the number of mismatched leaf labels. Our distance measure is equivalent to an Earth Mover's Distance measure since the leaf-labeled trees of height one can be viewed as sets and can be recursively extended to trees of larger height by viewing them as set of sets. We compute sketches of arbitrary weighted trees and analyze them in the context of locality-sensitive hashing (LSH) where the probability of two sketches matching is high when two trees are similar and low when the two trees are far under the given distance measure. Specifically, we compute sketches of such trees by propagating min-hash computations up the tree. Furthermore, we show that propagating one min-hash results in poor sketch properties while propagating two min-hashes results in good sketches.

References

  1. N. Augsten, M. Böhlen, C. Dyreson, and J. Gamper. Approximate joins for data-centric XML. In Proceedings of the International Conference on Data Engineering (ICDE), Cancún, Mexico, April 2008. To appear. Google ScholarGoogle Scholar
  2. N. Augsten, M. Böhlen, and J. Gamper. Approximate matching of hierarchical data using pq-grams. In Proc. of the 31st VLDB Conference, pages 301--312, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Andrei Broder. On the resemblance and containment of documents. In Proceedings of Compression and Complexity of SEQUENCES SEQS: Sequences '91, 1998. Google ScholarGoogle Scholar
  4. Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise independent permutations. Journal of Computer and System Sciences, 60(3):630--659, 2000. Google ScholarGoogle Scholar
  5. Moses Charikar. Similarity estimation techniques from rounding algorithms. In Proc. 34th Annual ACM Symposium on Theory of Computing, pages 380--388, 2002. Google ScholarGoogle Scholar
  6. S. S. Chawathe, A. Rajaraman, H. Garcia-Molina, and J. Widom. Change detection in hierarchically structured information. In Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, pages 493--504, 1996. Google ScholarGoogle Scholar
  7. W. Chen. New algorithms for ordered tree-to-tree correction problem. Journal of Algorithms, 40(2):135--158, August 2001. Google ScholarGoogle Scholar
  8. Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proc. of the 20th ACM Symposium on Computational Geometry, pages 253--262, 2004. Google ScholarGoogle Scholar
  9. M. Garofalakis and A. Kumar. Xml stream processing using tree-edit distance embeddings. ACM Transactions on Database Systems, 30(1):279--332, 2005. Google ScholarGoogle Scholar
  10. A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proc. of 25th International Conference on Very Large Data Bases, VLDB, pages 518--529, 1999. Google ScholarGoogle Scholar
  11. S. Gollapudi and R. Panigrahy. Exploiting asymmetry in hierarchical topic extraction. In Proc. of 13th Conference on Information and Knowledge Management, 2006. Google ScholarGoogle Scholar
  12. Taher H. Haveliwala, Aristides Gionis, and Piotr Indyk. Scalable techniques for clustering the web. In WebDB (Informal Proceedings), pages 129--134, 2000.Google ScholarGoogle Scholar
  13. Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proc. of 30th ACM Symposium on Theory of Computing (STOC), pages 604--613, 1998. Google ScholarGoogle Scholar
  14. T. Jiang, L. Wang, and K. Zhang. Alignment of trees - an alternative to tree edit. In Proc. Intl Conference on Combinatorial Pattern Matching, pages 75--86, 1994. Google ScholarGoogle Scholar
  15. K. Kailing, H-P. Kriegel, S. Schonauer, and T. Seidl. Efficient similarity search for hierarchical data in large databases. In Proc. 9th Intl Conference on Extending Database Technology, pages 676--693, 2004.Google ScholarGoogle Scholar
  16. T. Margush and F. R. McMorris. Consensus n-trees. Bulletin of Mathematical Biology, 3:239--244, 1981.Google ScholarGoogle Scholar
  17. Rafail Ostrovsky and Yuval Rabani. Low distortion embeddings for edit distance. In Proceedings of the 37th Annual ACM Symposium on Theory of Computing, pages 218--224, 2005. Google ScholarGoogle Scholar
  18. Rina Panigrahy. Entropy based nearest neighbor search in high dimensions. In Proc. of 17th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2006, pages 1186--1195, 2006. Google ScholarGoogle Scholar
  19. K. Zhang. A constrained editing distance between unordered labeled trees. Algorithmica, 15:205--222, 1996.Google ScholarGoogle Scholar
  20. K. Zhang and D. Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245--1262, 1989. Google ScholarGoogle Scholar
  21. Li Zhang. On matching nodes between trees. Technical Report HPL-2003-67, HP Laboratories, Palo Alto, CA, April 2003.Google ScholarGoogle Scholar

Index Terms

  1. The power of two min-hashes for similarity search among hierarchical data objects

            Recommendations

            Comments

            Login options

            Check if you have access through your login credentials or your institution to get full access on this article.

            Sign in
            • Published in

              cover image ACM Conferences
              PODS '08: Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
              June 2008
              330 pages
              ISBN:9781605581521
              DOI:10.1145/1376916

              Copyright © 2008 ACM

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 9 June 2008

              Permissions

              Request permissions about this article.

              Request Permissions

              Check for updates

              Qualifiers

              • research-article

              Acceptance Rates

              Overall Acceptance Rate476of1,835submissions,26%

            PDF Format

            View or Download as a PDF file.

            PDF

            eReader

            View online with eReader.

            eReader
            About Cookies On This Site

            We use cookies to ensure that we give you the best experience on our website.

            Learn more

            Got it!