ABSTRACT
We propose sketching algorithms for computing similarities between hierarchical data objects. Specifically, we consider data objects represented as leaf-labeled trees, in which a set of elements at the leaves is organized into a hierarchy. Such representations are richer alternatives to a flat set. For example, a document can be represented as a hierarchy of sets in which chapters, sections, and paragraphs form different levels, which captures more than viewing the document simply as a set of words. We measure the distance between two trees using the best possible superimposition, that is, the one that minimizes the number of mismatched leaf labels. This distance measure is equivalent to an Earth Mover's Distance: leaf-labeled trees of height one can be viewed as sets, and the measure extends recursively to taller trees by viewing them as sets of sets. We compute sketches of arbitrary weighted trees and analyze them in the context of locality-sensitive hashing (LSH), where the probability that two sketches match is high when the two trees are similar and low when they are far apart under the given distance measure. Specifically, we compute sketches of such trees by propagating min-hash computations up the tree. Furthermore, we show that propagating one min-hash results in poor sketch properties, while propagating two min-hashes results in good sketches.
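As a rough illustration of what propagating min-hashes up a tree can look like, the Python sketch below treats a tree as either a leaf label (a string) or a list of subtrees. At each internal node it hashes every child's sketch to a token and keeps the k smallest tokens. This is not the paper's construction, which the abstract does not spell out: the tree encoding, the helper names `h`, `sketch`, and `match_rate`, the parameter `k`, the use of BLAKE2b with integer seeds, and the omission of weights are all assumptions made for illustration.

```python
import hashlib


def h(value: str, seed: int) -> int:
    """Seeded 64-bit hash; each seed stands in for one random hash function."""
    data = f"{seed}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def sketch(tree, seed: int, k: int = 2) -> tuple:
    """Return the sketch of a tree under one seeded hash function.

    A tree is a leaf label (str) or a list of subtrees (assumed encoding).
    k is the number of minimum hash values passed up from each node.
    """
    if isinstance(tree, str):
        return (h(tree, seed),)
    # Summarize each child by hashing its own sketch into a single token.
    tokens = {h(repr(sketch(child, seed, k)), seed) for child in tree}
    # Keep the k smallest tokens; a set is used, so child order is ignored.
    return tuple(sorted(tokens)[:k])


def match_rate(a, b, num_seeds: int = 200, k: int = 2) -> float:
    """Fraction of seeds under which the two root sketches are identical."""
    hits = sum(sketch(a, s, k) == sketch(b, s, k) for s in range(num_seeds))
    return hits / num_seeds


doc_a = [["tree", "hash"], ["sketch", "leaf"]]
doc_b = [["leaf", "sketch"], ["hash", "tree"]]   # same content, reordered
doc_c = [["apple", "pear"], ["plum", "fig"]]     # no labels in common

print(match_rate(doc_a, doc_b))
print(match_rate(doc_a, doc_c))
print(match_rate(doc_a, doc_b, k=1))
```

For a tree of height one with `k=1`, this reduces to an ordinary min-hash over the set of leaf labels. Setting `k=2` passes two minima up from every node, which is the kind of variation the abstract contrasts; the example makes no claim about how the resulting match probabilities relate to the distance measure.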
Index Terms
The power of two min-hashes for similarity search among hierarchical data objects