DOI: 10.1145/2213556.2213586
research-article

The wavelet trie: maintaining an indexed sequence of strings in compressed space

Published: 21 May 2012

ABSTRACT

An indexed sequence of strings is a data structure for storing a string sequence that supports random access, searching, range counting and analytics operations, both for exact matches and prefix search. String sequences lie at the core of column-oriented databases, log processing, and other storage and query tasks. In these applications each string can appear several times and the order of the strings in the sequence is relevant. The prefix structure of the strings is relevant as well: common prefixes are sought in strings to extract interesting features from the sequence. Moreover, space-efficiency is highly desirable as it translates directly into higher performance, since more data can fit in fast memory.

We introduce and study the problem of compressed indexed sequence of strings, representing indexed sequences of strings in nearly-optimal compressed space, both in the static and dynamic settings, while preserving provably good performance for the supported operations.

We present a new data structure for this problem, the Wavelet Trie, which combines the classical Patricia Trie with the Wavelet Tree, a succinct data structure for storing a compressed sequence. The resulting Wavelet Trie smoothly adapts to a sequence of strings that changes over time. It improves on the state-of-the-art compressed data structures by supporting a dynamic alphabet (i.e. the set of distinct strings) and prefix queries, both crucial requirements in the aforementioned applications, and on traditional indexes by reducing space occupancy to close to the entropy of the sequence.
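
To make the combination of a Patricia trie and a wavelet tree concrete, the following is a minimal static sketch in Python. It is not the construction from the paper, whose details the abstract does not give: it assumes each string is encoded as the bits of its UTF-8 bytes plus a NUL terminator (so that no encoded string is a prefix of another, and the input strings must not contain NUL), stores each node's bitvector as a plain uncompressed list, and answers rank by linear scanning, so it offers neither the compressed space nor the time bounds claimed above. The names `WaveletTrie`, `Node`, `to_bits`, `from_bits`, `access`, `rank`, and `rank_prefix` are hypothetical and chosen for illustration only.

```python
from os.path import commonprefix  # character-wise longest common prefix


def to_bits(s, terminate=True):
    """Encode s as a bit string; the NUL terminator keeps the set prefix-free."""
    data = s.encode("utf-8") + (b"\x00" if terminate else b"")
    return "".join(f"{byte:08b}" for byte in data)


def from_bits(bits):
    """Inverse of to_bits for terminated strings."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data[:-1].decode("utf-8")


class Node:
    def __init__(self, seqs):
        # Patricia-style label: the longest prefix shared by every string here.
        self.alpha = commonprefix(seqs)
        k = len(self.alpha)
        if all(len(s) == k for s in seqs):
            self.bits = None  # leaf: all strings in this subsequence are equal
            self.children = None
        else:
            # Wavelet-tree-style bitvector: the branching bit of each string,
            # kept in sequence order.
            self.bits = [int(s[k]) for s in seqs]
            self.children = tuple(
                Node([s[k + 1:] for s in seqs if s[k] == bit]) for bit in "01"
            )

    def access(self, i):
        if self.bits is None:
            return self.alpha
        b = self.bits[i]
        j = self.bits[:i].count(b)  # rank of bit b before position i
        return self.alpha + str(b) + self.children[b].access(j)

    def rank(self, s, pos):
        if not s.startswith(self.alpha):
            return 0
        rest = s[len(self.alpha):]
        if self.bits is None:
            return pos if rest == "" else 0
        if rest == "":
            return 0
        b = int(rest[0])
        return self.children[b].rank(rest[1:], self.bits[:pos].count(b))

    def rank_prefix(self, p, pos):
        if len(p) <= len(self.alpha):
            # Every string below this node starts with alpha, hence with p.
            return pos if self.alpha.startswith(p) else 0
        if self.bits is None or not p.startswith(self.alpha):
            return 0
        k = len(self.alpha)
        b = int(p[k])
        return self.children[b].rank_prefix(p[k + 1:], self.bits[:pos].count(b))


class WaveletTrie:
    def __init__(self, strings):
        self.root = Node([to_bits(s) for s in strings])

    def access(self, i):
        """Return the string at position i (random access)."""
        return from_bits(self.root.access(i))

    def rank(self, s, pos):
        """Count exact occurrences of s among the first pos strings."""
        return self.root.rank(to_bits(s), pos)

    def rank_prefix(self, p, pos):
        """Count strings with prefix p among the first pos strings."""
        return self.root.rank_prefix(to_bits(p, terminate=False), pos)
```

With this sketch, range counting over positions `[l, r)` follows from two rank calls, for example `wt.rank_prefix(p, r) - wt.rank_prefix(p, l)`. Each node holds the shared prefix of its strings, as in a Patricia trie, and a bitvector recording which branch each string of the subsequence takes, as in a wavelet tree, which is why both exact-match and prefix queries reduce to a single root-to-node walk.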

Published in

PODS '12: Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems
May 2012, 332 pages
ISBN: 9781450312486
DOI: 10.1145/2213556

Copyright © 2012 ACM

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 476 of 1,835 submissions, 26%
