ABSTRACT
An indexed sequence of strings is a data structure for storing a string sequence that supports random access, searching, range counting and analytics operations, both for exact matches and prefix search. String sequences lie at the core of column-oriented databases, log processing, and other storage and query tasks. In these applications each string can appear several times and the order of the strings in the sequence is relevant. The prefix structure of the strings is relevant as well: common prefixes are sought in strings to extract interesting features from the sequence. Moreover, space-efficiency is highly desirable as it translates directly into higher performance, since more data can fit in fast memory.
We introduce and study the problem of compressed indexed sequences of strings: representing indexed sequences of strings in nearly-optimal compressed space, in both the static and dynamic settings, while preserving provably good performance for the supported operations.
We present a new data structure for this problem, the Wavelet Trie, which combines the classical Patricia Trie with the Wavelet Tree, a succinct data structure for storing a compressed sequence. The resulting Wavelet Trie smoothly adapts to a sequence of strings that changes over time. It improves on state-of-the-art compressed data structures by supporting a dynamic alphabet (i.e., the set of distinct strings) and prefix queries, both crucial requirements in the aforementioned applications, and on traditional indexes by reducing space occupancy to close to the entropy of the sequence.
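The combination described above can be illustrated with a minimal static sketch. It is not the paper's implementation. The names `encode`, `decode`, `Node`, `WaveletTrie`, `rank` and `rank_prefix` are hypothetical, and the bitvectors are plain Python lists with prefix sums rather than compressed ones, so the sketch shows only the shape of the structure and its queries, not the space bounds or the dynamic operations. It assumes each string is mapped to a prefix-free bit string (a continuation bit before every byte, a final 0 as terminator). Each node stores the common prefix of its subsequence, as in a Patricia Trie, plus one branching bit per string, as in a Wavelet Tree.

```python
from os.path import commonprefix  # character-wise longest common prefix


def encode(text, terminated=True):
    """Assumed encoding: '1' + 8 bits per byte; a final '0' makes it prefix-free."""
    bits = "".join("1" + format(byte, "08b") for byte in text.encode("utf-8"))
    return bits + "0" if terminated else bits


def decode(bits):
    """Inverse of encode() for terminated bit strings."""
    data = bytes(int(bits[i + 1:i + 9], 2) for i in range(0, len(bits) - 1, 9))
    return data.decode("utf-8")


class Node:
    def __init__(self, seq):
        self.alpha = commonprefix(seq)            # Patricia-style node label
        self.leaf = all(s == self.alpha for s in seq)
        if self.leaf:
            return
        skip = len(self.alpha)
        self.bits = [int(s[skip]) for s in seq]   # branching bit of each string
        self.ones = [0]                           # ones[i] = number of 1s in bits[:i]
        for b in self.bits:
            self.ones.append(self.ones[-1] + b)
        # Each child keeps, in order, the suffixes that follow its branching bit.
        self.child = [Node([s[skip + 1:] for s in seq if s[skip] == bit])
                      for bit in "01"]

    def rank_bit(self, bit, pos):
        """Number of occurrences of `bit` in bits[:pos]."""
        return self.ones[pos] if bit else pos - self.ones[pos]


class WaveletTrie:
    def __init__(self, strings):
        self.n = len(strings)
        self.root = Node([encode(s) for s in strings])

    def access(self, pos):
        """Return the string stored at position pos."""
        node, out = self.root, []
        while True:
            out.append(node.alpha)
            if node.leaf:
                return decode("".join(out))
            bit = node.bits[pos]
            out.append(str(bit))
            pos = node.rank_bit(bit, pos)
            node = node.child[bit]

    def _count(self, bits, pos, prefix):
        node = self.root
        while True:
            if prefix and node.alpha.startswith(bits):
                return pos                        # whole subsequence has the prefix
            if node.leaf:
                return pos if not prefix and bits == node.alpha else 0
            skip = len(node.alpha)
            if len(bits) <= skip or not bits.startswith(node.alpha):
                return 0
            bit = int(bits[skip])
            pos = node.rank_bit(bit, pos)
            node, bits = node.child[bit], bits[skip + 1:]

    def rank(self, text, pos):
        """Occurrences of text among the first pos strings."""
        return self._count(encode(text), pos, prefix=False)

    def rank_prefix(self, prefix, pos):
        """Strings starting with prefix among the first pos strings."""
        return self._count(encode(prefix, terminated=False), pos, prefix=True)
```

With these assumed names, a range count over positions lo to hi (hi excluded) is `rank_prefix(p, hi) - rank_prefix(p, lo)`, and the same subtraction with `rank` gives exact-match counts.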
The wavelet trie: maintaining an indexed sequence of strings in compressed space