ABSTRACT
We study the problem of estimating the number of occurrences of substrings in textual data: a text T over an alphabet Σ of size σ is preprocessed and an index I is built. The index is used in lieu of the text to answer queries of the form Count≈(P), which return an approximate number of occurrences of an arbitrary pattern P as a substring of T. The main application is selectivity estimation for the LIKE predicate in textual databases [15, 14, 5]. Our focus is on an algorithmic solution with guaranteed error rates and a small footprint. To that end, we first extend previous work on compressed text indexing [8, 11, 6, 17], providing an optimal data structure that requires Θ(|T| log σ / l) bits, where l ≥ 1 is the additive error on any answer. We also address the issue of guaranteeing exact answers for sufficiently frequent patterns, providing a data structure whose size scales with the number of such patterns. Our theoretical findings are supported by experiments showing the practical impact of our data structures.
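To make the query semantics concrete, the following is a minimal sketch of an estimator with a bounded additive error. It is not the compressed data structure of the paper: it is a naive dictionary that keeps exact counts only for substrings occurring at least `threshold` times and answers 0 for everything else. The names `exact_count`, `build_pruned_index`, `approx_count`, and `threshold` are hypothetical and chosen for illustration only.

```python
from collections import Counter


def exact_count(text: str, pattern: str) -> int:
    """Number of (possibly overlapping) occurrences of pattern in text."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    last_start = len(text) - len(pattern)
    return sum(text.startswith(pattern, i) for i in range(last_start + 1))


def build_pruned_index(text: str, threshold: int) -> dict[str, int]:
    """Toy index: exact counts for substrings occurring >= threshold times.

    Enumerates all substrings, so it takes quadratic time and space.
    This is an illustrative assumption, not the paper's structure.
    """
    counts = Counter(
        text[i:j]
        for i in range(len(text))
        for j in range(i + 1, len(text) + 1)
    )
    return {s: c for s, c in counts.items() if c >= threshold}


def approx_count(index: dict[str, int], pattern: str) -> int:
    """Exact for frequent patterns; otherwise 0, off by at most threshold - 1."""
    return index.get(pattern, 0)


text = "abracadabra"
index = build_pruned_index(text, threshold=2)
frequent = approx_count(index, "abra")  # stored, so the answer is exact
rare = approx_count(index, "cad")       # pruned, so the answer is 0
```

In this toy version a pruned pattern is underestimated by at most `threshold - 1`, while patterns at or above the threshold are answered exactly, which mirrors the two kinds of guarantee the abstract describes without reproducing its space bounds.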
- J. Barbay, T. Gagie, G. Navarro, and Y. Nekrich. Alphabet partitioning for compressed rank/select and applications. In ISAAC (2), pages 315--326, 2010.
- J. Barbay, M. He, J.I. Munro, and S. Srinivasa Rao. Succinct indexes for string, binary relations and multi-labeled trees. In Proc. ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 680--689, 2007.
- D. Belazzougui, P. Boldi, R. Pagh, and S. Vigna. Fast prefix search in little space, with applications. In ESA (1), pages 427--438, 2010.
- M. Burrows and D. Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
- S. Chaudhuri, V. Ganti, and L. Gravano. Selectivity estimation for string predicates: Overcoming the underestimation problem. In Proceedings of the 20th International Conference on Data Engineering, ICDE '04, pages 227-, 2004.
- P. Ferragina, R. González, G. Navarro, and R. Venturini. Compressed text indexes: From theory to practice. ACM Journal of Experimental Algorithmics, 13, 2008.
- P. Ferragina and R. Grossi. The string B-tree: a new data structure for string search in external memory and its applications. J. ACM, 46:236--280, March 1999.
- P. Ferragina and G. Manzini. Indexing compressed text. Journal of the ACM, 52(4):552--581, 2005.
- P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro. Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms, 3(2), 2007.
- P. Ferragina and R. Venturini. Compressed permuterm index. In SIGIR, pages 535--542, 2007.
- R. Grossi and J. S. Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing, 35(2):378--407, 2005.
- D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
- T. Hagerup and T. Tholey. Efficient minimal perfect hashing in nearly minimal space. In Proceedings of the 18th Annual Symposium on Theoretical Aspects of Computer Science (STACS), pages 317--326, 2001.
- H.V. Jagadish, R. T. Ng, and D. Srivastava. Substring selectivity estimation. In Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, PODS '99, pages 249--260, 1999.
- P. Krishnan, J. S. Vitter, and B. R. Iyer. Estimating alphanumeric selectivity in the presence of wildcards. In SIGMOD Conference, pages 282--293, 1996.
- G. Manzini. An analysis of the Burrows-Wheeler transform. Journal of the ACM, 48(3):407--430, 2001.
- G. Navarro and V. Mäkinen. Compressed full text indexes. ACM Computing Surveys, 39(1), 2007.
- D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. In ALENEX, 2007.
- I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, 1999.