DOI: 10.1145/1989284.1989300
research-article

Space-efficient substring occurrence estimation

Published: 13 June 2011

ABSTRACT

We study the problem of estimating the number of occurrences of substrings in textual data: a text T on some alphabet Σ of size σ is preprocessed and an index I is built. The index is used in lieu of the text to answer queries of the form Count≈(P), returning an approximated number of the occurrences of an arbitrary pattern P as a substring of T. The problem has its main application in selectivity estimation related to the LIKE predicate in textual databases [15, 14, 5]. Our focus is on obtaining an algorithmic solution with guaranteed error rates and small footprint. To achieve that, we first enrich previous work in the area of compressed text-indexing [8, 11, 6, 17] providing an optimal data structure that requires Θ(|T| log σ / l) bits where l ≥ 1 is the additive error on any answer. We also approach the issue of guaranteeing exact answers for sufficiently frequent patterns, providing a data structure whose size scales with the amount of such patterns. Our theoretical findings are sustained by experiments showing the practical impact of our data structures.
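
To make the query contract concrete, the following is a minimal Python sketch of what an additive-error count means. It is not the data structure from the paper: the names `exact_count`, `SampledSuffixIndex`, and `count_approx` are hypothetical, and the toy index stores every l-th suffix of the sorted suffix list verbatim, so it does not come close to the space bound stated in the abstract. It only illustrates that an answer built from one sample per l sorted suffixes can differ from the true count by less than l.

```python
def exact_count(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, overlapping ones included."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    last_start = len(text) - len(pattern)
    return sum(1 for i in range(last_start + 1) if text.startswith(pattern, i))


class SampledSuffixIndex:
    """Toy estimator (an assumption for illustration, not the paper's index).

    All suffixes of the text are sorted and every l-th one is kept.
    The suffixes that start with a pattern P form one contiguous run in
    sorted order, so a run of length c contains between c/l - 1 and
    c/l + 1 samples (exclusive). Multiplying the number of matching
    samples by l therefore gives an estimate with additive error < l.
    """

    def __init__(self, text: str, l: int) -> None:
        if l < 1:
            raise ValueError("l must be at least 1")
        self.l = l
        # Quadratic-space construction: acceptable only for a demonstration.
        suffixes = sorted(text[i:] for i in range(len(text)))
        self.samples = suffixes[::l]

    def count_approx(self, pattern: str) -> int:
        if not pattern:
            raise ValueError("pattern must be non-empty")
        # Linear scan keeps the sketch obviously correct; a binary search
        # over the sorted samples would be the natural refinement.
        matching = sum(1 for s in self.samples if s.startswith(pattern))
        return matching * self.l


if __name__ == "__main__":
    text = "mississippi"
    index = SampledSuffixIndex(text, l=3)
    for pattern in ("i", "ss", "issi", "x"):
        print(pattern, exact_count(text, pattern), index.count_approx(pattern))
```

With l = 1 every suffix is kept and the estimate is exact; larger l trades accuracy for fewer stored samples, which mirrors the space/error trade-off the abstract describes, though by a much cruder mechanism.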

References

  1. J. Barbay, T. Gagie, G. Navarro, and Y. Nekrich. Alphabet partitioning for compressed rank/select and applications. In ISAAC (2), pages 315--326, 2010.
  2. J. Barbay, M. He, J.I. Munro, and S. Srinivasa Rao. Succinct indexes for string, binary relations and multi-labeled trees. In Proc. ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 680--689, 2007.
  3. D. Belazzougui, P. Boldi, R. Pagh, and S. Vigna. Fast prefix search in little space, with applications. In ESA (1), pages 427--438, 2010.
  4. M. Burrows and D. Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
  5. S. Chaudhuri, V. Ganti, and L. Gravano. Selectivity estimation for string predicates: Overcoming the underestimation problem. In Proceedings of the 20th International Conference on Data Engineering, ICDE '04, pages 227-, 2004.
  6. P. Ferragina, R. González, G. Navarro, and R. Venturini. Compressed text indexes: From theory to practice. ACM Journal of Experimental Algorithmics, 13, 2008.
  7. P. Ferragina and R. Grossi. The string B-tree: a new data structure for string search in external memory and its applications. J. ACM, 46:236--280, March 1999.
  8. P. Ferragina and G. Manzini. Indexing compressed text. Journal of the ACM, 52(4):552--581, 2005.
  9. P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro. Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms, 3(2), 2007.
  10. P. Ferragina and R. Venturini. Compressed permuterm index. In SIGIR, pages 535--542, 2007.
  11. R. Grossi and J. S. Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing, 35(2):378--407, 2005.
  12. D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
  13. T. Hagerup and T. Tholey. Efficient minimal perfect hashing in nearly minimal space. In Proceedings of the 18th Annual Symposium on Theoretical Aspects of Computer Science (STACS), pages 317--326, 2001.
  14. H.V. Jagadish, R. T. Ng, and D. Srivastava. Substring selectivity estimation. In Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, PODS '99, pages 249--260, 1999.
  15. P. Krishnan, J. S. Vitter, and B. R. Iyer. Estimating alphanumeric selectivity in the presence of wildcards. In SIGMOD Conference, pages 282--293, 1996.
  16. G. Manzini. An analysis of the Burrows-Wheeler transform. Journal of the ACM, 48(3):407--430, 2001.
  17. G. Navarro and V. Mäkinen. Compressed full text indexes. ACM Computing Surveys, 39(1), 2007.
  18. D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. In ALENEX, 2007.
  19. I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, 1999.

Published in

PODS '11: Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
June 2011
332 pages
ISBN: 9781450306607
DOI: 10.1145/1989284

Copyright © 2011 ACM

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 476 of 1,835 submissions, 26%
