DOI: 10.1145/1989284.1989299
research-article

Beyond simple aggregates: indexing for summary queries

Published: 13 June 2011

ABSTRACT

Database queries can be broadly classified into two categories: reporting queries and aggregation queries. The former retrieve the collection of records in the database that match the query's conditions, while the latter return an aggregate, such as the count, sum, average, or max (min), of a particular attribute of these records. Aggregation queries are especially useful in business intelligence and data analysis applications, where users are interested not in the actual records but in statistics over them. They can also be executed much more efficiently than reporting queries, by embedding properly precomputed aggregates into an index.
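To make the idea of embedding precomputed aggregates into an index concrete, here is a minimal one-dimensional sketch. It is an illustration under stated assumptions, not a structure taken from the paper: the class name `PrefixAggregateIndex`, its methods, and the choice of prefix sums over sorted keys are all hypothetical. It answers count, sum, and average over a key range with two binary searches, without touching the matching records.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate


class PrefixAggregateIndex:
    """Hypothetical 1-D index: sorted keys plus prefix sums of the values."""

    def __init__(self, records):
        # records: iterable of (key, value) pairs with numeric values.
        ordered = sorted(records)
        self.keys = [k for k, _ in ordered]
        # prefix[i] holds the sum of the first i values in key order.
        self.prefix = [0] + list(accumulate(v for _, v in ordered))

    def _bounds(self, lo, hi):
        # Positions delimiting the records with lo <= key <= hi (assumes lo <= hi).
        return bisect_left(self.keys, lo), bisect_right(self.keys, hi)

    def count(self, lo, hi):
        i, j = self._bounds(lo, hi)
        return j - i

    def total(self, lo, hi):
        i, j = self._bounds(lo, hi)
        return self.prefix[j] - self.prefix[i]

    def average(self, lo, hi):
        n = self.count(lo, hi)
        return self.total(lo, hi) / n if n else None


idx = PrefixAggregateIndex([(1, 10), (3, 20), (5, 30), (7, 40)])
print(idx.count(2, 6), idx.total(2, 6), idx.average(2, 6))
```

Each query costs two binary searches regardless of how many records fall in the range, whereas a reporting query must at least read every matching record. Max and min are not covered by this sketch, since they cannot be obtained by subtracting prefixes.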

However, reporting and aggregation queries provide only two extremes for exploring the data. Data analysts often need more insight into the data distribution than these simple aggregates provide, yet do not want the sheer volume of data returned by reporting queries. In this paper, we design indexing techniques that allow for extracting a statistical summary of all the records matching the query. The summaries we support include frequent items, quantiles, various sketches, and wavelets, all of which are of central importance in massive data analysis. Our indexes require linear space and extract a summary with the optimal or near-optimal query cost.
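As an illustration of what a summary query returns, the sketch below computes a frequent-items summary of the records in a key range using the classic Misra-Gries algorithm. This is a scan-based baseline under stated assumptions, not the paper's index: it reads every matching record, which is exactly the cost the proposed indexes aim to avoid. The function names and the `(key, item)` record layout are hypothetical.

```python
def misra_gries(items, k):
    """Misra-Gries summary using at most k - 1 counters (k >= 2).

    Each reported count underestimates the true frequency by at most n / k,
    where n is the number of items processed.
    """
    counters = {}
    for x in items:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No free counter: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def frequent_items_summary(records, lo, hi, k):
    """Summarize the items of all (key, item) records with lo <= key <= hi."""
    return misra_gries((item for key, item in records if lo <= key <= hi), k)


records = [(1, "a"), (2, "b"), (3, "a"), (4, "c"), (5, "a"), (6, "a"), (20, "z")]
print(frequent_items_summary(records, 1, 6, k=2))
```

The result has bounded size no matter how many records match, which places it between a single aggregate and the full record set returned by a reporting query.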

Published in

PODS '11: Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
June 2011
332 pages
ISBN: 9781450306607
DOI: 10.1145/1989284

Copyright © 2011 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Overall Acceptance Rate: 476 of 1,835 submissions, 26%
