ABSTRACT
Database queries can be broadly classified into two categories: reporting queries and aggregation queries. The former retrieve the collection of records in the database that match the query's conditions, while the latter return an aggregate, such as the count, sum, average, or max (min), of a particular attribute of these records. Aggregation queries are especially useful in business intelligence and data analysis applications, where users are interested not in the actual records but in some statistics of them. They can also be executed much more efficiently than reporting queries, by embedding properly precomputed aggregates into an index.
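To make the contrast concrete, the following is a minimal sketch of an index with embedded precomputed aggregates, assuming one-dimensional keys and a static data set. The class `PrefixSumIndex` and its methods are hypothetical names chosen for illustration; they are not taken from the paper. Count and sum are answered with two binary searches, while reporting has to touch every matching record.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate


class PrefixSumIndex:
    """Hypothetical 1-D index: sorted keys plus precomputed prefix sums."""

    def __init__(self, records):
        # records: iterable of (key, value) pairs
        pairs = sorted(records)
        self.keys = [k for k, _ in pairs]
        # prefix[i] = sum of the values of the first i records in key order
        self.prefix = [0] + list(accumulate(v for _, v in pairs))

    def _span(self, lo, hi):
        # Positions of the records whose key lies in [lo, hi]; assumes lo <= hi.
        return bisect_left(self.keys, lo), bisect_right(self.keys, hi)

    def count(self, lo, hi):
        # Aggregation query: O(log n), independent of the result size.
        i, j = self._span(lo, hi)
        return j - i

    def sum(self, lo, hi):
        # Aggregation query answered from the precomputed prefix sums.
        i, j = self._span(lo, hi)
        return self.prefix[j] - self.prefix[i]

    def report(self, lo, hi):
        # Reporting query: cost grows with the number of matching records.
        i, j = self._span(lo, hi)
        return self.keys[i:j]


idx = PrefixSumIndex([(5, 10), (1, 2), (3, 4), (8, 1)])
print(idx.count(2, 6), idx.sum(2, 6), idx.report(2, 6))
```

The same idea carries over to tree-shaped indexes, where each internal node would store the aggregate of its subtree; the flat prefix-sum array is used here only because it is the shortest correct instance.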
However, reporting and aggregation queries represent only the two extremes of data exploration. Data analysts often need more insight into the data distribution than simple aggregates provide, yet they do not want the sheer volume of data returned by reporting queries. In this paper, we design indexing techniques that allow for extracting a statistical summary of all the records that satisfy the query. The summaries we support include frequent items, quantiles, various sketches, and wavelets, all of which are of central importance in massive data analysis. Our indexes require linear space and extract a summary with optimal or near-optimal query cost.
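As an illustration of what a frequent-items summary looks like, the sketch below uses the classical Misra-Gries summary together with the standard merge rule for such summaries. It is an assumption made for illustration, not the index construction of this paper: it only shows that summaries precomputed for two disjoint groups of records can be combined into a summary of their union without rescanning the records. The function names `mg_summary` and `mg_merge` are hypothetical.

```python
from collections import Counter


def mg_summary(items, k):
    """Misra-Gries summary with at most k counters.

    Each stored count underestimates the true frequency by at most n / (k + 1),
    where n is the number of items summarized.
    """
    counters = {}
    for x in items:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # No free counter: decrement all counters and drop those reaching zero.
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return counters


def mg_merge(a, b, k):
    """Merge two Misra-Gries summaries into one with at most k counters."""
    merged = Counter(a)
    merged.update(b)  # add the counts item by item
    if len(merged) <= k:
        return dict(merged)
    # Subtract the (k+1)-th largest count and keep only positive counters.
    cut = sorted(merged.values(), reverse=True)[k]
    return {x: c - cut for x, c in merged.items() if c > cut}


left = mg_summary(["x"] * 5 + ["y"] * 2 + ["z"], k=2)
right = mg_summary(["x"] * 3 + ["w"] * 3, k=2)
print(mg_merge(left, right, k=2))
```

In this toy run the merged summary keeps `x` and `w`, and each reported count stays within n / (k + 1) of the true frequency over all 14 items.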
Index Terms
Beyond simple aggregates: indexing for summary queries