ABSTRACT
Cardinality estimation is the problem of estimating the number of tuples returned by a query; it is a fundamentally important task in data management, used in query optimization, progress estimation, and resource provisioning. We study cardinality estimation in a principled framework: given a set of statistical assertions about the number of tuples returned by a fixed set of queries, predict the number of tuples returned by a new query. We model this problem using the probability space, over possible worlds, that satisfies all provided statistical assertions and maximizes entropy. We call this the Entropy Maximization model for statistics (MaxEnt). In this paper we develop the mathematical techniques needed to use the MaxEnt model for predicting the cardinality of conjunctive queries.
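As a rough illustration of the model described above, the following Python sketch builds the maximum-entropy distribution over the possible worlds of a tiny relation by brute-force enumeration. It rests on several assumptions that are not in the abstract: a hypothetical binary relation R(A, B) with four candidate tuples, statistical assertions read as expected cardinalities, and a plain gradient-descent fit of the exponential-family parameters (the names `fit_maxent`, `estimate`, and the `count_*` queries are invented here). It is a toy for intuition only, not the mathematical techniques the paper develops.

```python
import itertools
import math

# Hypothetical toy domain: candidate tuples (a, b) of a binary relation R(A, B).
TUPLES = [(a, b) for a in (0, 1) for b in (0, 1)]

# A possible world is any subset of the candidate tuples (2^4 = 16 worlds).
WORLDS = [
    frozenset(t for t, keep in zip(TUPLES, mask) if keep)
    for mask in itertools.product((0, 1), repeat=len(TUPLES))
]

# Illustrative queries, each mapping a world to a tuple count.
def count_all(world):   # |R|
    return len(world)

def count_a0(world):    # |sigma_{A=0}(R)|
    return sum(1 for (a, _) in world if a == 0)

def count_b0(world):    # |sigma_{B=0}(R)|
    return sum(1 for (_, b) in world if b == 0)

def world_probs(queries, theta):
    """Exponential-family form: P(world) is proportional to exp(sum_j theta_j * q_j(world))."""
    weights = [
        math.exp(sum(th * q(w) for th, q in zip(theta, queries)))
        for w in WORLDS
    ]
    z = sum(weights)  # partition function
    return [wt / z for wt in weights]

def expected(query, probs):
    """Expected cardinality of a query under a distribution over worlds."""
    return sum(p * query(w) for p, w in zip(probs, WORLDS))

def fit_maxent(queries, targets, step=0.2, iters=5000):
    """Gradient descent on the convex dual log Z(theta) - theta . targets.

    The gradient is E_theta[q_j] - d_j, so it vanishes exactly when every
    assertion E[q_j] = d_j holds. Step size and iteration count are
    assumptions tuned for this toy domain.
    """
    theta = [0.0] * len(queries)
    for _ in range(iters):
        probs = world_probs(queries, theta)
        grad = [expected(q, probs) - d for q, d in zip(queries, targets)]
        theta = [th - step * g for th, g in zip(theta, grad)]
    return theta

def estimate(stat_queries, stat_values, new_query):
    """Predict the expected cardinality of new_query under the fitted model."""
    theta = fit_maxent(stat_queries, stat_values)
    return expected(new_query, world_probs(stat_queries, theta))

if __name__ == "__main__":
    # Assumed assertions: E[|R|] = 2.0 and E[|sigma_{A=0}(R)|] = 1.5.
    print(estimate([count_all, count_a0], [2.0, 1.5], count_b0))
```

With these two assumed assertions, the fitted distribution treats tuples independently (probability 0.75 for tuples with A = 0 and 0.25 for the others), so the estimate for the selection on B = 0 comes out at approximately 1.0. Enumerating all worlds is exponential in the number of candidate tuples, which is why this approach only serves as a sanity check on very small domains.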
Understanding cardinality estimation using entropy maximization