DOI: 10.1145/1807085.1807095
research-article

Understanding cardinality estimation using entropy maximization

Published: 06 June 2010

ABSTRACT

Cardinality estimation is the problem of estimating the number of tuples returned by a query; it is a fundamentally important task in data management, used in query optimization, progress estimation, and resource provisioning. We study cardinality estimation in a principled framework: given a set of statistical assertions about the number of tuples returned by a fixed set of queries, predict the number of tuples returned by a new query. We model this problem using the probability space, over possible worlds, that satisfies all provided statistical assertions and maximizes entropy. We call this the Entropy Maximization model for statistics (MaxEnt). In this paper we develop the mathematical techniques needed to use the MaxEnt model for predicting the cardinality of conjunctive queries.
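The framework described in the abstract can be illustrated on a toy instance. The sketch below is not the paper's technique (the paper develops analytical methods for conjunctive queries); it is a brute-force illustration under stated assumptions: a single unary relation over a tiny hypothetical domain, one statistical assertion fixing the expected number of tuples, and possible worlds taken to be all subsets of the domain. It relies on the standard fact that the entropy-maximizing distribution under an expected-value constraint has exponential-family form, here Pr[I] proportional to alpha^|I|. The names `maxent_distribution` and `expected_cardinality` are made up for this example.

```python
from itertools import combinations


def maxent_distribution(tuples, target_size, tol=1e-12):
    """Toy MaxEnt model: distribution over all subsets (possible worlds) of
    `tuples` that maximizes entropy subject to E[|I|] = target_size.

    Assumption: the solution has the form Pr[I] = alpha**|I| / Z, so only
    the single parameter alpha has to be found (by bisection here).
    """
    n = len(tuples)
    assert 0 < target_size < n, "target must lie strictly between 0 and n"

    # Enumerate every possible world (exponential in n; toy sizes only).
    worlds = [frozenset(c) for k in range(n + 1)
              for c in combinations(tuples, k)]

    def expected_size(alpha):
        weights = [alpha ** len(w) for w in worlds]
        z = sum(weights)
        return sum(len(w) * wt for w, wt in zip(worlds, weights)) / z

    # expected_size is increasing in alpha, so bracket the root and bisect.
    lo, hi = 0.0, 1.0
    while expected_size(hi) < target_size:
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_size(mid) < target_size:
            lo = mid
        else:
            hi = mid
    alpha = (lo + hi) / 2.0

    weights = [alpha ** len(w) for w in worlds]
    z = sum(weights)
    return {w: wt / z for w, wt in zip(worlds, weights)}


def expected_cardinality(dist, predicate):
    """Predicted cardinality of a selection query: the expected number of
    tuples satisfying `predicate`, taken over the possible worlds."""
    return sum(p * sum(1 for t in world if predicate(t))
               for world, p in dist.items())


if __name__ == "__main__":
    # Hypothetical domain of four tuples; assertion: one tuple on average.
    dist = maxent_distribution(["a", "b", "c", "d"], target_size=1.0)
    print(expected_cardinality(dist, lambda t: t in {"a", "b"}))
```

With four candidate tuples and an asserted expected size of 1, the model treats the tuples symmetrically, so a selection matching two of them is predicted to return approximately 0.5 tuples.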


Published in

PODS '10: Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
June 2010
350 pages
ISBN: 9781450300339
DOI: 10.1145/1807085

Copyright © 2010 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Overall acceptance rate: 476 of 1,835 submissions (26%)
