DOI: 10.1145/2213556.2213561
research-article

Approximating and testing k-histogram distributions in sub-linear time

Published: 21 May 2012

ABSTRACT

A discrete distribution p over [n] is a k-histogram if its probability distribution function can be represented as a piecewise-constant function with k pieces. Such a function is represented by a list of k intervals and k corresponding values. We consider the following problem: given a collection of samples from a distribution p, find a k-histogram that (approximately) minimizes the ℓ₂ distance to the distribution p. We give time- and sample-efficient algorithms for this problem.
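To make the definitions concrete, here is a minimal Python sketch of a k-histogram stored as k intervals with k values, together with the ℓ₂ distance between it and the empirical distribution of a sample. The function names (`histogram_pmf`, `empirical_pmf`, `l2_distance`) and the example data are illustrative assumptions, not taken from the paper. The sketch expands everything into explicit length-n vectors, so it runs in time linear in n; it illustrates the objective only and is not the sub-linear algorithm the paper describes.

```python
from collections import Counter
from math import sqrt


def histogram_pmf(n, intervals, values):
    """Expand a k-histogram into an explicit pmf over [n] = {1, ..., n}.

    intervals: k inclusive (start, end) pairs that partition [n].
    values: k per-element probabilities, one for each interval.
    """
    pmf = [0.0] * n
    for (start, end), v in zip(intervals, values):
        for i in range(start, end + 1):
            pmf[i - 1] = v
    return pmf


def empirical_pmf(n, samples):
    """Fraction of the samples that landed on each element of [n]."""
    counts = Counter(samples)
    m = len(samples)
    return [counts[i] / m for i in range(1, n + 1)]


def l2_distance(p, q):
    """Euclidean (l2) distance between two pmfs given as equal-length lists."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical example: a 2-histogram over [4] that puts
# probability 0.4 on each of {1, 2} and 0.1 on each of {3, 4}.
h = histogram_pmf(4, [(1, 2), (3, 4)], [0.4, 0.1])
samples = [1, 1, 2, 2, 1, 2, 1, 2, 3, 4]
p_hat = empirical_pmf(4, samples)
print(l2_distance(h, p_hat))
```

In this toy example the sample frequencies match the histogram exactly, so the printed distance is zero. With real samples the distance would be positive, and the approximation problem asks for the intervals and values that make it as small as possible.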

We further provide algorithms that distinguish distributions that are k-histograms from distributions that are ε-far from every k-histogram, in the ℓ₁ distance and the ℓ₂ distance respectively.
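For reference, the notion of being ε-far can be written out as below. This follows the usual property-testing convention and is an assumption here: the abstract does not spell out the definition (for example, whether the inequality is strict), and the symbol $\mathcal{H}_k$ for the set of all k-histograms over [n] is introduced only for illustration.

```latex
\[
  \|p - q\|_1 = \sum_{i=1}^{n} |p(i) - q(i)|,
  \qquad
  \|p - q\|_2 = \Bigl( \sum_{i=1}^{n} \bigl(p(i) - q(i)\bigr)^2 \Bigr)^{1/2},
\]
\[
  p \text{ is } \varepsilon\text{-far from } \mathcal{H}_k \text{ in } \ell_r
  \iff
  \min_{q \in \mathcal{H}_k} \|p - q\|_r > \varepsilon,
  \qquad r \in \{1, 2\}.
\]
```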



    • Published in

      PODS '12: Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems
      May 2012
      332 pages
      ISBN:9781450312486
      DOI:10.1145/2213556

      Copyright © 2012 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      Overall Acceptance Rate: 476 of 1,835 submissions, 26%
