DOI: 10.1145/1989284.1989288
Research article

Get the most out of your sample: optimal unbiased estimators using partial information

Published: 13 June 2011

ABSTRACT

Random sampling is an essential tool in the processing and transmission of data. It is used to summarize data sets that are too large to store or manipulate, and to meet resource constraints on bandwidth or battery power. Estimators applied to the sample facilitate fast approximate processing of queries posed over the original data, and the value of the sample hinges on the quality of these estimators.

Our work targets data sets such as request and traffic logs and sensor measurements, where data is collected repeatedly over multiple instances: time periods, locations, or snapshots. We are interested in operations, such as quantiles and range, that span multiple instances. Subset-sums of these operations are used in applications ranging from planning to anomaly and change detection.
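As a concrete illustration (the keys and numbers below are made up, not from the paper), consider two instances of a keyed data set, e.g. traffic volume per flow in two time periods. The per-key range spans both instances, and a subset-sum aggregates it over a chosen set of keys:

```python
# Hypothetical illustration: two instances of a keyed data set
# (traffic volume per flow in two time periods).
instance_a = {"flow1": 10.0, "flow2": 3.0, "flow3": 7.0}
instance_b = {"flow1": 4.0, "flow2": 9.0, "flow3": 7.0}

def key_range(key):
    """Range (max minus min) of a key's values across the two instances."""
    values = [inst.get(key, 0.0) for inst in (instance_a, instance_b)]
    return max(values) - min(values)

def subset_sum_of_range(keys):
    """Sum of per-key ranges over a selected subset of keys."""
    return sum(key_range(k) for k in keys)

print(subset_sum_of_range(["flow1", "flow2"]))  # 6.0 + 6.0 = 12.0
```

Estimating such a subset-sum from samples of each instance is exactly the multi-instance setting the paper addresses.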

Unbiased low-variance estimators are particularly effective, as the relative error decreases with aggregation. The Horvitz-Thompson estimator, known to minimize variance for subset-sums over a sample of a single instance, is not optimal for multi-instance operations because it fails to exploit samples that provide partial information on the estimated quantity.
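For reference, over a single sampled instance the Horvitz-Thompson estimator weights each sampled value by the inverse of its inclusion probability, which yields an unbiased subset-sum estimate. A minimal sketch with made-up data and independent (Poisson) sampling; the population, probabilities, and helper names are illustrative assumptions, not from the paper:

```python
import random

def ht_subset_sum(population, inclusion_prob, predicate, seed=0):
    """Horvitz-Thompson estimate of the sum of values over keys satisfying
    `predicate`, from one independent (Poisson) sample in which each key
    is included with its own probability."""
    rng = random.Random(seed)
    estimate = 0.0
    for key, value in population.items():
        if rng.random() < inclusion_prob[key]:  # key entered the sample
            if predicate(key):
                # Inverse-probability weighting makes the estimate unbiased.
                estimate += value / inclusion_prob[key]
    return estimate

population = {"a": 1.0, "b": 2.0, "c": 3.0}
probs = {"a": 0.5, "b": 0.5, "c": 0.5}

# Averaging over many independent samples approaches the true subset sum
# (a + b = 3.0), illustrating unbiasedness.
runs = [ht_subset_sum(population, probs, lambda k: k != "c", seed=s)
        for s in range(10000)]
print(sum(runs) / len(runs))
```

Each individual estimate is noisy (here it can be 0, 2, 4, or 6), but its expectation equals the true subset sum; the paper's point is that for operations spanning several sampled instances, this per-sample weighting discards partial information and is no longer variance-optimal.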

We present a general, principled methodology for deriving optimal unbiased estimators over sampled instances and aim to understand its potential. We demonstrate significant improvements in estimate accuracy for fundamental queries under common sampling schemes.

• Published in

  PODS '11: Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
  June 2011, 332 pages
  ISBN: 9781450306607
  DOI: 10.1145/1989284
  Copyright © 2011 ACM
  Publisher: Association for Computing Machinery, New York, NY, United States
  Overall acceptance rate: 476 of 1,835 submissions, 26%
