DOI: 10.1145/1265530.1265544
Article

Maintaining Bernoulli samples over evolving multisets

Published: 11 June 2007

ABSTRACT

Random sampling has become a crucial component of modern data management systems. Although the literature on database sampling is large, there has been relatively little work on the problem of maintaining a sample in the presence of arbitrary insertions and deletions to the underlying dataset. Most existing maintenance techniques apply either to the insert-only case or to datasets that do not contain duplicates. In this paper, we provide a scheme that maintains a Bernoulli sample of an underlying multiset in the presence of an arbitrary stream of updates, deletions, and insertions. Importantly, the scheme never needs to access the underlying multiset. Such Bernoulli samples are easy to manipulate, and are well suited to parallel processing environments. Our method can be viewed as an enhancement of the "counting sample" scheme developed by Gibbons and Matias for estimating the frequency of highly frequent items. We show how the "tracking counters" used by our maintenance scheme can be exploited to estimate population frequencies, sums, and averages in an unbiased manner, with lower variance than the usual estimators based on a Bernoulli sample. The number of distinct items in the multiset can also be estimated without bias. Finally, we discuss certain problems of subsampling and merging that arise in systems with limited memory resources or distributed processing, respectively.
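As a rough illustration of the idea, the sketch below keeps, for each sampled item, a sample count `x` and a tracking counter `y`, and never consults the underlying multiset. The abstract does not spell out the update rules, so everything here is an assumption made for illustration: the class name `BernoulliMultisetSample`, the specific insert and delete rules, and the frequency estimator `y - 1 + 1/q`. It also assumes that every deletion refers to an item actually present in the underlying multiset. It should be read as one plausible sketch of a tracking-counter scheme, not as the paper's algorithm.

```python
import random


class BernoulliMultisetSample:
    """Illustrative sketch (assumed rules, not taken from the abstract).

    entries maps item -> [x, y], where
      x = number of copies of the item currently in the sample,
      y = tracking counter: net insertions of the item since it
          (most recently) entered the sample, including that insertion.
    """

    def __init__(self, q, rng=None):
        if not 0.0 < q <= 1.0:
            raise ValueError("sampling rate q must be in (0, 1]")
        self.q = q
        self.rng = rng or random.Random()
        self.entries = {}

    def insert(self, item):
        entry = self.entries.get(item)
        if entry is None:
            # Item not in the sample: accept this copy with probability q.
            if self.rng.random() < self.q:
                self.entries[item] = [1, 1]
            return
        # Item already tracked: always count the insertion,
        # and sample the new copy with probability q.
        entry[1] += 1
        if self.rng.random() < self.q:
            entry[0] += 1

    def delete(self, item):
        entry = self.entries.get(item)
        if entry is None:
            return  # nothing tracked, so the sample is unaffected
        x, y = entry
        if y == 1:
            # The only tracked copy is being deleted.
            del self.entries[item]
            return
        # Remove a sampled copy with probability (x - 1) / (y - 1),
        # which keeps at least one sampled copy while y stays positive.
        if self.rng.random() < (x - 1) / (y - 1):
            entry[0] -= 1
        entry[1] -= 1

    def sample_count(self, item):
        entry = self.entries.get(item)
        return 0 if entry is None else entry[0]

    def estimate_frequency(self, item):
        # Assumed counter-based estimator of the item's true frequency.
        entry = self.entries.get(item)
        if entry is None:
            return 0.0
        return entry[1] - 1 + 1.0 / self.q
```

With `q = 1.0` the structure degenerates to an exact multiset, which makes the rules easy to check by hand; with smaller `q`, averaging `estimate_frequency` over many independent runs gives a quick empirical check of the assumed estimator.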


References

  1. B. Babcock, M. Datar, and R. Motwani. Sampling from a moving window over streaming data. In Proc. SODA, pages 633--634, 2002.
  2. P. G. Brown and P. J. Haas. Techniques for warehousing of sample data. In Proc. ICDE, 2006.
  3. G. Cormode, S. Muthukrishnan, and I. Rozenbaum. Summarizing and Mining Inverse Distributions on Data Streams via Dynamic Inverse Sampling. In Proc. VLDB, pages 25--36, 2005.
  4. L. Devroye. Non-Uniform Random Variate Generation. Springer, New York, 1986.
  5. C. Fan, M. Muller, and I. Rezucha. Development of sampling plans by using sequential (item by item) techniques and digital computers. J. Amer. Statist. Assoc., 57:387--402, 1962.
  6. G. Frahling, P. Indyk, and C. Sohler. Sampling in dynamic data streams and applications. In Proc. 21st Annual Symp. Comput. Geom., pages 142--149, 2005.
  7. R. Gemulla, W. Lehner, and P. J. Haas. A dip in the reservoir: maintaining sample synopses of evolving datasets. In Proc. VLDB, pages 595--606, 2006.
  8. P. B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In Proc. VLDB, pages 541--550, 2001.
  9. P. B. Gibbons. Distinct-values estimation over data streams. In Data Stream Management: Processing High Speed Data Streams. Springer, 2007.
  10. P. B. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answers. In Proc. ACM SIGMOD, pages 331--342, 1998.
  11. C. Jermaine, A. Pol, and S. Arumugam. Online maintenance of very large random samples. In Proc. ACM SIGMOD, pages 299--310, 2004.
  12. F. Olken and D. Rotem. Simple random sampling from relational databases. In Proc. VLDB, pages 160--169, 1986.
  13. C.-E. Särndal, B. Swensson, and J. Wretman. Model Assisted Survey Sampling. Springer, New York, 1992.
  14. J. S. Vitter. Random Sampling with a Reservoir. ACM Trans. Math. Software, 11(1):37--57, 1985.

Published in

PODS '07: Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
June 2007
328 pages
ISBN: 9781595936851
DOI: 10.1145/1265530

Copyright © 2007 ACM

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 476 of 1,835 submissions, 26%
