ABSTRACT
Random sampling has become a crucial component of modern data management systems. Although the literature on database sampling is large, there has been relatively little work on the problem of maintaining a sample in the presence of arbitrary insertions and deletions to the underlying dataset. Most existing maintenance techniques apply either to the insert-only case or to datasets that do not contain duplicates. In this paper, we provide a scheme that maintains a Bernoulli sample of an underlying multiset in the presence of an arbitrary stream of updates, deletions, and insertions. Importantly, the scheme never needs to access the underlying multiset. Such Bernoulli samples are easy to manipulate, and are well suited to parallel processing environments. Our method can be viewed as an enhancement of the "counting sample" scheme developed by Gibbons and Matias for estimating the frequency of highly frequent items. We show how the "tracking counters" used by our maintenance scheme can be exploited to estimate population frequencies, sums, and averages in an unbiased manner, with lower variance than the usual estimators based on a Bernoulli sample. The number of distinct items in the multiset can also be estimated without bias. Finally, we discuss certain problems of subsampling and merging that arise in systems with limited memory resources or distributed processing, respectively.
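To make the abstract's idea concrete, the following is a minimal sketch of a tracking-counter scheme in the spirit described above: each sampled item carries a pair (X, Y), where X is its frequency in the sample and Y counts the copies inserted since the item entered the sample. The decrement probability (X - 1)/(Y - 1) is derived here from the invariant that, given Y = y, X - 1 follows a Binomial(y - 1, q) law; this is a reconstruction for illustration, not a verbatim transcription of the paper's algorithm, and the class and method names are my own.

```python
import random

class BernoulliSample:
    """Sketch of a Bernoulli(q) sample of a multiset, maintained under
    inserts and deletes without ever touching the base data.

    For each sampled item we keep a pair [X, Y]: X is the item's
    frequency in the sample, Y its tracking counter (copies inserted
    since the item entered the sample; the first such copy is kept
    for as long as Y >= 1).
    """

    def __init__(self, q, rng=None):
        self.q = q
        self.rng = rng or random.Random()
        self.sample = {}  # item -> [X, Y]

    def insert(self, t):
        if t in self.sample:
            self.sample[t][1] += 1              # one more tracked copy
            if self.rng.random() < self.q:
                self.sample[t][0] += 1          # this copy is sampled
        elif self.rng.random() < self.q:
            self.sample[t] = [1, 1]             # item enters the sample

    def delete(self, t):
        if t not in self.sample:
            return                              # deleted copy was not sampled
        x, y = self.sample[t]
        if y == 1:
            del self.sample[t]                  # last tracked copy is gone
        else:
            # Treat the deleted copy as one of the y - 1 non-first
            # tracked copies; it was in the sample with probability
            # (x - 1)/(y - 1), which preserves the binomial invariant.
            if self.rng.random() < (x - 1) / (y - 1):
                self.sample[t][0] -= 1
            self.sample[t][1] -= 1

    def estimate_frequency(self, t):
        """Plain Horvitz-Thompson estimate X/q of the frequency of t."""
        return self.sample.get(t, [0, 0])[0] / self.q
```

Since each copy of an item is sampled with probability q, E[X] = qN and X/q is unbiased for the item's true frequency N; the estimators the paper proposes additionally exploit Y to reduce variance below that of this plain X/q estimator.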
REFERENCES
- B. Babcock, M. Datar, and R. Motwani. Sampling from a moving window over streaming data. In Proc. SODA, pages 633--634, 2002.
- P. G. Brown and P. J. Haas. Techniques for warehousing of sample data. In Proc. ICDE, 2006.
- G. Cormode, S. Muthukrishnan, and I. Rozenbaum. Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. In Proc. VLDB, pages 25--36, 2005.
- L. Devroye. Non-Uniform Random Variate Generation. Springer, New York, 1986.
- C. Fan, M. Muller, and I. Rezucha. Development of sampling plans by using sequential (item by item) techniques and digital computers. J. Amer. Statist. Assoc., 57:387--402, 1962.
- G. Frahling, P. Indyk, and C. Sohler. Sampling in dynamic data streams and applications. In Proc. 21st Annual Symp. Comput. Geom., pages 142--149, 2005.
- R. Gemulla, W. Lehner, and P. J. Haas. A dip in the reservoir: maintaining sample synopses of evolving datasets. In Proc. VLDB, pages 595--606, 2006.
- P. B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In Proc. VLDB, pages 541--550, 2001.
- P. B. Gibbons. Distinct-values estimation over data streams. In Data Stream Management: Processing High Speed Data Streams. Springer, 2007.
- P. B. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answers. In Proc. ACM SIGMOD, pages 331--342, 1998.
- C. Jermaine, A. Pol, and S. Arumugam. Online maintenance of very large random samples. In Proc. ACM SIGMOD, pages 299--310, 2004.
- F. Olken and D. Rotem. Simple random sampling from relational databases. In Proc. VLDB, pages 160--169, 1986.
- C.-E. Särndal, B. Swensson, and J. Wretman. Model Assisted Survey Sampling. Springer, New York, 1992.
- J. S. Vitter. Random sampling with a reservoir. ACM Trans. Math. Software, 11(1):37--57, 1985.
Index Terms
Maintaining Bernoulli samples over evolving multisets