ABSTRACT
A fundamental problem in data management is to draw a sample of a large data set for approximate query answering, selectivity estimation, and query planning. With large, streaming data sets, this problem becomes particularly difficult when the data is shared across multiple distributed sites. The challenge is to ensure that the sample is drawn uniformly from the union of the data while minimizing the communication needed to run the protocol and to track parameters of the evolving data. At the same time, the protocol must be lightweight, keeping the space and time costs low for each participant. In this paper, we present communication-efficient protocols for sampling (both with and without replacement) from k distributed streams. These apply when we want a sample from the full streams, and to the sliding-window cases of only the W most recent items or only the arrivals within the last w time units. We show that our protocols are optimal not only in the communication used but also in per-item processing time and space, which are minimal or near-minimal (up to logarithmic factors).
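To make the problem setting concrete, the following is a minimal sketch of one generic way to keep a uniform sample without replacement over the union of several streams. It is not the protocol from this paper and makes no optimality claim. It uses the well-known random-tag idea: every item gets an independent uniform tag, and the coordinator keeps the s items with the smallest tags. The class names `Coordinator` and `Site`, the helper `run_demo`, and all parameters are hypothetical and chosen only for illustration.

```python
import heapq
import random


class Coordinator:
    """Keeps the sample_size items with the smallest random tags seen so far."""

    def __init__(self, sample_size):
        self.sample_size = sample_size
        self._heap = []  # max-heap on tag, stored as (-tag, item)

    def threshold(self):
        # Until the sample is full, every item must be forwarded.
        if len(self._heap) < self.sample_size:
            return 1.0
        return -self._heap[0][0]  # largest tag currently in the sample

    def receive(self, tag, item):
        if len(self._heap) < self.sample_size:
            heapq.heappush(self._heap, (-tag, item))
        elif tag < -self._heap[0][0]:
            heapq.heapreplace(self._heap, (-tag, item))
        # Reply with the current threshold so the sender can filter locally.
        return self.threshold()

    def sample(self):
        return [item for _, item in self._heap]


class Site:
    """One stream endpoint; forwards an item only if its tag beats the
    (possibly stale) threshold it last heard from the coordinator."""

    def __init__(self, coordinator, rng):
        self.coordinator = coordinator
        self.rng = rng
        self.local_threshold = 1.0
        self.messages_sent = 0

    def observe(self, item):
        tag = self.rng.random()  # uniform in [0, 1)
        if tag < self.local_threshold:
            self.messages_sent += 1
            self.local_threshold = self.coordinator.receive(tag, item)


def run_demo(num_sites=4, num_items=10_000, sample_size=10, seed=0):
    rng = random.Random(seed)
    coordinator = Coordinator(sample_size)
    sites = [Site(coordinator, rng) for _ in range(num_sites)]
    for i in range(num_items):
        sites[i % num_sites].observe(i)  # item i arrives at one site
    messages = sum(site.messages_sent for site in sites)
    return coordinator.sample(), messages


if __name__ == "__main__":
    sample, messages = run_demo()
    print(len(sample), "items sampled using", messages, "messages")
```

Because the global threshold only decreases, a site's stale local threshold is never smaller than the true one, so no item that belongs in the sample is filtered out. A stale threshold costs only extra messages. The sketch covers the full-stream case only and does not handle sliding windows or sampling with replacement.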
Index Terms
Optimal sampling from distributed streams