DOI: 10.1145/1807085.1807099
research-article

Optimal sampling from distributed streams

Published: 06 June 2010

ABSTRACT

A fundamental problem in data management is to draw a sample of a large data set, for approximate query answering, selectivity estimation, and query planning. With large, streaming data sets, this problem becomes particularly difficult when the data is shared across multiple distributed sites. The challenge is to ensure that a sample is drawn uniformly across the union of the data while minimizing the communication needed to run the protocol and track parameters of the evolving data. At the same time, it is also necessary to make the protocol lightweight, by keeping the space and time costs low for each participant. In this paper, we present communication-efficient protocols for sampling (both with and without replacement) from k distributed streams. These apply to the case when we want a sample from the full streams, and to the sliding window cases of only the W most recent items, or arrivals within the last w time units. We show that our protocols are optimal not only in the communication used, but also in that they use minimal or near-minimal (up to logarithmic factors) time to process each new item and space to operate.
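
To make the problem setting concrete, here is a minimal Python sketch of one simple way to draw a uniform sample without replacement of size s from the union of k streams. It is not the protocol from the paper, whose details the abstract does not give. It uses a generic random-tag ("bottom-s") idea: each item gets an independent uniform tag, and the coordinator keeps the s items with the smallest tags. All names (`Coordinator`, `Site`, `simulate`) are hypothetical. The sketch also assumes that sites can read the coordinator's current threshold at no cost, so it counts only site-to-coordinator messages and ignores the broadcast cost that a real protocol must account for.

```python
import heapq
import random


class Coordinator:
    """Keeps the s items with the smallest random tags seen so far."""

    def __init__(self, s):
        self.s = s
        self.heap = []      # max-heap on tag, stored as (-tag, seq, item)
        self.messages = 0   # counts site-to-coordinator messages only
        self._seq = 0       # tie-breaker so items are never compared

    def threshold(self):
        # Accept everything until the sample is full; afterwards only
        # tags below the largest tag in the sample can enter it.
        if len(self.heap) < self.s:
            return 1.0
        return -self.heap[0][0]

    def receive(self, tag, item):
        self.messages += 1
        self._seq += 1
        entry = (-tag, self._seq, item)
        if len(self.heap) < self.s:
            heapq.heappush(self.heap, entry)
        elif tag < self.threshold():
            # Evict the item with the largest tag and insert the new one.
            heapq.heapreplace(self.heap, entry)

    def sample(self):
        return [item for _, _, item in self.heap]


class Site:
    """One stream participant; forwards only items that can enter the sample."""

    def __init__(self, coordinator, rng):
        self.coordinator = coordinator
        self.rng = rng

    def observe(self, item):
        tag = self.rng.random()  # uniform in [0, 1)
        # Assumption: the site learns the current threshold for free.
        if tag < self.coordinator.threshold():
            self.coordinator.receive(tag, item)


def simulate(k, n, s, seed=0):
    """Feed n items to k sites at random; return (sample, message count)."""
    rng = random.Random(seed)
    coordinator = Coordinator(s)
    sites = [Site(coordinator, rng) for _ in range(k)]
    for i in range(n):
        site_id = rng.randrange(k)
        sites[site_id].observe((site_id, i))
    return coordinator.sample(), coordinator.messages
```

Calling `simulate(k=4, n=10000, s=10)` returns 10 distinct items together with the number of forwarded messages, which is typically far smaller than n because the threshold shrinks as more items arrive. The sample is uniform because the s smallest of n independent uniform tags identify a uniformly random s-subset of the items.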

Published in

PODS '10: Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
June 2010
350 pages
ISBN: 9781450300339
DOI: 10.1145/1807085

Copyright © 2010 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 476 of 1,835 submissions, 26%
