DOI: 10.1145/1559795.1559818
research-article

Optimal sampling from sliding windows

Published: 29 June 2009

ABSTRACT

The sliding windows model is an important case of the streaming model, in which only the most "recent" elements of the stream remain active and the rest are discarded. The model is important for many applications (see, e.g., Babcock, Babu, Datar, Motwani and Widom (PODS 02); and Datar, Gionis, Indyk and Motwani (SODA 02)). There are two equally important types of sliding windows -- windows with fixed size (e.g., where items arrive one at a time, and only the most recent n items remain active for some fixed parameter n), and bursty windows (e.g., where many items can arrive in "bursts" at a single step and where only items from the last t steps remain active, again for some fixed parameter t).
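The two window types can be made concrete with a small reference model. The sketch below is only an illustration of the definitions above: the class names `FixedSizeWindow` and `BurstyWindow` and their methods are hypothetical, and both classes store the whole window, which is exactly the cost that sampling algorithms for sliding windows try to avoid.

```python
from collections import deque


class FixedSizeWindow:
    """Fixed-size window: the n most recently arrived items are active."""

    def __init__(self, n):
        # A bounded deque silently drops the oldest item on overflow.
        self.items = deque(maxlen=n)

    def arrive(self, item):
        self.items.append(item)

    def active(self):
        return list(self.items)


class BurstyWindow:
    """Bursty window: items that arrived during the last t steps are active."""

    def __init__(self, t):
        self.t = t
        self.step = 0
        self.items = deque()  # pairs (arrival_step, item), oldest first

    def advance(self, burst):
        """Move to the next step; `burst` may hold zero, one, or many items."""
        self.step += 1
        for item in burst:
            self.items.append((self.step, item))
        # Expire everything that arrived t or more steps ago.
        while self.items and self.items[0][0] <= self.step - self.t:
            self.items.popleft()

    def active(self):
        return [item for _, item in self.items]
```

Note that the number of active items is always at most n in the fixed-size case, while in the bursty case it depends on the burst sizes and is not known in advance.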

Random sampling is a fundamental tool for data streams, as numerous algorithms operate on the sampled data instead of on the entire stream. Effective sampling from sliding windows is a nontrivial problem, as elements eventually expire. In fact, the deletions are implicit; i.e., it is not possible to identify deleted elements without storing the entire window. The implicit nature of deletions on sliding windows does not allow the existing methods (even those that support explicit deletions, e.g., Cormode, Muthukrishnan and Rozenbaum (VLDB 05); Frahling, Indyk and Sohler (SOCG 05)) to be directly "translated" to the sliding windows model. One trivial approach to overcoming the problem of implicit deletions is that of over-sampling. When k samples are required, the over-sampling method maintains k'>k samples in the hope that at least k samples are not expired. The obvious disadvantages of this method are twofold:

(a) It introduces additional costs and thus decreases the performance; and

(b) The memory bounds are not deterministic, which is atypical for streaming algorithms (where even small probability events may eventually happen for a stream that is big enough).
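Both disadvantages are easy to see in a toy version of over-sampling. The sketch below assumes a fixed-size window of n items and keeps each arriving item independently with probability k'/n, so that about k' kept items are active at any time; the class `OverSampler` and its parameters are hypothetical and are not taken from any of the cited papers.

```python
import random
from collections import deque


class OverSampler:
    """Toy over-sampling: keep about k_prime > k items and hope k survive."""

    def __init__(self, n, k, k_prime, seed=None):
        self.n = n
        self.k = k
        self.p = k_prime / n           # per-item retention probability
        self.rng = random.Random(seed)
        self.kept = deque()            # pairs (arrival_index, item)
        self.count = 0                 # number of items seen so far

    def arrive(self, item):
        self.count += 1
        if self.rng.random() < self.p:
            self.kept.append((self.count, item))
        # Arrival indices are stored so that expired items can be dropped.
        while self.kept and self.kept[0][0] <= self.count - self.n:
            self.kept.popleft()

    def sample(self):
        """Return k distinct active items, or None if too few were kept."""
        if len(self.kept) < self.k:
            return None
        chosen = self.rng.sample(list(self.kept), self.k)
        return [item for _, item in chosen]
```

Here `len(kept)` is a random quantity: it exceeds k most of the time, which is the extra cost in (a), and it has no fixed upper bound below n and can also drop below k, which is the randomized guarantee in (b).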

Babcock, Datar and Motwani (SODA 02) were the first to stress the importance of improvements to over-sampling. They formally introduced the problem of sampling from sliding windows and improved the over-sampling method for sampling with replacement. Their elegant solutions for sampling with replacement are optimal in expectation, and thus resolve disadvantage (a) mentioned above. Unfortunately, the randomized bounds do not resolve disadvantage (b) above. Interestingly, all algorithms that employ the ideas of Babcock, Datar and Motwani have the same central problem of having to deal with randomized complexity (see, e.g., Datar and Muthukrishnan (ESA 02); Chakrabarti, Cormode and McGregor (SODA 07)). Further, the proposed solutions of Babcock, Datar and Motwani for sampling without replacement are based on the criticized over-sampling method and thus do not resolve disadvantage (a). Therefore, the question of whether we can solve sampling on sliding windows optimally (i.e., resolving both disadvantages) is implicit in the paper of Babcock, Datar and Motwani and has remained open for all variants of the problem.

In this paper we answer these questions affirmatively and provide optimal sampling schemas for all variants of the problem, i.e., sampling with or without replacement from fixed or bursty windows. Specifically, for fixed-size windows, we provide optimal solutions that require O(k) memory; for bursty windows, we show algorithms that require O(k log n) memory, which is optimal since it matches the lower bound by Gemulla and Lehner (SIGMOD 08). In contrast to the work of Babcock, Datar and Motwani, our solutions have deterministic bounds. Thus, we prove a perhaps somewhat surprising fact: the memory complexity of sampling-based algorithms for all variants of the sliding windows model is comparable with that of streaming models (i.e., without the sliding windows). This is the first result of this type, since all previous "translations" of sampling-based algorithms to sliding windows incur randomized memory guarantees only.
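To give a feeling for how a deterministic memory bound is possible at all, the following sketch draws one uniform sample from a fixed-size window using a constant number of stored values. It partitions the stream into consecutive buckets of n items and runs size-1 reservoir sampling inside each bucket. The abstract does not spell out the paper's constructions, so this should be read as an illustrative assumption rather than the authors' algorithm; the class `WindowSampler` is hypothetical.

```python
import random


class WindowSampler:
    """One uniform sample from the last n items with O(1) stored values."""

    def __init__(self, n, seed=None):
        self.n = n
        self.rng = random.Random(seed)
        self.count = 0     # number of items seen so far
        self.prev = None   # (index, item): final sample of the last full bucket
        self.cur = None    # (index, item): running sample of the current bucket

    def arrive(self, item):
        self.count += 1
        pos = (self.count - 1) % self.n  # 0-based position inside the bucket
        if pos == 0:
            # A new bucket starts, so the previous bucket's sample is final.
            self.prev = self.cur
            self.cur = None
        # Size-1 reservoir sampling: the new item replaces the bucket's
        # sample with probability 1 / (pos + 1).
        if self.rng.randrange(pos + 1) == 0:
            self.cur = (self.count, item)

    def sample(self):
        if self.count == 0:
            raise ValueError("no items have arrived yet")
        # Use the old bucket's sample while it is still inside the window,
        # and fall back to the current bucket's sample once it has expired.
        if self.prev is not None and self.prev[0] > self.count - self.n:
            return self.prev[1]
        return self.cur[1]
```

If the current bucket holds j items, the old bucket's sample is still active with probability (n - j)/n and is then uniform over the n - j active old items; otherwise the current bucket's sample is returned, and each of its j items is chosen with probability (j/n)(1/j) = 1/n. Running k independent copies gives k samples with replacement using O(k) stored values, with no randomness in the memory usage.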

References

  1. C. Aggarwal (editor), Data Streams: Models and Algorithms, Springer Verlag, 2007.
  2. C. Aggarwal, "On biased reservoir sampling in the presence of stream evolution", Proceedings of the 32nd international conference on Very large data bases, pp. 607--618, 2006.
  3. N. Alon, N. Duffield, C. Lund, M. Thorup, "Estimating arbitrary subset sums with few probes," Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pp. 317--325, 2005.
  4. N. Alon, Y. Matias, M. Szegedy, "The space complexity of approximating the frequency moments," Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, pp. 20--29, 1996.
  5. A. Arasu, B. Babcock, S. Babu, J. Cieslewicz, M. Datar, K. Ito, R. Motwani, U. Srivastava, J. Widom, "STREAM: The Stanford Data Stream Management System," Book Chapter, "Data-Stream Management: Processing High-Speed Data Streams", Springer-Verlag, 2005.
  6. A. Arasu, G.S. Manku, "Approximate counts and quantiles over sliding windows," Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, 2004.
  7. A.M. Ayad, J.F. Naughton, "Static optimization of conjunctive queries with sliding windows over infinite streams," Proceedings of the 2004 ACM SIGMOD international conference on Management of data, 2004.
  8. B. Babcock, S. Babu, M. Datar, R. Motwani, J. Widom, "Models and issues in data stream systems", Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, 2002.
  9. B. Babcock, S. Babu, M. Datar, R. Motwani, D. Thomas, "Operator scheduling in data stream systems", The VLDB Journal of The International Journal on Very Large Data Bases, v.13 n.4, pp.333--353, 2004.
  10. B. Babcock, M. Datar, R. Motwani, "Sampling from a moving window over streaming data", Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, pp.633--634, 2002.
  11. B. Babcock, M. Datar, R. Motwani, "Load Shedding for Aggregation Queries over Data Streams", Proceedings of the 20th International Conference on Data Engineering, 2004.
  12. B. Babcock, M. Datar, R. Motwani, L. O'Callaghan, "Maintaining variance and k-medians over data stream windows", Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pp.234--243, 2003.
  13. Z. Bar-Yossef, "Sampling lower bounds via information theory", STOC, 2003.
  14. Z. Bar-Yossef, T.S. Jayram, R. Kumar, D. Sivakumar, "An Information Statistics Approach to Data Stream and Communication Complexity", Proceedings of the 43rd Symposium on Foundations of Computer Science, pp. 209--218, 2002.
  15. Z. Bar-Yossef, R. Kumar, D. Sivakumar, "Reductions in streaming algorithms, with an application to counting triangles in graphs", Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, pp.623--632, 2002.
  16. Z. Bar-Yossef, T.S. Jayram, R. Kumar, D. Sivakumar, L. Trevisan, "Counting Distinct Elements in a Data Stream", Proceedings of the 6th International Workshop on Randomization and Approximation Techniques, pp.1--10, 2002.
  17. Z. Bar-Yossef, R. Kumar, D. Sivakumar, "Sampling algorithms: lower bounds and applications", STOC, 2001.
  18. L. Bhuvanagiri, S. Ganguly, D. Kesh, C. Saha, "Simpler algorithm for estimating frequency moments of data streams", Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pp.708--713, 2006.
  19. V. Braverman, R. Ostrovsky, "Smooth histograms on stream windows", Proceedings of the 48th Symposium on Foundations of Computer Science, 2007.
  20. L.S. Buriol, G. Frahling, S. Leonardi, A. Marchetti-Spaccamela, C. Sohler, "Counting triangles in data streams", Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pp.253--262, 2006.
  21. A. Chakrabarti, G. Cormode, A. McGregor, "A near-optimal algorithm for computing the entropy of a stream". In Proceedings of ACM-SIAM Symposium on Discrete Algorithms, 2007.
  22. A. Chakrabarti, K. Do Ba, S. Muthukrishnan, "Estimating Entropy and Entropy Norm on Data Streams", In Proceedings of the 23rd International Symposium on Theoretical Aspects of Computer Science, 2006.
  23. A. Chakrabarti, S. Khot, X. Sun, "Near-optimal lower bounds on the multi-party communication complexity of set-disjointness", Proceedings of the 18th Annual IEEE Conference on Computational Complexity, 2003.
  24. K.L. Chang, R. Kannan, "The space complexity of pass-efficient algorithms for clustering", in ACM-SIAM Symposium on Discrete Algorithms, 2006, pp. 1157--1166.
  25. M. Charikar, C. Chekuri, T. Feder, R. Motwani, "Incremental clustering and dynamic information retrieval", SIAM J. Comput., 33 (2004), pp. 1417--1440.
  26. K. Chaudhuri, N. Mishra, "When Random Sampling Preserves Privacy", CRYPTO, 2006.
  27. S. Chaudhuri, R. Motwani, V. Narasayya, "On random sampling over joins", Proceedings of the 1999 ACM SIGMOD international conference on Management of data, pp.263--274, 1999.
  28. Y. Chi, H. Wang, P.S. Yu, R.R. Muntz, "Moment: Maintaining Closed Frequent Itemsets over a Stream Sliding Window", Fourth IEEE International Conference on Data Mining (ICDM'04), pp. 59--66, 2004.
  29. E. Cohen, "Size-estimation framework with applications to transitive closure and reachability," Journal of Computer and System Sciences, v.55 n.3, pp.441--453, 1997.
  30. E. Cohen, H. Kaplan, "Summarizing data using bottom-k sketches," Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing, 2007.
  31. G. Cormode, M. Datar, P. Indyk, S. Muthukrishnan, "Comparing Data Streams Using Hamming Norms (How to Zero In)", IEEE Transactions on Knowledge and Data Engineering, v.15 n.3, pp.529--540, 2003.
  32. G. Cormode, S. Muthukrishnan, I. Rozenbaum, "Summarizing and mining inverse distributions on data streams via dynamic inverse sampling", Proceedings of the 31st international conference on Very large data bases, 2005.
  33. D. Coppersmith, R. Kumar, "An improved data stream algorithm for frequency moments", Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms, pp.151--156, 2004.
  34. A. Das, J. Gehrke, M. Riedewald, "Semantic Approximation of Data Stream Joins", IEEE Transactions on Knowledge and Data Engineering, v.17 n.1, pp.44--59, 2005.
  35. A. Dasgupta, P. Drineas, B. Harb, R. Kumar, M.W. Mahoney, "Sampling algorithms and coresets for lp regression", SODA, 2008.
  36. M. Datar, A. Gionis, P. Indyk, R. Motwani, "Maintaining stream statistics over sliding windows: (extended abstract)", Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, pp.635--644, 2002.
  37. M. Datar, S. Muthukrishnan, "Estimating Rarity and Similarity over Data Stream Windows", Proceedings of the 10th Annual European Symposium on Algorithms, pp.323--334, 2002.
  38. N. Duffield, C. Lund, M. Thorup, "Flow sampling under hard resource constraints", ACM SIGMETRICS Performance Evaluation Review, v.32 n.1, 2004.
  39. J. Feigenbaum, S. Kannan, and J. Zhang, "Computing diameter in the streaming and sliding-window models", Algorithmica, 41:25--41, 2005.
  40. J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, J. Zhang, "Graph distances in the streaming model: the value of space", SODA, 2005.
  41. J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, J. Zhang, "On graph problems in a semi-streaming model", Theor. Comput. Sci., 2005.
  42. J. Feigenbaum, S. Kannan, M. Strauss, M. Viswanathan, "Testing and Spot-Checking of Data Streams", Algorithmica, 34(1): 67--80, 2002.
  43. G. Frahling, P. Indyk, C. Sohler, "Sampling in dynamic data streams and applications", Proceedings of the twenty-first annual symposium on Computational geometry, 2005.
  44. S. Ganguly, "Estimating Frequency Moments of Update Streams using Random Linear Combinations", Proceedings of the 8th International Workshop on Randomized Algorithms, pp. 369--380, 2004.
  45. S. Ganguly, "Counting distinct items over update streams", Theoretical Computer Science, pp.211--222, 2007.
  46. S. Gandhi, S. Suri, E. Welzl, "Catching elephants with mice: sparse sampling for monitoring sensor networks", SenSys, 2007.
  47. R. Gemulla, "Sampling Algorithms for Evolving Datasets", PhD Dissertation.
  48. R. Gemulla and W. Lehner, "Sampling time-based sliding windows in bounded space", In Proc. of the 2008 ACM SIGMOD Intl. Conf. on Management of Data, pp. 379--392.
  49. P.B. Gibbons, Y. Matias, "New sampling-based summary statistics for improving approximate query answers", Proceedings of the 1998 ACM SIGMOD international conference on Management of data, pp.331--342, 1998.
  50. P.B. Gibbons, S. Tirthapura, "Distributed streams algorithms for sliding windows", Proceedings of the fourteenth annual ACM symposium on Parallel algorithms and architectures, pp.10--13, 2002.
  51. L. Golab, D. DeHaan, E.D. Demaine, A. Lopez-Ortiz, J.I. Munro, "Identifying frequent items in sliding windows over on-line packet streams", Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement, 2003.
  52. L. Golab, M.T. Özsu, "Processing sliding window multi-joins in continuous queries over data streams", Proceedings of the 29th international conference on Very large data bases, pp.500--511, 2003.
  53. S. Guha, A. McGregor, S. Venkatasubramanian, "Streaming and sublinear approximation of entropy and information distances", Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pp.733--742, 2006.
  54. S. Guha, A. Meyerson, N. Mishra, R. Motwani, L. O'Callaghan, "Clustering Data Streams: Theory and Practice", IEEE Trans. on Knowledge and Data Engineering, vol. 15, 2003.
  55. P.J. Haas, "Data stream sampling: Basic techniques and results", In M. Garofalakis, J. Gehrke, and R. Rastogi (Eds.), Data Stream Management: Processing High Speed Data Streams, Springer.
  56. N. Harvey, J. Nelson, K. Onak, "Sketching and Streaming Entropy via Approximation Theory", The 49th Annual Symposium on Foundations of Computer Science (FOCS 2008).
  57. P. Indyk, D. Woodruff, "Optimal approximations of the frequency moments of data streams", Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, pp.202--208, 2005.
  58. H. Jowhari, M. Ghodsi, "New streaming algorithms for counting triangles in graphs", Proceedings of the 11th COCOON, pp. 710--716, 2005.
  59. M. Kolonko, D. Wäsch, "Sequential reservoir sampling with a nonuniform distribution", v.32, i.2, pp.257--273, 2006.
  60. L.K. Lee, H.F. Ting, "Frequency counting and aggregation: A simpler and more efficient deterministic scheme for finding frequent items over sliding windows", Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems (PODS '06), pp. 290--297, 2006.
  61. L.K. Lee, H.F. Ting, "Maintaining significant stream statistics over sliding windows", Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pp.724--732, 2006.
  62. K. Li, "Reservoir-sampling algorithms of time complexity O(n(1 + log(N/n)))", ACM Transactions on Mathematical Software (TOMS), v.20 n.4, pp.481--493, Dec. 1994.
  63. J. Li, D. Maier, K. Tufte, V. Papadimos, P.A. Tucker, "Semantics and Evaluation Techniques for Window Aggregates in Data Streams", SIGMOD, 2005.
  64. J. Li, D. Maier, K. Tufte, V. Papadimos, P.A. Tucker, "No pane, no gain: efficient evaluation of sliding-window aggregates over data streams", ACM SIGMOD Record, v.34 n.1, 2005.
  65. G.S. Manku, R. Motwani, "Approximate frequency counts over data streams". In Proceedings of the 28th International Conference on Very Large Data Bases, pp.356--357, 2002.
  66. S. Muthukrishnan, "Data Streams: Algorithms And Applications", Foundations and Trends in Theoretical Computer Science, Volume 1, Issue 2.
  67. C.R. Palmer, C. Faloutsos, "Density biased sampling: an improved method for data mining and clustering", Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pp.82--92, 2000.
  68. V. Paxson, G. Almes, J. Mahdavi, M. Mathis, "Framework for IP performance metrics", RFC 2330, 1998.
  69. C. Procopiuc, O. Procopiuc, "Density Estimation for Spatial Data Streams", Proceedings of the 9th International Symposium on Spatial and Temporal Databases, pp.109--126, 2005.
  70. M. Szegedy, "The DLT priority sampling is essentially optimal", Proceedings of the thirty-eighth annual ACM symposium on Theory of computing, pp.150--158, 2006.
  71. N. Tatbul, S. Zdonik, "Window-aware load shedding for aggregation queries over data streams", Proceedings of the 32nd international conference on Very large data bases, 2006.
  72. J.S. Vitter, "Random sampling with a reservoir", ACM Transactions on Mathematical Software (TOMS), v.11 n.1, pp.37--57, 1985.
  73. L. Zhang, Z. Li, M. Yu, Y. Wang, Y. Jiang, "Random sampling algorithms for sliding windows over data streams", Proc. of the 11th Joint International Computer Conference, pp. 572--575, 2005.
  74. H. Zhao, A. Lall, M. Ogihara, O. Spatscheck, J. Wang, J. Xu, "A data streaming algorithm for estimating entropies of od flows", Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, 2007.

Published in

PODS '09: Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
June 2009
298 pages
ISBN: 9781605585536
DOI: 10.1145/1559795
General Chair: Jan Paredaens
Program Chair: Jianwen Su

Copyright © 2009 ACM

Publisher: Association for Computing Machinery, New York, NY, United States


Overall acceptance rate: 476 of 1,835 submissions (26%)
