ABSTRACT
Processing large data streams is now a major topic in data management. The data involved can be truly massive, and the required analyses complex. In a stream of sequential events such as stock feeds, sensor readings, or IP traffic measurements, data tuples pertaining to recent events are typically more important than older ones. This can be formalized via time-decay functions, which assign a weight to each data item according to its age. Decay functions such as sliding windows and exponential decay have been studied under the assumption of well-ordered arrivals, i.e., data arrives in non-decreasing order of timestamps. However, data quality issues are prevalent in massive streams (due to network asynchrony, delays, and similar causes), and correct arrival order is not guaranteed.
We focus on the computation of decayed aggregates such as range queries, quantiles, and heavy hitters on out-of-order streams, where elements do not necessarily arrive in increasing order of timestamps. Existing techniques such as Exponential Histograms and Waves are unable to handle out-of-order streams. We give the first deterministic algorithms for approximating these aggregates under popular decay functions such as sliding window and polynomial decay. We study the overhead of allowing out-of-order arrivals when compared to well-ordered arrivals, both analytically and experimentally. Our experiments confirm that these algorithms can be applied in practice, and compare the relative performance of different approaches for handling out-of-order arrivals.
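To make the notion of a decayed aggregate concrete, the following minimal sketch computes exact decayed counts, sums, and heavy hitters by brute force. It is not the algorithm of the paper: it stores and scans every element and serves only as the reference answer that a small-space summary would approximate. The function names, the tuple layout `(timestamp, item, value)`, and the polynomial form `(age + 1) ** -alpha` are assumptions chosen for illustration.

```python
from collections import defaultdict

def sliding_window(age, window):
    """Weight 1 for elements whose age is within `window`, else 0."""
    return 1.0 if 0 <= age <= window else 0.0

def polynomial_decay(age, alpha):
    """Weight (age + 1) ** -alpha; assumes age >= 0 (illustrative form)."""
    return (age + 1.0) ** -alpha

def decayed_aggregates(stream, now, decay):
    """Exact decayed count, decayed sum, and per-item decayed frequency.

    `stream` yields (timestamp, item, value) tuples in ANY arrival order.
    Each weight depends only on the element's own timestamp, so the result
    does not depend on the order in which elements arrive.
    """
    count = 0.0
    total = 0.0
    freq = defaultdict(float)
    for ts, item, value in stream:
        w = decay(now - ts)      # age is measured at query time `now`
        count += w
        total += w * value
        freq[item] += w
    return count, total, dict(freq)

def heavy_hitters(freq, count, phi):
    """Items whose decayed frequency is at least phi times the decayed count."""
    return sorted(k for k, w in freq.items() if w >= phi * count)

# Hypothetical out-of-order arrivals: timestamps 5, 2, 9, 7, 3.
stream = [(5, "a", 10.0), (2, "b", 4.0), (9, "a", 1.0),
          (7, "c", 3.0), (3, "a", 2.0)]
now = 10

def window_5(age):
    return sliding_window(age, 5)

count, total, freq = decayed_aggregates(stream, now, window_5)
hitters = heavy_hitters(freq, count, 0.5)
```

With a window of 5 at query time 10, only the elements with timestamps 5, 7, and 9 carry weight, regardless of where they appear in the arrival sequence. The difficulty addressed in the abstract is obtaining such answers approximately without retaining every element.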
Index Terms
Time-decaying aggregates in out-of-order streams