ABSTRACT
We consider the problem of tracking heavy hitters and quantiles in the distributed streaming model. The heavy hitters and quantiles are two important statistics for characterizing a data distribution. Let A be a multiset of elements, drawn from the universe U={1,...,u}. For a given 0 ≤ Φ ≤ 1, the Φ-heavy hitters are those elements of A whose frequency in A is at least Φ|A|; the Φ-quantile of A is an element x of U such that at most Φ|A| elements of A are smaller than x and at most (1-Φ)|A| elements of A are greater than x. Suppose the elements of A are received at k remote sites over time, and each of the sites has a two-way communication channel to a designated coordinator, whose goal is to track the set of Φ-heavy hitters and the Φ-quantile of A approximately at all times with minimum communication. We give tracking algorithms with worst-case communication cost O(k/ε ⋅ log n) for both problems, where n is the total number of items in A, and ε is the approximation error. This substantially improves upon the previously known algorithms. We also give matching lower bounds on the communication costs for both problems, showing that our algorithms are optimal. We also consider a more general version of the problem where we simultaneously track the Φ-quantiles for all 0 ≤ Φ ≤ 1.
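The two definitions above can be made concrete with a small, centralized sketch. It computes the exact Φ-heavy hitters and one Φ-quantile of a multiset held in a single list; it is only an illustration of the definitions, not the distributed tracking algorithm of the paper. The function names (`heavy_hitters`, `is_phi_quantile`, `phi_quantile`) are hypothetical, and for simplicity the quantile search looks only at elements that occur in A, although the definition allows any x in U.

```python
from collections import Counter


def heavy_hitters(A, phi):
    """Return the elements whose frequency in A is at least phi * |A|."""
    threshold = phi * len(A)
    counts = Counter(A)
    return {x for x, c in counts.items() if c >= threshold}


def is_phi_quantile(A, x, phi):
    """Check the definition: at most phi*|A| elements of A are smaller
    than x and at most (1 - phi)*|A| elements of A are greater than x."""
    n = len(A)
    smaller = sum(1 for a in A if a < x)
    greater = sum(1 for a in A if a > x)
    return smaller <= phi * n and greater <= (1 - phi) * n


def phi_quantile(A, phi):
    """Return the smallest element of A that satisfies the definition
    (assumption: we only search elements that occur in A)."""
    for x in sorted(set(A)):
        if is_phi_quantile(A, x, phi):
            return x
    return None  # only reached when A is empty


if __name__ == "__main__":
    A = [1, 1, 1, 2, 3, 3, 4, 5]
    print(heavy_hitters(A, 0.25))  # elements occurring at least 2 times
    print(phi_quantile(A, 0.5))    # one valid median of A
```

Note that the Φ-quantile need not be unique: in the example multiset both 2 and 3 satisfy the definition for Φ = 0.5.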