ABSTRACT
We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two data sets, there is a way to merge the two summaries into a single summary on the union of the two data sets, while preserving the error and size guarantees. This property means that the summaries can be merged in a way like other algebraic operators such as sum and max, which is especially useful for computing summaries on massive distributed data. Several data summaries are trivially mergeable by construction, most notably all the sketches that are linear functions of the data sets. But some other fundamental ones, like those for heavy hitters and quantiles, are not (known to be) mergeable. In this paper, we demonstrate that these summaries are indeed mergeable or can be made mergeable after appropriate modifications. Specifically, we show that for ε-approximate heavy hitters, there is a deterministic mergeable summary of size O(1/ε); for ε-approximate quantiles, there is a deterministic summary of size O((1/ε) log(εn)) that has a restricted form of mergeability, and a randomized one of size O((1/ε) log^{3/2}(1/ε)) with full mergeability. We also extend our results to geometric summaries such as ε-approximations and ε-kernels.
We also achieve two results of independent interest: (1) we provide the best known randomized streaming bound for ε-approximate quantiles that depends only on ε, of size O((1/ε) log^{3/2}(1/ε)), and (2) we demonstrate that the MG and the SpaceSaving summaries for heavy hitters are isomorphic.
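The abstract does not spell out a merge procedure, so the following is only a minimal sketch of how an MG (Misra-Gries) heavy-hitters summary is commonly merged: add the counters of the two summaries item by item, then subtract the (k+1)-th largest counter from every counter and keep only the positive ones. The function names `mg_update` and `mg_merge` and the parameter `k` (the maximum number of counters, playing the role of roughly 1/ε) are illustrative assumptions, not names taken from the paper.

```python
from collections import Counter


def mg_update(summary, item, k):
    """Feed one item into an MG summary (a dict) holding at most k counters."""
    if item in summary or len(summary) < k:
        summary[item] = summary.get(item, 0) + 1
    else:
        # Summary is full and the item is untracked: decrement every
        # counter and drop the ones that reach zero.
        for key in list(summary):
            summary[key] -= 1
            if summary[key] == 0:
                del summary[key]
    return summary


def mg_merge(s1, s2, k):
    """Merge two MG summaries into a single one with at most k counters."""
    combined = Counter(s1)
    combined.update(s2)  # add counters item by item
    if len(combined) <= k:
        return dict(combined)
    # Subtract the (k+1)-th largest counter from all counters and keep
    # only those that stay positive; at most k counters can survive.
    threshold = sorted(combined.values(), reverse=True)[k]
    return {x: c - threshold for x, c in combined.items() if c > threshold}
```

Under these assumptions, each counter in the merged summary underestimates the true frequency in the combined data by at most n/(k+1), where n is the total number of items, which is the same form of guarantee a single MG summary gives on one stream.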