DOI: 10.1145/2213556.2213562
research-article

Mergeable summaries

Published: 21 May 2012

ABSTRACT

We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two data sets, there is a way to merge the two summaries into a single summary on the union of the two data sets, while preserving the error and size guarantees. This property means that the summaries can be merged in a way like other algebraic operators such as sum and max, which is especially useful for computing summaries on massive distributed data. Several data summaries are trivially mergeable by construction, most notably all the sketches that are linear functions of the data sets. But some other fundamental ones, like those for heavy hitters and quantiles, are not (known to be) mergeable. In this paper, we demonstrate that these summaries are indeed mergeable or can be made mergeable after appropriate modifications. Specifically, we show that for ε-approximate heavy hitters, there is a deterministic mergeable summary of size O(1/ε); for ε-approximate quantiles, there is a deterministic summary of size O((1/ε) log(εn)) that has a restricted form of mergeability, and a randomized one of size O((1/ε) log^{3/2}(1/ε)) with full mergeability. We also extend our results to geometric summaries such as ε-approximations and ε-kernels.
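To illustrate why linear sketches are trivially mergeable, here is a minimal Python sketch of a Count-Min-style counter table. The table sizes, the SHA-256 based bucketing, and the function names (`sketch`, `merge`, `estimate`) are assumptions chosen for illustration and are not taken from the paper. Because every cell is a sum of per-item contributions, adding two tables cell by cell gives exactly the table that would have been built on the combined data.

```python
import hashlib

# Illustrative sizes only; they are not tuned to any particular ε.
WIDTH, DEPTH = 16, 3

def _bucket(item, row):
    """Map an item to a column of the given row (deterministic hash)."""
    digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % WIDTH

def sketch(items):
    """Build the counter table for a multiset of items."""
    table = [[0] * WIDTH for _ in range(DEPTH)]
    for item in items:
        for row in range(DEPTH):
            table[row][_bucket(item, row)] += 1
    return table

def merge(s1, s2):
    """Merge two tables by cell-wise addition (the sketch is linear)."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(s1, s2)]

def estimate(table, item):
    """Frequency estimate; it can only overcount, never undercount."""
    return min(table[row][_bucket(item, row)] for row in range(DEPTH))

a = ["x", "y", "x"]
b = ["x", "z"]
merged = merge(sketch(a), sketch(b))
```

In this example `merged` is identical to `sketch(a + b)`, so the size and error behaviour of the merged table are the same as if it had been built in one pass over both data sets.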

We also achieve two results of independent interest: (1) we provide the best known randomized streaming bound for ε-approximate quantiles that depends only on ε, of size O((1/ε) log^{3/2}(1/ε)), and (2) we demonstrate that the MG and the SpaceSaving summaries for heavy hitters are isomorphic.
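The abstract does not spell out how heavy-hitter summaries are merged. As a hedged illustration, the Python sketch below builds Misra-Gries (MG) summaries with at most k counters and merges two of them by adding counters and then subtracting the (k+1)-th largest combined value, keeping only positive counters. This merge rule and the names `mg_summary` and `mg_merge` are assumptions made for this example, not a transcription of the paper's algorithm.

```python
from collections import Counter

def mg_summary(stream, k):
    """Misra-Gries summary of a stream using at most k counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

def mg_merge(s1, s2, k):
    """Merge two summaries back down to at most k counters (assumed rule)."""
    combined = Counter(s1)
    combined.update(s2)  # adds counts item by item
    if len(combined) <= k:
        return dict(combined)
    # Subtract the (k+1)-th largest counter value; keep positive counters only.
    threshold = sorted(combined.values(), reverse=True)[k]
    return {item: c - threshold for item, c in combined.items() if c > threshold}

k = 2
left, right = list("aaabbc"), list("aaddde")
merged = mg_merge(mg_summary(left, k), mg_summary(right, k), k)
truth = Counter(left + right)
n = len(left) + len(right)
```

Here `merged` holds at most k counters, and each stored count underestimates the true frequency in `truth` by at most n/(k+1), which is the same guarantee a single MG summary of size k gives on a stream of length n.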


Published in

PODS '12: Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems
May 2012
332 pages
ISBN: 9781450312486
DOI: 10.1145/2213556

Copyright © 2012 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 476 of 1,835 submissions, 26%
