ABSTRACT
When processing massive data sets, a core task is to construct synopses of the data. To be useful, a synopsis data structure should be easy to construct while also yielding good approximations of the relevant properties of the data set. A particularly useful class of synopses is sketches, i.e., synopses based on linear projections of the data. Sketches are applicable in many models, including various parallel, stream, and compressed sensing settings. A rich body of analytic and empirical work exists for sketching numerical data such as the frequencies of a set of entities. Our work investigates graph sketching, where the graphs of interest encode the relationships between these entities. The main challenge is to capture this richer structure and build the necessary synopses with only linear measurements.
In this paper we consider properties of graphs including the size of the cuts, the distances between nodes, and the prevalence of dense subgraphs. Our main result is a sketch-based sparsifier construction: we show that Õ(nε⁻²) random linear projections of a graph on n nodes suffice to (1+ε)-approximate all cut values. Similarly, we show that Õ(ε⁻²) linear projections suffice for (additively) approximating the fraction of induced subgraphs that match a given pattern such as a small clique. Finally, for distance estimation we present sketch-based spanner constructions. In this last result the sketches are adaptive, i.e., the linear projections are performed in a small number of batches, where each projection may be chosen dependent on the outcome of earlier sketches. All of the above results immediately give rise to data stream algorithms that also apply to dynamic graph streams, where edges are both inserted and deleted. The non-adaptive sketches, such as those for sparsification and subgraphs, give us single-pass algorithms for distributed data streams with insertions and deletions. The adaptive sketches can be used to analyze MapReduce algorithms that use a small number of rounds.
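The reason linear projections handle deletions and distributed streams can be illustrated with a small sketch. The code below is only an illustration of linearity, not the sparsifier, subgraph, or spanner construction of the paper: the class name `LinearGraphSketch`, the helpers `merge` and `sketch_of`, and the choice of random ±1 projections of the edge-indicator vector are all assumptions made for this example. It also stores every random sign explicitly, which a real small-space algorithm would avoid.

```python
import random
from itertools import combinations


class LinearGraphSketch:
    """Hypothetical illustration: k random +/-1 projections of the
    edge-indicator vector of a graph on n nodes."""

    def __init__(self, n, k, seed):
        rng = random.Random(seed)
        # One random sign per (potential edge, projection). Sites that
        # share a seed use identical projections, so sketches can be added.
        self.signs = {
            edge: [rng.choice((-1, 1)) for _ in range(k)]
            for edge in combinations(range(n), 2)
        }
        self.values = [0] * k

    def update(self, u, v, delta):
        """Apply an edge insertion (delta=+1) or deletion (delta=-1)."""
        edge = (min(u, v), max(u, v))
        for i, sign in enumerate(self.signs[edge]):
            self.values[i] += delta * sign


def merge(a, b):
    """Sketch of the combined stream: the coordinate-wise sum."""
    return [x + y for x, y in zip(a.values, b.values)]


def sketch_of(stream, n=5, k=4, seed=7):
    """Build a sketch from a stream of (u, v, delta) updates."""
    sketch = LinearGraphSketch(n, k, seed)
    for u, v, delta in stream:
        sketch.update(u, v, delta)
    return sketch


# Two sites observe different parts of a dynamic graph stream.
site_a = [(0, 1, +1), (1, 2, +1), (0, 1, -1)]  # edge (0,1) inserted, then deleted
site_b = [(2, 3, +1), (3, 4, +1)]

# The surviving graph has edges (1,2), (2,3), (3,4).
final_graph = [(1, 2, +1), (2, 3, +1), (3, 4, +1)]

combined = merge(sketch_of(site_a), sketch_of(site_b))
print(combined == sketch_of(final_graph).values)
```

Because each measurement is a linear function of the edge vector, a deletion cancels the matching insertion exactly, and the sum of per-site sketches equals the sketch of the whole graph. This is the property that lets non-adaptive sketches yield single-pass algorithms for distributed streams with insertions and deletions.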
Index Terms
Graph sketches: sparsification, spanners, and subgraphs