ABSTRACT
We consider the problem of computing a relational query q on a large input database of size n, using a large number p of servers. The computation is performed in rounds, and each server can receive only O(n/p^(1-ε)) bits of data, where ε ∈ [0,1] is a parameter that controls replication. We examine how many global communication steps are needed to compute q. We establish both lower and upper bounds, in two settings. For a single round of communication, we give lower bounds in the strongest possible model, where arbitrary bits may be exchanged; we show that any algorithm requires ε ≥ 1 - 1/τ*, where τ* is the fractional vertex cover of the hypergraph of q. We also give an algorithm that matches the lower bound for a specific class of databases. For multiple rounds of communication, we present lower bounds in a model where routing decisions for a tuple are tuple-based. We show that for the class of tree-like queries there exists a tradeoff between the number of rounds and the space exponent ε. The lower bounds for multiple rounds are the first of their kind. Our results also imply that transitive closure cannot be computed in O(1) rounds of communication.
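To make the one-round bound ε ≥ 1 - 1/τ* concrete, the following is a minimal sketch, not the paper's algorithm, that computes τ* for a small query and derives the resulting lower bound on the space exponent. It assumes every atom is binary and joins two distinct variables, so the hypergraph of q is an ordinary graph; in that case the fractional vertex cover LP is known to have an optimal solution with weights in {0, 1/2, 1}, which the sketch enumerates exhaustively. The function names and the tuple encoding of atoms are hypothetical choices made for illustration.

```python
from fractions import Fraction
from itertools import product


def fractional_vertex_cover_binary(atoms):
    """Return tau* for a query whose atoms are pairs of distinct variables.

    Assumption (not from the abstract): all atoms are binary, so the
    hypergraph is a graph and some optimal fractional vertex cover is
    half-integral. Brute force over {0, 1/2, 1} weights is then exact,
    but it is exponential in the number of variables.
    """
    variables = sorted({v for atom in atoms for v in atom})
    choices = (Fraction(0), Fraction(1, 2), Fraction(1))
    best = None
    for weights in product(choices, repeat=len(variables)):
        w = dict(zip(variables, weights))
        # A cover must give every atom (edge) total weight at least 1.
        if all(w[a] + w[b] >= 1 for a, b in atoms):
            total = sum(weights, Fraction(0))
            if best is None or total < best:
                best = total
    return best


def space_exponent_lower_bound(tau_star):
    """One-round lower bound on the space exponent: 1 - 1/tau*."""
    return 1 - 1 / Fraction(tau_star)


if __name__ == "__main__":
    # Triangle query: R(x, y), S(y, z), T(z, x).
    triangle = [("x", "y"), ("y", "z"), ("z", "x")]
    tau = fractional_vertex_cover_binary(triangle)
    print(tau)                              # 3/2
    print(space_exponent_lower_bound(tau))  # 1/3
```

For the triangle query, weight 1/2 on each variable is an optimal cover, so τ* = 3/2 and any one-round algorithm needs ε ≥ 1/3. A single two-way join R(x, y), S(y, z) has τ* = 1, giving the trivial bound ε ≥ 0.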
Index Terms
Communication steps for parallel query processing