ABSTRACT
The availability of large data centers with tens of thousands of servers has led to the popular adoption of massive parallelism for data analysis on large datasets. Several query languages exist for running queries on massively parallel architectures, some based on the MapReduce infrastructure, others using proprietary implementations. Motivated by this trend, this paper analyzes the parallel complexity of conjunctive queries. We propose a very simple model of parallel computation that captures these architectures, in which the complexity parameter is the number of parallel steps requiring synchronization of all servers. We study the complexity of conjunctive queries and give a complete characterization of the queries which can be computed in one parallel step. These form a strict subset of hierarchical queries, and include flat queries like R(x,y), S(x,z), T(x,v), U(x,w), tall queries like R(x), S(x,y), T(x,y,z), U(x,y,z,w), and combinations thereof, which we call tall-flat queries. We describe an algorithm for computing in parallel any tall-flat query, and prove that any query that is not tall-flat cannot be computed in one step in this model. Finally, we present extensions of our results to queries that are not tall-flat.
- F. N. Afrati and J. D. Ullman. Optimizing joins in a map-reduce environment. In EDBT, pages 99--110, 2010. Google Scholar
Digital Library
- P. Alvaro, W. Marczak, N. Conway, J. M. Hellerstein, D. Maier, and R. C. Sears. Dedalus: Datalog in time and space. Technical Report UCB/EECS-2009-173, EECS Department, University of California, Berkeley, Dec 2009.Google Scholar
Cross Ref
- L. Carter and M. N. Wegman. Universal classes of hash functions. J. Comput. Syst. Sci., 18(2):143--154, 1979.Google Scholar
Cross Ref
- R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. Scope: easy and efficient parallel processing of massive data sets. Proc. VLDB Endow., 1:1265--1276, August 2008. Google Scholar
Digital Library
- S. Cohen. Containment of aggregate queries. SIGMOD Record, 34(1):77--85, 2005. Google Scholar
Digital Library
- D. E. Culler, R. M. Karp, D. A. Patterson, A. Sahay, K. E. Schauser, E. E. Santos, R. Subramonian, and T. von Eicken. Logp: Towards a realistic model of parallel computation. In PPOPP, pages 1--12, 1993. Google Scholar
Digital Library
- N. N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. VLDB J., 16(4):523--544, 2007. Google Scholar
Digital Library
- J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, pages 137--150, 2004. Google Scholar
Digital Library
- D. J. DeWitt and J. Gray. Parallel database systems: The future of high performance database systems. Commun. ACM, 35(6):85--98, 1992. Google Scholar
Digital Library
- A. Gates, O. Natkovich, S. Chopra, P. Kamath, S. Narayanam, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a highlevel dataflow system on top of mapreduce: The pig experience. PVLDB, 2(2):1414--1425, 2009. Google Scholar
Digital Library
- M. Grohe, Y. Gurevich, D. Leinders, N. Schweikardt, J. Tyszkiewicz, and J. V. den Bussche. Database query processing using finite cursor machines. Theory Comput. Syst., 44(4):533--560, 2009. Google Scholar
Digital Library
- J. M. Hellerstein. The declarative imperative: experiences and conjectures in distributed logic. SIGMOD Rec., 39:5--19, September 2010. Google Scholar
Digital Library
- N. Immerman. Expressibility and parallel complexity. SIAM J. Comput., 18(3):625--638, 1989. Google Scholar
Digital Library
- H. J. Karloff, S. Suri, and S. Vassilvitskii. A model of computation for mapreduce. In SODA, pages 938--948, 2010. Google Scholar
Digital Library
- P. Koutris and D. Suciu. Parallel evaluation of conjunctive queries. Research Report UW-CSE-11-03-01, University of Washington, 2011.Google Scholar
Digital Library
- L. Libkin. Elements of Finite Model Theory. Springer, 2004. Google Scholar
Digital Library
- S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: Interactive analysis of web-scale datasets. PVLDB, 3(1):330--339, 2010. Google Scholar
Digital Library
- A. Pagh and R. Pagh. Uniform hashing in constant time and optimal space. SIAM J. Comput., 38(1):85--96, 2008. Google Scholar
Digital Library
- M. Raab and A. Steger. "balls into bins" - a simple and tight analysis. In RANDOM, pages 159--170, 1998. Google Scholar
Digital Library
- P. Sanders. On the competitive analysis of randomized static load balancing. In Proceedings of the first Workshop on Randomized Parallel Algorithms, RANDOM, 1996.Google Scholar
- L. J. Stockmeyer and U. Vishkin. Simulation of parallel random access machines by circuits. SIAM J. Comput., 13(2):409--422, 1984.Google Scholar
- A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive - a warehousing solution over a map-reduce framework. PVLDB, 2(2):1626--1629, 2009. Google Scholar
Digital Library
- L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103--111, 1990. Google Scholar
Digital Library
- Y. Xu, P. Kostamaa, X. Zhou, and L. Chen. Handling data skew in parallel joins in shared-nothing systems. In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1043--1052, New York, NY, USA, 2008. ACM. Google Scholar
Digital Library
- Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey. Dryadlinq: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI, pages 1--14, 2008. Google Scholar
Digital Library
Index Terms
Parallel evaluation of conjunctive queries
Recommendations
The complexity of acyclic conjunctive queries
This paper deals with the evaluation of acyclic Boolean conjunctive queries in relational databases. By well-known results of Yannakakis[1981], this problem is solvable in polynomial time; its precise complexity, however, has not been pinpointed so far. We ...
Interleaving a Join Sequence with Semijoins in Distributed Query Processing
The problem of combining join and semijoin reducers for distributed query processing is studied. An approach based on interleaving a join sequence with beneficial semijoins is proposed. A join sequence is mapped into a join sequence tree first. The join ...
The Complexity of Acyclic Conjunctive Queries
FOCS '98: Proceedings of the 39th Annual Symposium on Foundations of Computer ScienceWe show that the problem of evaluating acyclic Boolean database-queries is LOGCFL-complete and thus highly parallelizable. We present a parallel database algorithm solving this problem with a logarithmic number of parallel join operations. We also show ...






Comments