ABSTRACT
Building and debugging distributed software remains extremely difficult. We conjecture that by adopting a data-centric approach to system design and by employing declarative programming languages, a broad range of distributed software can be recast naturally in a data-parallel programming model. Our hope is that this model can significantly raise the level of abstraction for programmers, improving code simplicity, speed of development, ease of software evolution, and program correctness.
This paper presents our experience with an initial large-scale experiment in this direction. First, we used the Overlog language to implement a "Big Data" analytics stack that is API-compatible with Hadoop and HDFS and provides comparable performance. Second, we extended the system with complex distributed features not yet available in Hadoop, including high availability, scalability, and unique monitoring and debugging facilities. We present both quantitative and anecdotal results from our experience, providing some concrete evidence that both data-centric design and declarative languages can substantially simplify distributed systems programming.
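As a rough illustration of what "data-centric" and "declarative" can mean in practice, the sketch below stores system state as a relation (a set of tuples) and derives new facts by applying rules until nothing changes. It is written in Python rather than Overlog, and the relation names (`file`, `fqpath`), the sample tuples, and the function `derive_fqpath` are all hypothetical choices made for this example. They are not taken from the system described above.

```python
# Hypothetical relation file(file_id, parent_id, name); the root has parent None.
# State is plain data: a set of tuples, not objects with hidden pointers.
FILES = {
    (0, None, ""),
    (1, 0, "data"),
    (2, 1, "logs"),
    (3, 1, "part-0000"),
}


def derive_fqpath(files):
    """Naive fixpoint evaluation of two Datalog-style rules (illustrative only):

        fqpath("/", F)           :- file(F, None, _)
        fqpath(P + "/" + N, F)   :- file(F, Parent, N), fqpath(P, Parent)
    """
    # Base rule: the root directory has the path "/".
    fqpath = {("/", fid) for fid, parent, _ in files if parent is None}
    while True:
        # Recursive rule: join file with fqpath on parent_id == file_id.
        derived = {
            (("" if parent_path == "/" else parent_path) + "/" + name, fid)
            for fid, parent, name in files
            for parent_path, parent_id in fqpath
            if parent == parent_id
        }
        new_facts = derived - fqpath
        if not new_facts:
            return fqpath  # fixpoint reached: no rule derives anything new
        fqpath |= new_facts


paths = derive_fqpath(FILES)
for path, fid in sorted(paths):
    print(fid, path)
```

The point of the sketch is only the style: the path lookup is expressed as a join over relations rather than as a tree traversal, so the same state can be queried, monitored, or checked with additional rules without changing how it is stored.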
Boom analytics: exploring data-centric, declarative programming for the cloud