Abstract
Many applications associated with live business intelligence are written as complex data analysis programs defined by directed acyclic graphs (DAGs) of MapReduce jobs, for example, using the Pig, Hive, or Scope frameworks. An increasing number of these applications have additional requirements for completion time guarantees. In this article, we consider the popular Pig framework, which provides a high-level SQL-like abstraction on top of the MapReduce engine for processing large data sets. There is a lack of performance models and analysis tools for automated performance management of such MapReduce jobs. We offer a performance modeling environment for Pig programs that automatically profiles jobs from past runs and aims to solve the following interrelated problems: (i) estimating the completion time of a Pig program as a function of allocated resources; (ii) estimating the amount of resources (the number of map and reduce slots) required for completing a Pig program within a given (soft) deadline. First, we design a basic performance model that accurately predicts completion time and required resource allocation for a Pig program that is defined as a sequence of MapReduce jobs: predicted completion times are within 10% of the measured ones. Second, we optimize Pig program execution by enforcing the optimal schedule of its concurrent jobs. For DAGs with concurrent jobs, this optimization reduces the program completion time by 10%–27% in our experiments. Moreover, it eliminates possible nondeterminism in the execution of concurrent jobs in the Pig program and therefore enables a more accurate performance model for Pig programs. Third, based on these optimizations, we propose a refined performance model for Pig programs with concurrent jobs. The proposed approach leads to significant resource savings (20%–60% in our experiments) compared with the original, unoptimized solution.
We validate our solution using a 66-node Hadoop cluster and a diverse set of workloads: the PigMix benchmark, TPC-H queries, and customized queries mining a collection of HP Labs’ web proxy logs.
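The two estimation problems stated above, (i) completion time as a function of allocated slots and (ii) slots needed for a given deadline, can be illustrated with a small sketch. The abstract does not give the model's formulas, so this example is only an assumption: it uses the well-known lower and upper makespan bounds for greedy assignment of tasks to slots, takes their midpoint per stage, and sums the stages over a sequence of jobs. The names `StageProfile`, `stage_bounds`, `estimate_program_time`, and `min_slots_for_deadline` are hypothetical, as are the profile numbers.

```python
from dataclasses import dataclass


@dataclass
class StageProfile:
    """Hypothetical profile of one stage (map or reduce) taken from a past run."""
    num_tasks: int
    avg_task_sec: float
    max_task_sec: float


def stage_bounds(stage, slots):
    """Lower and upper makespan bounds for greedy assignment of tasks to slots."""
    low = stage.num_tasks * stage.avg_task_sec / slots
    up = (stage.num_tasks - 1) * stage.avg_task_sec / slots + stage.max_task_sec
    return low, up


def estimate_program_time(jobs, map_slots, reduce_slots):
    """Estimate for a sequence of jobs: sum of the per-stage bound midpoints.

    Assumption: jobs run one after another, and each job is a
    (map_stage, reduce_stage) pair of StageProfile objects.
    """
    total = 0.0
    for map_stage, reduce_stage in jobs:
        for stage, slots in ((map_stage, map_slots), (reduce_stage, reduce_slots)):
            low, up = stage_bounds(stage, slots)
            total += (low + up) / 2
    return total


def min_slots_for_deadline(jobs, deadline_sec, max_slots):
    """Smallest equal map/reduce slot count whose estimate meets the deadline.

    Returns None if no allocation up to max_slots is enough.
    """
    for slots in range(1, max_slots + 1):
        if estimate_program_time(jobs, slots, slots) <= deadline_sec:
            return slots
    return None


# Illustrative (made-up) profile: one job with 100 map tasks and 20 reduce tasks.
jobs = [(StageProfile(100, 10.0, 20.0), StageProfile(20, 30.0, 40.0))]
estimate = estimate_program_time(jobs, map_slots=10, reduce_slots=10)
needed = min_slots_for_deadline(jobs, deadline_sec=200.0, max_slots=64)
```

The linear search keeps the sketch simple and is correct here because the estimate never increases as slots are added. It also ties map and reduce slot counts together, which is a simplification; a real allocator could search over the two counts independently.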
Index Terms
Performance Modeling and Optimization of Deadline-Driven Pig Programs