research-article

Performance Modeling and Optimization of Deadline-Driven Pig Programs

Published: 01 September 2013

Abstract

Many applications associated with live business intelligence are written as complex data analysis programs defined by directed acyclic graphs (DAGs) of MapReduce jobs, for example, using the Pig, Hive, or Scope frameworks. An increasing number of these applications have additional requirements for completion time guarantees. In this article, we consider the popular Pig framework, which provides a high-level SQL-like abstraction on top of the MapReduce engine for processing large data sets. There is a lack of performance models and analysis tools for automated performance management of such MapReduce jobs. We offer a performance modeling environment for Pig programs that automatically profiles jobs from past runs and aims to solve the following interrelated problems: (i) estimating the completion time of a Pig program as a function of allocated resources; (ii) estimating the amount of resources (the number of map and reduce slots) required for completing a Pig program within a given (soft) deadline. First, we design a basic performance model that accurately predicts completion time and required resource allocation for a Pig program that is defined as a sequence of MapReduce jobs: predicted completion times are within 10% of the measured ones. Second, we optimize a Pig program's execution by enforcing the optimal schedule of its concurrent jobs. For DAGs with concurrent jobs, this optimization helps reduce the program completion time by 10% to 27% in our experiments. Moreover, it eliminates possible nondeterminism in the execution of concurrent jobs in the Pig program and therefore enables a more accurate performance model for Pig programs. Third, based on these optimizations, we propose a refined performance model for Pig programs with concurrent jobs. The proposed approach leads to significant resource savings (20% to 60% in our experiments) compared with the original, unoptimized solution.
We validate our solution using a 66-node Hadoop cluster and a diverse set of workloads: PigMix benchmark, TPC-H queries, and customized queries mining a collection of HP Labs’ web proxy logs.
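
The two estimation problems named in the abstract can be illustrated with a minimal sketch. It is not the article's model: it assumes a program that is a plain sequence of MapReduce jobs, uses the standard lower bound for greedy task assignment (n tasks of average duration avg on k slots take at least n * avg / k), ignores the shuffle phase, and gives map and reduce stages the same number of slots. The names `JobProfile`, `program_time`, and `min_slots_for_deadline` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JobProfile:
    """Hypothetical profile of one MapReduce job taken from a past run."""
    num_maps: int       # number of map tasks
    avg_map: float      # average map task duration, seconds
    num_reduces: int    # number of reduce tasks
    avg_reduce: float   # average reduce task duration, seconds


def stage_time(num_tasks, avg_duration, slots):
    """Lower bound on the makespan of a stage run greedily on `slots` slots."""
    return num_tasks * avg_duration / slots


def program_time(jobs, map_slots, reduce_slots):
    """Problem (i): estimated completion time of a sequence of jobs.

    Assumes jobs run one after another, so stage times simply add up.
    """
    return sum(
        stage_time(j.num_maps, j.avg_map, map_slots)
        + stage_time(j.num_reduces, j.avg_reduce, reduce_slots)
        for j in jobs
    )


def min_slots_for_deadline(jobs, deadline, max_slots):
    """Problem (ii): smallest slot count s (used for both map and reduce
    stages, a simplifying assumption) whose estimate meets the deadline.

    Returns None if no allocation up to max_slots is sufficient.
    """
    for s in range(1, max_slots + 1):
        if program_time(jobs, s, s) <= deadline:
            return s
    return None


# Made-up profiles for a two-job program.
jobs = [
    JobProfile(num_maps=100, avg_map=10.0, num_reduces=20, avg_reduce=30.0),
    JobProfile(num_maps=50, avg_map=8.0, num_reduces=10, avg_reduce=20.0),
]
estimate = program_time(jobs, map_slots=10, reduce_slots=10)
slots = min_slots_for_deadline(jobs, deadline=120.0, max_slots=64)
```

Because the bound is optimistic, an estimate like this is best read as a floor on completion time; a usable model would also need an upper bound and measured shuffle costs.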

Published in

ACM Transactions on Autonomous and Adaptive Systems, Volume 8, Issue 3
September 2013
110 pages
ISSN: 1556-4665
EISSN: 1556-4703
DOI: 10.1145/2518017

Copyright © 2013 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 1 September 2013
• Revised: 1 July 2013
• Accepted: 1 July 2013
• Received: 1 February 2013

Qualifiers

• research-article
• Research
• Refereed
