Abstract
In-Memory cluster Computing (IMC) frameworks (e.g., Spark) have become increasingly important because they typically achieve more than 10× speedups over traditional On-Disk cluster Computing (ODC) frameworks for iterative and interactive applications. Like ODC frameworks, IMC frameworks typically run the same programs repeatedly on a given cluster with a similar input dataset size each time. Building a performance model for an IMC program is challenging for two reasons: 1) the performance of IMC programs is more sensitive to the size of the input dataset, which is known to be difficult to incorporate into a performance model because of its complex effects on performance; 2) the number of performance-critical configuration parameters in IMC is much larger than in ODC (more than 40 vs. around 10), and this high dimensionality requires more sophisticated models to achieve high accuracy. To address this challenge, we propose DAC, a datasize-aware auto-tuning approach that efficiently identifies the high-dimensional configuration with which a given IMC program achieves optimal performance on a given cluster. DAC is a significant advance over the state of the art because its performance model for a given IMC program takes both the size of the input dataset and 41 configuration parameters as inputs, which is unprecedented in previous work. This is made possible by two key techniques: 1) Hierarchical Modeling (HM), which combines a number of individual sub-models in a hierarchical manner; 2) a Genetic Algorithm (GA), which is employed to search for the optimal configuration. To evaluate DAC, we use six typical Spark programs, each with five different input dataset sizes. The evaluation results show that, compared to the default configurations, DAC improves the performance of these programs by a factor of 30.4x on average and up to 89x.
We also report that the geometric mean speedups of DAC over the default configurations, the expert-tuned configurations, and the configurations found by RFHOC are 15.4x, 2.3x, and 1.5x, respectively.
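The search step described in the abstract can be illustrated with a minimal sketch. It is not DAC's implementation: the function predicted_time is a hypothetical toy stand-in for a trained performance model, every configuration parameter is assumed to be normalized to the range [0, 1], and the population size, number of generations, and mutation rate are arbitrary illustrative values. The sketch only shows how a genetic algorithm can search a 41-dimensional configuration space against a model that also takes the input dataset size as a parameter.

```python
import random


def predicted_time(config, datasize_gb):
    """Hypothetical toy performance model (NOT the paper's model).

    Returns a predicted execution time for a normalized configuration.
    The best value of each parameter shifts with the input dataset size,
    which mimics a datasize-aware model.
    """
    target = min(1.0, datasize_gb / 100.0)
    return 1.0 + sum((x - target) ** 2 for x in config)


def ga_search(model, datasize_gb, n_params=41, pop_size=40,
              generations=60, mutation_rate=0.05, seed=0):
    """Search for a configuration that minimizes model(config, datasize_gb)."""
    rng = random.Random(seed)
    # Initial population: random configurations with values in [0, 1].
    pop = [[rng.random() for _ in range(n_params)] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the better half (lower predicted time) as the elite.
        pop.sort(key=lambda c: model(c, datasize_gb))
        elite = pop[: pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            # Crossover: splice two distinct elite parents at a random cut.
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n_params)
            child = a[:cut] + b[cut:]
            # Mutation: occasionally resample a parameter value.
            child = [rng.random() if rng.random() < mutation_rate else x
                     for x in child]
            children.append(child)
        pop = elite + children
    return min(pop, key=lambda c: model(c, datasize_gb))


if __name__ == "__main__":
    for size in (10.0, 50.0, 90.0):
        best = ga_search(predicted_time, size)
        print(size, round(predicted_time(best, size), 4))
```

Because the elite half of each generation is carried over unchanged, the best predicted time never gets worse from one generation to the next. In a real setting the toy model would be replaced by a model trained on measured runs, and each normalized value would be mapped back to the legal range of the corresponding configuration parameter.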
Datasize-Aware High Dimensional Configurations Auto-Tuning of In-Memory Cluster Computing
ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems