DOI: 10.5555/2789770.2789791
Article

Making sense of performance in data analytics frameworks

Published: 04 May 2015

ABSTRACT

There has been much research devoted to improving the performance of data analytics frameworks, but comparatively little effort has been spent systematically identifying the performance bottlenecks of these systems. In this paper, we develop blocked time analysis, a methodology for quantifying performance bottlenecks in distributed computation frameworks, and use it to analyze the Spark framework's performance on two SQL benchmarks and a production workload. Contrary to our expectations, we find that (i) CPU (and not I/O) is often the bottleneck, (ii) improving network performance can improve job completion time by a median of at most 2%, and (iii) the causes of most stragglers can be identified.
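
The abstract does not spell out the mechanics of blocked time analysis, so the following is only a minimal sketch of one plausible reading: record how long each task spends blocked on a resource (for example, the network), re-simulate the job with that blocked time removed, and treat the resulting speedup as an upper bound on what optimizing that resource could achieve. The function names, the greedy slot scheduler, and the task numbers are all hypothetical and chosen for illustration; they are not taken from the paper or from Spark.

```python
import heapq

def completion_time(durations, slots):
    """Simulate a job: each task, in order, starts on the earliest-free slot.

    This greedy scheduler is an assumption made for illustration only.
    """
    free_at = [0.0] * slots
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)          # earliest-free slot
        heapq.heappush(free_at, start + duration)
    return max(free_at)                          # time the last task finishes

def max_improvement(tasks, slots):
    """Upper bound on fractional speedup if blocked time were eliminated.

    `tasks` is a list of (runtime, blocked_time) pairs in seconds.
    """
    original = completion_time([run for run, _ in tasks], slots)
    unblocked = completion_time([run - blocked for run, blocked in tasks], slots)
    return (original - unblocked) / original

# Hypothetical measurements: (task runtime, time blocked on the network).
tasks = [(10.0, 1.0), (8.0, 0.0), (6.0, 2.0), (4.0, 0.5)]
bound = max_improvement(tasks, slots=2)
print(f"Network optimizations could help by at most {bound:.1%}")
```

Because the simulation removes all blocked time, the result is a best-case bound rather than a prediction, which matches the "at most" phrasing of finding (ii).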

Published in

NSDI'15: Proceedings of the 12th USENIX Conference on Networked Systems Design and Implementation
May 2015
620 pages
ISBN: 9781931971218

Publisher

USENIX Association, United States
