DProf: distributed profiler with strong guarantees

Published: 10 October 2019
Abstract

Performance analysis of a distributed system is typically performed by collecting profiles whose underlying events are timestamped by the unsynchronized clocks of the multiple machines in the system. To allow timestamps taken at different machines to be compared, several timestamp synchronization algorithms have been developed. However, the inaccuracies of these algorithms can propagate into the final results of performance analysis. To address this problem, we develop DProf, a system for constructing distributed performance profiles. At the core of DProf is a new timestamp synchronization algorithm, FreeZer, that tightly bounds the inaccuracy in a converted timestamp to a time interval. This not only allows timestamps from different machines to be compared, it also maintains strong guarantees throughout the comparison, which can be carefully transformed into guarantees for analysis results. To demonstrate the utility of DProf, we use it to implement dCSP and dCOZ, accuracy-bounded distributed versions of the Context Sensitive Profiles and Causal Profiles developed for shared memory systems. While dCSP enables users to ascertain the existence of a performance bottleneck, dCOZ estimates the expected performance benefit from eliminating that bottleneck. Experiments with three distributed applications on a cluster of heterogeneous machines validate that inferences made via dCSP and dCOZ are highly accurate. Moreover, if FreeZer is replaced by either of two existing timestamp synchronization algorithms (linear regression and convex hull), the inferences provided by dCSP and dCOZ are severely degraded.
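To illustrate why bounding a converted timestamp to an interval helps, the following minimal sketch shows how two interval-bounded timestamps could be compared while preserving a guarantee. It is not the FreeZer algorithm, whose details are not given in the abstract. The names `BoundedTimestamp`, `Order`, `compare`, and `elapsed_bounds` are hypothetical, and the sketch assumes only that each converted timestamp is known to lie within a closed interval on a common reference clock.

```python
from dataclasses import dataclass
from enum import Enum


class Order(Enum):
    BEFORE = "before"
    AFTER = "after"
    UNKNOWN = "unknown"  # intervals overlap, so no ordering can be guaranteed


@dataclass(frozen=True)
class BoundedTimestamp:
    """Hypothetical type: a timestamp converted to a reference clock,
    known only to lie somewhere in the closed interval [lo, hi]."""
    lo: float
    hi: float

    def __post_init__(self):
        if self.lo > self.hi:
            raise ValueError("lo must not exceed hi")


def compare(a: BoundedTimestamp, b: BoundedTimestamp) -> Order:
    """Report an ordering only when it holds for every point in both intervals."""
    if a.hi < b.lo:
        return Order.BEFORE
    if b.hi < a.lo:
        return Order.AFTER
    return Order.UNKNOWN


def elapsed_bounds(start: BoundedTimestamp, end: BoundedTimestamp) -> tuple[float, float]:
    """Smallest and largest possible value of (end - start)."""
    return (end.lo - start.hi, end.hi - start.lo)


if __name__ == "__main__":
    send = BoundedTimestamp(1.0, 2.0)
    recv = BoundedTimestamp(5.0, 7.0)
    print(compare(send, recv))         # ordering is guaranteed here
    print(elapsed_bounds(send, recv))  # guaranteed range for the elapsed time
```

In this sketch, an analysis built on such comparisons can report a result together with its bounds or flag it as undecidable. A point estimate, by contrast, gives no indication of how large its error may be.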

Supplemental Material

a156-benavides

Presentation at OOPSLA '19
