ABSTRACT
Many programs exploit shared-memory parallelism using multithreading. Threaded codes typically use locks to coordinate access to shared data. In many cases, contention for locks reduces parallel efficiency and hurts scalability. Being able to quantify and attribute lock contention is important for understanding where a multithreaded program needs improvement.
This paper proposes and evaluates three strategies for gaining insight into performance losses due to lock contention. First, we consider using a straightforward strategy based on call stack profiling to attribute idle time and show that it fails to yield insight into lock contention. Second, we consider an approach that builds on a strategy previously used for analyzing idleness in work-stealing computations; we show that this strategy does not yield insight into lock contention. Finally, we propose a new technique for measurement and analysis of lock contention that uses data associated with locks to blame lock holders for the idleness of spinning threads. Our approach incurs ≤ 5% overhead on a quantum chemistry application that makes extensive use of locking (65M distinct locks, a maximum of 340K live locks, and an average of 30K lock acquisitions per second per thread) and attributes lock contention to its full static and dynamic calling contexts. Our strategy, implemented in HPCToolkit, is fully distributed and should scale well to systems with large core counts.
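The blame-shifting idea in the third strategy can be illustrated with a toy model. The sketch below is not HPCToolkit's implementation. It assumes, purely for illustration, that each lock carries a counter, that a waiting thread adds one unit to that counter each time it is sampled while spinning, and that the holder moves the accumulated total into a profile entry for its own calling context when it releases the lock. The names `BlamedLock`, `charge_idle_sample`, and `simulate`, and the calling-context strings, are hypothetical. The schedule is replayed in a single thread so that the outcome is deterministic.

```python
from collections import defaultdict


class BlamedLock:
    """Toy lock that carries a blame counter (an assumed, illustrative design)."""

    def __init__(self):
        self.holder = None  # calling context of the current holder, if any
        self.blame = 0      # idle samples charged by waiters since the last release

    def try_acquire(self, context):
        """Take the lock if it is free; return True on success."""
        if self.holder is None:
            self.holder = context
            return True
        return False

    def charge_idle_sample(self):
        """Called on behalf of a waiter each time it is sampled while spinning."""
        self.blame += 1

    def release(self, profile):
        """Release the lock and move the accumulated blame to the holder's context."""
        profile[self.holder] += self.blame
        self.blame = 0
        self.holder = None


def simulate():
    """Replay a fixed schedule: thread A holds the lock while thread B spins."""
    profile = defaultdict(int)  # calling context -> idle samples it caused
    lock = BlamedLock()

    # Thread A acquires the free lock.
    lock.try_acquire("main > update_table")

    # Thread B is sampled three times while it fails to get the lock.
    for _ in range(3):
        if not lock.try_acquire("main > read_table"):
            lock.charge_idle_sample()

    # A releases and accepts the three samples of blame.
    lock.release(profile)

    # B now acquires the lock uncontended, so its release carries no blame.
    lock.try_acquire("main > read_table")
    lock.release(profile)
    return dict(profile)
```

In this model the idle samples end up attributed to the context that held the lock (`main > update_table`) rather than to the context that was spinning, which is the distinction the abstract draws between blaming lock holders and profiling where the idleness occurs.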
Analyzing lock contention in multithreaded applications
PPoPP '10