research-article

Dynamic cache contention detection in multi-threaded applications

Published: 09 March 2011

Abstract

In today's multi-core systems, cache contention due to true and false sharing can cause unexpected and significant performance degradation. A detailed understanding of a given multi-threaded application's behavior is required to precisely identify such performance bottlenecks. Traditionally, however, such diagnostic information can only be obtained after lengthy simulation of the memory hierarchy.

In this paper, we present a novel approach that efficiently analyzes interactions between threads to determine thread correlation and detect true and false sharing. It is based on the following key insight: although the slowdown caused by cache contention depends on factors including the thread-to-core binding and parameters of the memory hierarchy, the amount of data sharing is primarily a function of the cache line size and application behavior. Using memory shadowing and dynamic instrumentation, we implemented a tool that obtains detailed sharing information between threads without simulating the full complexity of the memory hierarchy. The runtime overhead of our approach (a 5x slowdown on average relative to native execution) is significantly less than that of detailed cache simulation. The information collected allows programmers to identify the degree of cache contention in an application, the correlation among its threads, and the sources of significant false sharing. Using our approach, we were able to improve the performance of some applications by up to a factor of 12. For other contention-intensive applications, we were able to shed light on the obstacles that prevent their performance from scaling to many cores.
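The key insight above, that sharing depends mainly on the cache line size and on which bytes each thread touches, can be illustrated with a toy model. The sketch below is not the tool described in the paper: it replays a hand-written access trace in Python instead of instrumenting a binary, and the names `LineShadow`, `classify_sharing`, and the 64-byte `LINE_SIZE` are assumptions made for illustration. For each cache line it keeps shadow metadata recording, per thread, which bytes other threads have written since that thread last touched the line. An access that finds such pending writes counts as true sharing if the bytes overlap and as false sharing otherwise.

```python
from collections import defaultdict

LINE_SIZE = 64  # assumed cache line size in bytes


class LineShadow:
    """Hypothetical shadow metadata for one cache line."""

    def __init__(self):
        # Threads that have accessed this line at least once.
        self.threads = set()
        # For each thread: byte offsets written by OTHER threads
        # since that thread's last access to the line.
        self.pending = defaultdict(set)


def classify_sharing(trace, line_size=LINE_SIZE):
    """Count true and false sharing events in an access trace.

    Each trace entry is (thread_id, address, size, is_write).
    Accesses are assumed not to cross a cache line boundary.
    """
    shadow = defaultdict(LineShadow)
    counts = {"true": 0, "false": 0}
    for tid, addr, size, is_write in trace:
        start = addr % line_size
        touched = set(range(start, start + size))
        meta = shadow[addr // line_size]

        pending = meta.pending[tid]
        if pending:
            # Another thread wrote to this line since our last access,
            # so a real cache would have invalidated our copy.
            kind = "true" if pending & touched else "false"
            counts[kind] += 1
            pending.clear()

        meta.threads.add(tid)
        if is_write:
            # Record the written bytes for every other sharer.
            for other in meta.threads:
                if other != tid:
                    meta.pending[other] |= touched
    return counts


# Two threads updating adjacent 8-byte counters in the same line.
adjacent = [(0, 0, 8, True), (1, 8, 8, True),
            (0, 0, 8, True), (1, 8, 8, True)]
# The same counters placed one cache line apart.
padded = [(0, 0, 8, True), (1, 64, 8, True),
          (0, 0, 8, True), (1, 64, 8, True)]

print(classify_sharing(adjacent))
print(classify_sharing(padded))
```

In this model the counts depend only on the trace and `line_size`, not on core binding or cache parameters, which is the property the abstract relies on. Padding the counters onto separate lines removes every event from the second trace.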



    • Published in

      ACM SIGPLAN Notices  Volume 46, Issue 7
      VEE '11
      July 2011
      231 pages
      ISSN:0362-1340
      EISSN:1558-1160
      DOI:10.1145/2007477
      • ACM Conferences
        VEE '11: Proceedings of the 7th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments
        March 2011
        250 pages
        ISBN:9781450306874
        DOI:10.1145/1952682

      Copyright © 2011 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

