Abstract
In today's multi-core systems, cache contention due to true and false sharing can cause unexpected and significant performance degradation. A detailed understanding of a given multi-threaded application's behavior is required to precisely identify such performance bottlenecks. Traditionally, however, such diagnostic information can only be obtained after lengthy simulation of the memory hierarchy.
In this paper, we present a novel approach that efficiently analyzes interactions between threads to determine thread correlation and detect true and false sharing. It is based on the following key insight: although the slowdown caused by cache contention depends on factors including the thread-to-core binding and parameters of the memory hierarchy, the amount of data sharing is primarily a function of the cache line size and application behavior. Using memory shadowing and dynamic instrumentation, we implemented a tool that obtains detailed sharing information between threads without simulating the full complexity of the memory hierarchy. The runtime overhead of our approach (a 5x slowdown on average relative to native execution) is significantly lower than that of detailed cache simulation. The information collected allows programmers to identify the degree of cache contention in an application, the correlation among its threads, and the sources of significant false sharing. Using our approach, we were able to improve the performance of some applications by up to a factor of 12. For other contention-intensive applications, we were able to shed light on the obstacles that prevent their performance from scaling to many cores.
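As a rough illustration of the key insight (sharing depends mainly on the cache line size and on which threads touch which addresses), the sketch below keeps toy shadow metadata per cache line and classifies each line after the fact. It is not the tool's implementation: the names `SharingTracker`, `record`, and `classify`, the 64-byte line size, and the simplified, order-insensitive classification rule are all assumptions made for this example.

```python
from collections import defaultdict

CACHE_LINE_SIZE = 64  # bytes; assumed here, the abstract does not fix a value


class SharingTracker:
    """Hypothetical shadow metadata: per cache line, who read or wrote each byte."""

    def __init__(self, line_size=CACHE_LINE_SIZE):
        self.line_size = line_size
        # line index -> byte address -> {"readers": set, "writers": set}
        self.shadow = defaultdict(
            lambda: defaultdict(lambda: {"readers": set(), "writers": set()})
        )

    def record(self, thread_id, addr, size, is_write):
        """Record one memory access; an access may straddle two cache lines."""
        for a in range(addr, addr + size):
            entry = self.shadow[a // self.line_size][a]
            entry["writers" if is_write else "readers"].add(thread_id)

    def classify(self, line):
        """Return 'private', 'read-shared', 'true', or 'false' for one line."""
        cells = self.shadow.get(line)
        if not cells:
            return "private"
        threads, writers = set(), set()
        for entry in cells.values():
            threads |= entry["readers"] | entry["writers"]
            writers |= entry["writers"]
        if len(threads) < 2:
            return "private"
        if not writers:
            return "read-shared"  # no writes, so no invalidations
        for entry in cells.values():
            accessors = entry["readers"] | entry["writers"]
            # Same byte written by one thread and touched by another.
            if entry["writers"] and len(accessors) >= 2:
                return "true"
        # Several threads and at least one writer, but on disjoint bytes.
        return "false"


tracker = SharingTracker()
tracker.record(thread_id=1, addr=0x1000, size=4, is_write=True)
tracker.record(thread_id=2, addr=0x1004, size=4, is_write=True)
print(tracker.classify(0x1000 // CACHE_LINE_SIZE))
```

Here two threads write adjacent 4-byte fields that fall in the same 64-byte line, so the line is reported as falsely shared. Changing `line_size` changes the outcome without any model of cache levels, associativity, or thread placement, which is the property the abstract relies on. A real detector would also need to account for the order of accesses, which this sketch ignores.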
Dynamic cache contention detection in multi-threaded applications
VEE '11: Proceedings of the 7th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments