Abstract
False sharing is an insidious problem for multithreaded programs running on multicore processors, where it can silently degrade performance and scalability. Previous tools for detecting false sharing are severely limited: they cannot distinguish false sharing from true sharing, have high false positive rates, and provide limited assistance to help programmers locate and resolve false sharing.
This paper presents two tools that attack the problem of false sharing: Sheriff-Detect and Sheriff-Protect. Both tools leverage a framework we introduce here called Sheriff. Sheriff breaks out threads into separate processes, and exposes an API that allows programs to perform per-thread memory isolation and tracking on a per-page basis. We believe Sheriff is of independent interest.
Sheriff-Detect finds instances of false sharing by comparing updates within the same cache lines by different threads, and uses sampling to rank them by performance impact. Sheriff-Detect is precise (no false positives), runs with low overhead (on average, 20%), and is accurate, pinpointing the exact objects involved in false sharing. We present a case study demonstrating Sheriff-Detect's effectiveness at locating false sharing in a variety of benchmarks.
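To make the idea of "comparing updates within the same cache lines by different threads" concrete, here is a minimal sketch in Python. It is not Sheriff-Detect's implementation. It assumes each thread works on a private copy of a page, that writes can be recovered by diffing that copy against a snapshot, and that cache lines are 64 bytes. The names `modified_offsets` and `classify_sharing` are hypothetical.

```python
from collections import defaultdict

CACHE_LINE = 64  # bytes; assumed line size, not stated in the text


def modified_offsets(snapshot, private_copy):
    """Return the byte offsets where a thread's private copy differs from the snapshot."""
    return {i for i, (old, new) in enumerate(zip(snapshot, private_copy)) if old != new}


def classify_sharing(snapshot, copies):
    """Classify each cache line written by two or more threads.

    'false' means the writers touched disjoint bytes of the line;
    'true' means at least two writers touched the same byte.
    """
    writers = defaultdict(dict)  # line index -> {thread name: offsets written}
    for thread, copy in copies.items():
        for off in modified_offsets(snapshot, copy):
            writers[off // CACHE_LINE].setdefault(thread, set()).add(off)

    report = {}
    for line, by_thread in writers.items():
        if len(by_thread) < 2:
            continue  # written by a single thread: no sharing on this line
        seen = set()
        overlap = False
        for offsets in by_thread.values():
            if seen & offsets:
                overlap = True
            seen |= offsets
        report[line] = "true" if overlap else "false"
    return report


# Hypothetical 256-byte page (four 64-byte lines), initially zeroed.
page = bytes(256)
t1 = bytearray(page)
t1[0] = 1      # line 0
t1[130] = 7    # line 2
t2 = bytearray(page)
t2[8] = 2      # line 0, different byte than t1
t2[130] = 9    # line 2, same byte as t1
result = classify_sharing(page, {"t1": bytes(t1), "t2": bytes(t2)})
```

In this toy run, line 0 is reported as false sharing (two threads, disjoint bytes) and line 2 as true sharing (both threads wrote the same byte). A diff-based approach like this one cannot see a write that stores the value already present, and it says nothing about performance impact, which the abstract attributes to sampling.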
Rewriting a program to fix false sharing can be infeasible when source is unavailable, or undesirable when padding objects would unacceptably increase memory consumption or further worsen runtime performance. Sheriff-Protect mitigates false sharing by adaptively isolating shared updates from different threads into separate physical addresses, effectively eliminating most of the performance impact of false sharing. We show that Sheriff-Protect can improve performance for programs with catastrophic false sharing by up to 9×, without programmer intervention.
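The isolation idea can be modelled in a few lines. The sketch below is only an illustration under stated assumptions: it uses in-process byte arrays in place of separate processes and physical pages, it assumes updates are merged back by diffing against a snapshot, and it commits at arbitrary points chosen by the caller. The abstract does not specify any of these details, and `IsolatedPage` is a hypothetical helper.

```python
PAGE_SIZE = 4096  # bytes; assumed page size


class IsolatedPage:
    """A thread's private view of one shared page (hypothetical helper)."""

    def __init__(self, shared):
        self.shared = shared
        self.twin = bytes(shared)         # snapshot taken when isolation begins
        self.private = bytearray(shared)  # the thread writes here, not to `shared`

    def write(self, offset, value):
        self.private[offset] = value

    def commit(self):
        """Copy only the bytes this thread changed back to the shared page."""
        changed = 0
        for i, (old, new) in enumerate(zip(self.twin, self.private)):
            if old != new:
                self.shared[i] = new
                changed += 1
        # Start the next interval from the merged state.
        self.twin = bytes(self.shared)
        self.private = bytearray(self.shared)
        return changed


shared_page = bytearray(PAGE_SIZE)
a = IsolatedPage(shared_page)
b = IsolatedPage(shared_page)
a.write(0, 11)  # same cache line as offset 8, different bytes
b.write(8, 22)
before_commit = bytes(shared_page[:16])  # shared page is untouched so far
a.commit()
b.commit()
```

Between commits, neither writer touches memory the other is using, which is the property that removes cache-line contention in a real system where the private copies live on separate physical pages. Because each commit copies only the bytes that changed, the two updates to the same line merge without overwriting each other.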
SHERIFF: precise detection and automatic mitigation of false sharing