skip to main content
research-article

SHERIFF: precise detection and automatic mitigation of false sharing

Published:22 October 2011Publication History
Skip Abstract Section

Abstract

False sharing is an insidious problem for multithreaded programs running on multicore processors, where it can silently degrade performance and scalability. Previous tools for detecting false sharing are severely limited: they cannot distinguish false sharing from true sharing, have high false positive rates, and provide limited assistance to help programmers locate and resolve false sharing.

This paper presents two tools that attack the problem of false sharing: Sheriff-Detect and Sheriff-Protect. Both tools leverage a framework we introduce here called Sheriff. Sheriff breaks out threads into separate processes, and exposes an API that allows programs to perform per-thread memory isolation and tracking on a per-page basis. We believe Sheriff is of independent interest.

Sheriff-Detect finds instances of false sharing by comparing updates within the same cache lines by different threads, and uses sampling to rank them by performance impact. Sheriff-Detect is precise (no false positives), runs with low overhead (on average, 20%), and is accurate, pinpointing the exact objects involved in false sharing. We present a case study demonstrating Sheriff-Detect's effectiveness at locating false sharing in a variety of benchmarks.

Rewriting a program to fix false sharing can be infeasible when source is unavailable, or undesirable when padding objects would unacceptably increase memory consumption or further worsen runtime performance. Sheriff-Protect mitigates false sharing by adaptively isolating shared updates from different threads into separate physical addresses, effectively eliminating most of the performance impact of false sharing. We show that Sheriff-Protect can improve performance for programs with catastrophic false sharing by up to 9×, without programmer intervention.

References

  1. E. D. Berger, K. S. McKinley, R. D. Blumofe, and P. R. Wilson. Hoard: A scalable memory allocator for multithreaded applications. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX), pages 117--128, Cambridge, MA, Nov. 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. E. D. Berger, T. Yang, T. Liu, and G. Novark. Grace: safe multithreaded programming for C/CGoogle ScholarGoogle Scholar
  3. . In OOPSLA '09: Proceeding of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications, pages 81--96, New York, NY, USA, 2009. ACM.Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. E. D. Berger, B. G. Zorn, and K. S. McKinley. Composing high-performance memory allocators. In Proceedings of the 2001 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Snowbird, Utah, June 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. C. Bienia and K. Li. PARSEC 2.0: A new benchmark suite for chip-multiprocessors. In Proceedings of the 5th Annual Workshop on Modeling, Benchmarking and Simulation, June 2009.Google ScholarGoogle Scholar
  6. W. J. Bolosky and M. L. Scott. False sharing and its effect on shared memory performance. In SEDMS IV: USENIX Symposium on Experiences with Distributed and Multiprocessor Systems, pages 57--71, Berkeley, CA, USA, 1993. USENIX Association. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Implementation and performance of Munin. In SOSP '91: Proceedings of the thirteenth ACM symposium on Operating systems principles, pages 152--164, New York, NY, USA, 1991. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. J.-H. Chow and V. Sarkar. False sharing elimination by selection of runtime scheduling parameters. In ICPP '97: Proceedings of the international Conference on Parallel Processing, pages 396--403, Washington, DC, USA, 1997. IEEE Computer Society. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. M. Dubois, J. C. Wang, L. A. Barroso, K. Lee, and Y.-S. Chen. Delayed consistency and its effects on the miss rate of parallel programs. In Proceedings of the 1991 ACM/IEEE conference on Supercomputing, Supercomputing '91, pages 197--206, New York, NY, USA, 1991. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. S. L. Graham, P. B. Kessler, and M. K. Mckusick. Gprof: A call graph execution profiler. SIGPLAN Not., 17(6):120--126, 1982. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. S. M. Günther and J. Weidendorfer. Assessing cache false sharing effects by dynamic binary instrumentation. In WBIA '09: Proceedings of the Workshop on Binary Instrumentation and Applications, pages 26--33, New York, NY, USA, 2009. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. R. L. Hyde and B. D. Fleisch. An analysis of degenerate sharing and false coherence. J. Parallel Distrib. Comput., 34(2):183--195, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Intel Corporation. Intel Performance Tuning Utility 3.2 Update, November 2008.Google ScholarGoogle Scholar
  14. Intel Corporation. Avoiding and identifying false sharing among threads. http://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads/, February 2010.Google ScholarGoogle Scholar
  15. T. E. Jeremiassen and S. J. Eggers. Reducing false sharing on shared memory multiprocessors through compile time data transformations. In PPOPP '95: Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 179--188, New York, NY, USA, 1995. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. P. Keleher, A. L. Cox, S. Dwarkadas, and W. Zwaenepoel. TreadMarks: distributed shared memory on standard workstations and operating systems. In WTEC'94: Proceedings of the USENIX Winter 1994 Technical Conference on USENIX Winter 1994 Technical Conference, pages 10--10, Berkeley, CA, USA, 1994. USENIX Association. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. J. Larus and R. Rajwar. Transactional Memory (Synthesis Lectures on Computer Architecture). Morgan & Claypool Publishers, first edition, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. J. Levon. OProfile internals. http://oprofile.sourceforge.net/doc/internals/index.html, 2003.Google ScholarGoogle Scholar
  19. C.-L. Liu. False sharing analysis for multithreaded programs. Master's thesis, National Chung Cheng University, July 2009.Google ScholarGoogle Scholar
  20. M. Olszewski and S. Amarasinghe. Outfoxing the mammoths: PLDI 2010 FIT presentation, June 2010.Google ScholarGoogle Scholar
  21. A. Pesterev, N. Zeldovich, and R. T. Morris. Locating cache performance bottlenecks using data profiling. In EuroSys '10: Proceedings of the 5th European conference on Computer systems, pages 335--348, New York, NY, USA, 2010. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. In HPCA '07: Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture, pages 13--24, Washington, DC, USA, 2007. IEEE Computer Society. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. M. Schindewolf. Analysis of cache misses using SIMICS. Master's thesis, Institute for Computing Systems Architecture, University of Edinburgh, 2007.Google ScholarGoogle Scholar
  24. W. R. Stevens and S. A. Rago. Advanced Programming in the UNIX® Environment: Second Edition. Addison Wesley Professional, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. W. Xiong, S. Park, J. Zhang, Y. Zhou, and Z. Ma. Ad hoc synchronization considered harmful. In OSDI'10: Proceedings of the 9th Conference on Symposium on Opearting Systems Design & Implementation, pages 163--176, Berkeley, CA, USA, 2010. USENIX Association. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Q. Zhao, D. Koh, S. Raza, D. Bruening, W.-F. Wong, and S. Amarasinghe. Dynamic cache contention detection in multi-threaded applications. In The International Conference on Virtual Execution Environments, Newport Beach, CA, Mar 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. SHERIFF: precise detection and automatic mitigation of false sharing

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in

          Full Access

          • Published in

            cover image ACM SIGPLAN Notices
            ACM SIGPLAN Notices  Volume 46, Issue 10
            OOPSLA '11
            October 2011
            1063 pages
            ISSN:0362-1340
            EISSN:1558-1160
            DOI:10.1145/2076021
            Issue’s Table of Contents
            • cover image ACM Conferences
              OOPSLA '11: Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications
              October 2011
              1104 pages
              ISBN:9781450309400
              DOI:10.1145/2048066

            Copyright © 2011 ACM

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 22 October 2011

            Check for updates

            Qualifiers

            • research-article

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader
          About Cookies On This Site

          We use cookies to ensure that we give you the best experience on our website.

          Learn more

          Got it!