Abstract
Shared-memory parallel programs routinely suffer from false sharing, a performance degradation that arises when different threads access different variables residing on the same CPU cacheline and at least one of those variables is modified. State-of-the-art tools detect false sharing via a heavyweight process of logging memory accesses and feeding the resulting access traces to an offline cache simulator. We have developed Feather, a lightweight, on-the-fly false-sharing detection tool. Feather achieves low overhead by exploiting two hardware features ubiquitous in commodity CPUs: performance monitoring units (PMUs) and debug registers. Additionally, Feather is a first-of-its-kind tool that detects false sharing in multi-process applications that use shared memory. Feather allowed us to scale false-sharing detection to myriad codes. Feather detected several false-sharing cases in important multi-core and multi-process codes, including previous PPoPP artifacts. Eliminating false sharing resulted in dramatic (up to 16x) speedups.
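The definition of false sharing given above can be made concrete with a small sketch. The code below is not Feather's detection algorithm; it only encodes the definition for a single pair of memory accesses. The `Access` record, the `classify` function, and the 64-byte cacheline size are assumptions introduced for illustration, and each access is assumed to fit within one cacheline.

```python
from dataclasses import dataclass

# Assumption: 64-byte lines, typical of commodity x86 CPUs; not stated in the source.
CACHELINE_BYTES = 64


@dataclass(frozen=True)
class Access:
    """Hypothetical record of one memory access."""
    thread: int      # id of the accessing thread
    addr: int        # starting byte address
    size: int        # number of bytes touched
    is_write: bool   # True for a store, False for a load


def cacheline(addr: int) -> int:
    """Return the index of the cacheline that contains addr."""
    return addr // CACHELINE_BYTES


def classify(a: Access, b: Access) -> str:
    """Classify a pair of accesses as 'false', 'true', or 'none' sharing."""
    if a.thread == b.thread:
        return "none"   # sharing requires two different threads
    if not (a.is_write or b.is_write):
        return "none"   # two reads never invalidate each other's copy
    if cacheline(a.addr) != cacheline(b.addr):
        return "none"   # different cachelines do not interfere
    # Same line, different threads, at least one write:
    # overlapping bytes mean true sharing, disjoint bytes mean false sharing.
    overlap = a.addr < b.addr + b.size and b.addr < a.addr + a.size
    return "true" if overlap else "false"


writer = Access(thread=0, addr=0x1000, size=8, is_write=True)
neighbor = Access(thread=1, addr=0x1008, size=8, is_write=False)
padded = Access(thread=1, addr=0x1040, size=8, is_write=False)

print(classify(writer, neighbor))  # false
print(classify(writer, padded))    # none
```

The second call shows the usual remedy: once the neighboring variable is moved (for example, by padding) onto its own cacheline, the pair no longer interferes.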
Featherlight on-the-fly false-sharing detection
PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming