Featherlight on-the-fly false-sharing detection

Published: 10 February 2018

Abstract

Shared-memory parallel programs routinely suffer from false sharing, a performance degradation that occurs when different threads access different variables residing on the same CPU cache line and at least one of those variables is modified. State-of-the-art tools detect false sharing through a heavyweight process: they log memory accesses and feed the resulting access traces to an offline cache simulator. We have developed Feather, a lightweight, on-the-fly false-sharing detection tool. Feather achieves low overhead by exploiting two hardware features ubiquitous in commodity CPUs: performance monitoring units (PMUs) and debug registers. Additionally, Feather is a first-of-its-kind tool to detect false sharing in multi-process applications that use shared memory. Feather allowed us to scale false-sharing detection to myriad codes. Feather detected several false-sharing cases in important multi-core and multi-process codes, including previous PPoPP artifacts. Eliminating false sharing resulted in dramatic (up to 16x) speedups.
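
The definition of false sharing given above can be made concrete with a small sketch. The code below is not Feather's implementation (which, per the abstract, relies on PMUs and debug registers); it only encodes the stated conditions as a predicate over two memory accesses. The 64-byte line size, the `Access` record, and the function names are illustrative assumptions, and each access is assumed to fit within a single cache line.

```python
from dataclasses import dataclass

# Assumption: 64-byte cache lines, common on commodity x86 CPUs.
# The abstract does not state a line size.
CACHE_LINE_BYTES = 64


@dataclass(frozen=True)
class Access:
    """Hypothetical record of one memory access (illustrative only)."""
    thread_id: int
    address: int    # address of the first byte touched
    size: int       # number of bytes touched
    is_write: bool


def cache_line(address: int) -> int:
    """Index of the cache line that contains the given byte address."""
    return address // CACHE_LINE_BYTES


def overlaps(a: Access, b: Access) -> bool:
    """True if the two accesses touch at least one common byte."""
    return a.address < b.address + b.size and b.address < a.address + a.size


def classify(a: Access, b: Access) -> str:
    """Classify a pair of accesses as 'false sharing', 'true sharing', or 'none'."""
    # Sharing needs two different threads touching the same cache line.
    if a.thread_id == b.thread_id:
        return "none"
    if cache_line(a.address) != cache_line(b.address):
        return "none"
    # Read-only sharing does not invalidate the line in other caches.
    if not (a.is_write or b.is_write):
        return "none"
    # Same bytes: the threads really communicate. Different bytes: false sharing.
    return "true sharing" if overlaps(a, b) else "false sharing"
```

For example, a write by thread 0 to 8 bytes at address `0x1000` and a read by thread 1 of 8 bytes at `0x1008` land on the same 64-byte line without overlapping, so `classify` reports false sharing for that pair. Moving the second variable to `0x1040`, the next line, removes the conflict.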
