What Scalable Programs Need from Transactional Memory

Published: 04 April 2017

Abstract

Transactional memory (TM) has been the focus of numerous studies, and it is supported in processors such as the IBM Blue Gene/Q and Intel Haswell. Many studies have used the STAMP benchmark suite to evaluate their designs. However, the speedups obtained for the STAMP benchmarks on all TM systems we know of are quite limited; for example, with 64 threads on the IBM Blue Gene/Q, we observe a median speedup of 1.4X using the Blue Gene/Q hardware transactional memory (HTM), and a median speedup of 4.1X using a software transactional memory (STM).

What limits the performance of these benchmarks on TMs? In this paper, we argue that the problem lies with the programming model and data structures used to write them. To make this point, we articulate two principles that we believe must be embodied in any scalable program and argue that STAMP programs violate both of them. By modifying the STAMP programs to satisfy both principles, we produce a new set of programs that we call the Stampede suite. Its median speedup on the Blue Gene/Q is 8.0X when using an STM. The two principles also permit us to simplify the TM design. Using the resulting simplified STM with the Stampede benchmarks, we obtain a median speedup of 17.7X with 64 threads on the Blue Gene/Q and 13.2X with 32 threads on an Intel Westmere system.

These results suggest that HTM and STM designs will benefit if more attention is paid to the division of labor between application programs, systems software, and hardware.
