Abstract
Transactional memory (TM) has been the focus of numerous studies, and it is supported in processors such as the IBM Blue Gene/Q and Intel Haswell. Many studies have used the STAMP benchmark suite to evaluate their designs. However, the speedups obtained for the STAMP benchmarks on all TM systems we know of are quite limited; for example, with 64 threads on the IBM Blue Gene/Q, we observe a median speedup of 1.4X using the Blue Gene/Q hardware transactional memory (HTM), and a median speedup of 4.1X using a software transactional memory (STM).
What limits the performance of these benchmarks on TMs? In this paper, we argue that the problem lies with the programming model and data structures used to write them. To make this point, we articulate two principles that we believe must be embodied in any scalable program and argue that STAMP programs violate both of them. By modifying the STAMP programs to satisfy both principles, we produce a new set of programs that we call the Stampede suite. Its median speedup on the Blue Gene/Q is 8.0X when using an STM. The two principles also permit us to simplify the TM design. Using the resulting simplified STM with the Stampede benchmarks, we obtain a median speedup of 17.7X with 64 threads on the Blue Gene/Q and 13.2X with 32 threads on an Intel Westmere system.
These results suggest that HTM and STM designs will benefit if more attention is paid to the division of labor between application programs, systems software, and hardware.
What Scalable Programs Need from Transactional Memory
ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems