Abstract
Current parallelizing compilers can tackle applications with regular access patterns on arrays or affine indices, where data dependencies can be expressed in a linear form. Unfortunately, there are cases in which independence between statements cannot be guaranteed, and the compiler must conservatively produce sequential code. Programs that make extensive use of pointers, exhibit irregular access patterns, or contain loops with an unknown number of iterations are examples of such cases. This limits the extraction of parallelism even when the dependencies in question are rarely or never triggered at runtime. Speculative parallelization refers to methods employed during program execution that aim to produce a valid parallel execution schedule for programs that resist static parallelization. The motivation for this article is to review recent developments in compiler-driven software speculation for thread-level parallelism and how they came about. The article is divided into two parts. The first part explains the fundamentals of speculative parallelization for thread-level parallelism and categorizes the design choices involved in implementing such systems: how speculative data is handled, how data dependence violations are detected and resolved, how correct data is made visible to other threads, and how speculative threads are scheduled. The second part is structured around those design choices, presenting the advances and trends in the literature with reference to key developments in the area. Although the focus of the article is on software speculative parallelization, a section is dedicated to providing the interested reader with pointers and references for exploring related topics such as hardware thread-level speculation, transactional memory, and automatic parallelization.
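The design choices named in the abstract — private buffering of speculative state, detection of data dependence violations, ordered commit of correct data, and squash-and-re-execute recovery — can be illustrated with a deliberately simplified sketch. The code below is not any particular system from the literature: the names (`speculative_loop`, `make_body`), the chunking policy, and the read-set/write-set validation scheme are all illustrative assumptions, and Python threads are used only to show the mechanism (the interpreter's GIL prevents real speedup).

```python
from concurrent.futures import ThreadPoolExecutor

def speculative_loop(a, n, body, num_chunks=4):
    # Partition the iteration space into contiguous chunks; each chunk is
    # a candidate speculative thread.
    bounds = [(k * n // num_chunks, (k + 1) * n // num_chunks)
              for k in range(num_chunks)]
    snapshot = list(a)  # committed state before the loop (lazy versioning)

    def run_chunk(bound):
        lo, hi = bound
        private = list(snapshot)   # chunk-private copy of speculative state
        reads, writes = set(), set()
        for i in range(lo, hi):
            body(private, i, reads, writes)
        return lo, hi, private, reads, writes

    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        results = list(pool.map(run_chunk, bounds))

    # Ordered validation and commit: a chunk's speculation is valid only if
    # it read no location written by an earlier chunk; otherwise it saw a
    # stale snapshot value (a flow-dependence violation) and is squashed.
    committed_writes = set()
    for lo, hi, private, reads, writes in results:
        if reads & committed_writes:
            redo_writes = set()
            for i in range(lo, hi):          # re-execute on committed state
                body(a, i, set(), redo_writes)
            committed_writes |= redo_writes
        else:
            for i in writes:                 # make correct data visible
                a[i] = private[i]
            committed_writes |= writes
    return a

def make_body(idx):
    # Loop body a[i] = a[idx[i]] + 1: whether iterations are independent
    # depends on the runtime contents of idx, so a static compiler must
    # conservatively serialize this loop.
    def body(arr, i, reads, writes):
        reads.add(idx[i])
        arr[i] = arr[idx[i]] + 1
        writes.add(i)
    return body
```

With `idx[i] = i` every chunk validates and commits speculatively; with `idx[i] = i - 1` the loop-carried dependence chain forces each later chunk to be squashed and re-executed in order. In both cases the result matches sequential execution, which is the correctness criterion any speculative scheme must preserve.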
- David I. August, Daniel A. Connors, Scott A. Mahlke, John W. Sias, Kevin M. Crozier, Ben-Chung Cheng, Patrick R. Eaton, Qudus B. Olaniran, and Wen-mei W. Hwu. 1998. Integrated predicated and speculative execution in the IMPACT EPIC architecture. In Proceedings of the International Symposium on Computer Architecture (ISCA), 227--237.
- David F. Bacon, Susan L. Graham, and Oliver J. Sharp. 1994. Compiler transformations for high-performance computing. Computing Surveys 26, 4 (1994), 345--420.
- Hans-J. Boehm. 1996. Simple garbage-collector-safety. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI), 89--98.
- Matthew Bridges, Neil Vachharajani, Yun Zhang, Thomas Jablin, and David August. 2007. Revisiting the sequential programming model for multi-core. In Proceedings of the International Symposium on Microarchitecture (MICRO), 69--84.
- Matthew Bridges. 2008. The VELOCITY Compiler: Extracting Efficient Multicore Execution from Legacy Sequential Codes. Technical Report. Princeton University.
- Derek Bruening, Srikrishna Devabhaktuni, and Saman Amarasinghe. 2000. Softspec: Software-based speculative parallelism. In Workshop on Feedback-Directed and Dynamic Optimization (FDDO).
- Luis Ceze, James Tuck, Josep Torrellas, and Calin Cascaval. 2006. Bulk disambiguation of speculative threads in multiprocessors. In Proceedings of the 33rd Annual International Symposium on Computer Architecture (ISCA), 227--238.
- Ding-Kai Chen, Josep Torrellas, and Pen-Chung Yew. 1994. An efficient algorithm for the run-time parallelization of DOACROSS loops. In Proceedings of the International Conference on Supercomputing (ICS), 518--527.
- Michael K. Chen and Kunle Olukotun. 2003. The Jrpm system for dynamically parallelizing Java programs. In Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA), 434--446.
- Marcelo Cintra and Diego Llanos. 2005. Design space exploration of a software speculative parallelization scheme. IEEE Transactions on Parallel and Distributed Systems 16, 6 (2005), 562--576.
- Marcelo Cintra and Diego R. Llanos. 2003. Toward efficient and robust software speculative parallelization on multiprocessors. In Proceedings of the 9th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP).
- Marcelo Cintra, José F. Martínez, and Josep Torrellas. 2000. Architectural support for scalable speculative parallelization in shared-memory multiprocessors. In Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA), 13--24.
- Ron Cytron. 1986. DOACROSS: Beyond vectorization for multiprocessors. In Proceedings of the International Conference on Parallel Processing (ICPP), 836--844.
- Francis Dang, Hao Yu, and Lawrence Rauchwerger. 2002. The R-LRPD test: Speculative parallelization of partially parallel loops. In Proceedings of the 16th International Parallel and Distributed Processing Symposium (IPDPS), 20--29.
- Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. In Proceedings of the Symposium on Operating Systems Design and Implementation (OSDI).
- Chen Ding, Xipeng Shen, Kirk Kelsey, Chris Tice, Ruke Huang, and Chengliang Zhang. 2007. Software behavior oriented parallelization. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI), 223--234.
- María Jesús Garzarán, Milos Prvulovic, José María Llabería, Víctor Viñals, Lawrence Rauchwerger, and Josep Torrellas. 2005. Tradeoffs in buffering speculative memory state for thread-level speculation in multiprocessors. ACM Transactions on Architecture and Code Optimization 2, 3 (September 2005), 247--279.
- Sridhar Gopal, T. Vijaykumar, James Smith, and Gurindar Sohi. 1998. Speculative versioning cache. In Proceedings of the 4th International Symposium on High-Performance Computer Architecture (HPCA), 195--215.
- Manish Gupta and Rahul Nim. 1998. Techniques for speculative run-time parallelization of loops. In Proceedings of the ACM/IEEE Conference on Supercomputing (SC), 1--12.
- Apache Hadoop. 2005. http://hadoop.apache.org/. Accessed February 2, 2015.
- Lance Hammond, Mark Willey, and Kunle Olukotun. 1998. Data speculation support for a chip multiprocessor. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, 58--69.
- Tim Harris, James Larus, and Ravi Rajwar. 2010. Transactional Memory (2nd ed.). Morgan & Claypool Publishers.
- Tim Harris, Mark Plesko, Avraham Shinnar, and David Tarditi. 2006. Optimizing memory transactions. In Proceedings of the ACM Conference on Programming Language Design and Implementation (PLDI), 14--25.
- Maurice Herlihy and J. Eliot B. Moss. 1993. Transactional memory: Architectural support for lock-free data structures. In Proceedings of the 20th Annual International Symposium on Computer Architecture (ISCA), 289--300.
- Jason Howard, Saurabh Dighe, Yatin Hoskote, Sriram Vangal, David Finan, Gregory Ruhl, David Jenkins, Howard Wilson, Nitin Borkar, Gerhard Schrom, Fabrice Pailet, Shailendra Jain, Tiju Jacob, Satish Yada, Sraven Marella, Praveen Salihundam, Vasantha Erraguntla, Michael Konow, Michael Riepen, Guido Droege, Joerg Lindemann, Matthias Gries, Thomas Apel, Kersten Henriss, Tor Lund-Larsen, Sebastian Steibl, Shekhar Borkar, Vivek De, Rob Van Der Wijngaart, and Timothy Mattson. 2010. A 48-core IA-32 message-passing processor with DVFS in 45nm CMOS. In Proceedings of the International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 108--109.
- Shiwen Hu, Ravi Bhargava, and Lizy Kurian John. 2003. The role of return value prediction in exploiting speculative method-level parallelism. Journal of Instruction-Level Parallelism 5 (2003), 1--21.
- Nikolas Ioannou, Jeremy Singer, Salman Khan, Paraskevas Yiapanis, Adam Pocock, Polychronis Xekalakis, Gavin Brown, Mikel Luján, Ian Watson, and Marcelo Cintra. 2010. Toward a more accurate understanding of the limits of the TLS execution paradigm. In Proceedings of the IEEE International Symposium on Workload Characterization.
- Nick P. Johnson, Hanjun Kim, Prakash Prabhu, Ayal Zaks, and David I. August. 2012. Speculative separation for privatization and reductions. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI), 359--370.
- Troy A. Johnson, Rudolf Eigenmann, and T. N. Vijaykumar. 2004. Min-cut program decomposition for thread-level speculation. In Proceedings of the International Conference on Programming Language Design and Implementation (PLDI), 59--70.
- Ken Kennedy and John R. Allen. 2002. Optimizing Compilers for Modern Architectures: A Dependence-Based Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA.
- Hanjun Kim, Arun Raman, Feng Liu, Jae W. Lee, and David I. August. 2010. Scalable speculative parallelization on commodity clusters. In Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 3--14.
- Wei Liu, James Tuck, Luis Ceze, Wonsun Ahn, Karin Strauss, Jose Renau, and Josep Torrellas. 2006. POSH: A TLS compiler that exploits program structure. In Proceedings of the International Symposium on Principles and Practice of Parallel Programming (PPoPP), 158--167.
- Mikel Luján, Phyllis Gustafson, Michael Paleczny, and Christopher A. Vick. 2007. Speculative parallelization—Eliminating the overhead of failure. In Proceedings of the 3rd International Conference on High Performance Computing and Communications (HPCC), 460--471.
- Clifford Lynch. 2008. Big data: How do your data grow? Nature 455, 7209 (2008), 28--29.
- Pedro Marcuello and Antonio González. 1999. Exploiting speculative thread-level parallelism on a SMT processor. In Proceedings of the International Conference on High-Performance Computing and Networking, 754--763.
- Pedro Marcuello and Antonio González. 2002. Thread-spawning schemes for speculative multithreading. In Proceedings of the 8th International Symposium on High-Performance Computer Architecture (HPCA), 55--67.
- Jan Kasper Martinsen and Hakan Grahn. 2011. A methodology for evaluating JavaScript execution behavior in interactive web applications. In Proceedings of the International Conference on Computer Systems and Applications (AICCSA), 241--248.
- Jan Martinsen, Hakan Grahn, and Anders Isberg. 2013. Using speculation to enhance JavaScript performance in Web applications. IEEE Internet Computing 17, 2 (2013), 10--19.
- Mojtaba Mehrara, Jeff Hao, Po-Chun Hsu, and Scott Mahlke. 2009. Parallelizing sequential applications on commodity hardware using a low-cost software transactional memory. In Proceedings of the ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI), 166--176.
- Samuel P. Midkiff. 2012. Automatic Parallelization: An Overview of Fundamental Compiler Techniques. Morgan & Claypool Publishers.
- Erik M. Nystrom, Hong-Seok Kim, and Wen-Mei W. Hwu. 2004. Bottom-up and top-down context-sensitive summary-based pointer analysis. In Proceedings of the 11th Static Analysis Symposium (SAS), 165--180.
- Cosmin Oancea, Alan Mycroft, and Tim Harris. 2009. A lightweight in-place implementation for software thread-level speculation. In Proceedings of the 21st Annual Symposium on Parallelism in Algorithms and Architectures (SPAA), 223--232.
- Jeffrey Oplinger, David Heine, Shih Liao, Basem A. Nayfeh, Monica S. Lam, and Kunle Olukotun. 1997. Software and Hardware for Exploiting Speculative Parallelism with a Multiprocessor. Technical Report CSL-TR-97-715. Stanford University.
- Guilherme Ottoni, Ram Rangan, Adam Stoler, and David I. August. 2005a. Automatic thread extraction with decoupled software pipelining. In Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 105--118.
- Guilherme Ottoni, Ram Rangan, Adam Stoler, and David I. August. 2005b. Automatic thread extraction with decoupled software pipelining. In Proceedings of the International Symposium on Microarchitecture (MICRO), 105--118.
- Christopher J. F. Pickett and Clark Verbrugge. 2006. Software thread level speculation for the Java language and virtual machine environment. In Proceedings of the International Conference on Languages and Compilers for Parallel Computing (LCPC), 304--318.
- Manohar K. Prabhu and Kunle Olukotun. 2005. Exposing speculative thread parallelism in SPEC2000. In Proceedings of the Symposium on Principles and Practice of Parallel Programming (PPoPP), 142--152.
- Arun Raman, Hanjun Kim, Thomas R. Mason, Thomas B. Jablin, and David I. August. 2010. Speculative parallelization using software multi-threaded transactions. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 65--76.
- Easwaran Raman, Guilherme Ottoni, Arun Raman, Matthew J. Bridges, and David I. August. 2008. Parallel-stage decoupled software pipelining. In Proceedings of the 6th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 114--123.
- Ram Rangan, Neil Vachharajani, Guilherme Ottoni, and David I. August. 2008. Performance scalability of decoupled software pipelining. ACM Transactions on Architecture and Code Optimization 5, 2, Article 8 (2008), 8:1--8:25.
- Ram Rangan, Neil Vachharajani, Manish Vachharajani, and David I. August. 2004. Decoupled software pipelining with the synchronization array. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), 177--188.
- Paruj Ratanaworabhan, Benjamin Livshits, and Benjamin G. Zorn. 2010. JSMeter: Comparing the behavior of JavaScript benchmarks with real Web applications. In Proceedings of the 2010 USENIX Conference on Web Application Development (WebApps).
- Lawrence Rauchwerger. 1998. Run-time parallelization: Its time has come. Parallel Computing 24, 3--4 (1998), 527--556.
- Lawrence Rauchwerger and David Padua. 1994a. Speculative Run-Time Parallelization of Loops. Technical Report CSRD-827. Center for Supercomputing Research and Development, University of Illinois.
- Lawrence Rauchwerger and David Padua. 1994b. The privatizing DOALL test: A run-time technique for DOALL loop identification and array privatization. In Proceedings of the 8th International Conference on Supercomputing (ICS), 33--43.
- Lawrence Rauchwerger and David Padua. 1995. The LRPD test: Speculative run-time parallelization of loops with privatization and reduction parallelization. In Proceedings of the International Conference on Programming Language Design and Implementation (PLDI), 218--232.
- Jose Renau, Karin Strauss, Luis Ceze, Wei Liu, Smruti Sarangi, James Tuck, and Josep Torrellas. 2005a. Thread-level speculation on a CMP can be energy efficient. In Proceedings of the International Conference on Supercomputing (ICS), 219--228.
- Jose Renau, James Tuck, Wei Liu, Luis Ceze, Karin Strauss, and Josep Torrellas. 2005b. Tasking with out-of-order spawn in TLS chip multiprocessors: Microarchitecture and compilation. In Proceedings of the International Conference on Supercomputing (ICS), 179--188.
- Gregor Richards, Sylvain Lebresne, Brian Burg, and Jan Vitek. 2010. An analysis of the dynamic behavior of JavaScript programs. In Proceedings of the International Conference on Programming Language Design and Implementation (PLDI), 1--12.
- Peter Rundberg and Per Stenström. 2001. An all-software thread-level data dependence speculation system for multiprocessors. Journal of Instruction-Level Parallelism 3 (2001), 1--28.
- Bratin Saha, Ali-Reza Adl-Tabatabai, Richard L. Hudson, Chi Cao Minh, and Benjamin Hertzberg. 2006. McRT-STM: A high performance software transactional memory system for a multi-core runtime. In Proceedings of the 11th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 187--197.
- Joel H. Saltz and Ravi Mirchandaney. 1991. The preprocessed doacross loop. In Proceedings of the International Conference on Parallel Processing (ICPP), 174--178.
- Joel H. Saltz, Ravi Mirchandaney, and Kay Crowley. 1989. The doconsider loop. In Proceedings of the International Conference on Supercomputing (ICS), 29--40.
- Joel H. Saltz, Ravi Mirchandaney, and Kay Crowley. 1991. Run-time parallelization and scheduling of loops. IEEE Transactions on Computers 40, 5 (1991), 603--612.
- Michael F. Spear. 2010. Lightweight, robust adaptivity for software transactional memory. In Proceedings of the 22nd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 273--283.
- Michael F. Spear, Virendra J. Marathe, William N. Scherer, and Michael L. Scott. 2006. Conflict detection and validation strategies for software transactional memory. In Proceedings of the International Conference on Distributed Computing (DISC), 179--193.
- Gregory Steffan. 2003. Hardware Support for Thread-Level Speculation. Doctoral dissertation. Carnegie Mellon University, Pittsburgh, PA.
- Gregory Steffan, Christopher Colohan, Antonia Zhai, and Todd C. Mowry. 2005. The STAMPede approach to thread-level speculation. ACM Transactions on Computer Systems 23, 3 (2005), 253--300.
- Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry. 2000. A scalable approach to thread-level speculation. In Proceedings of the International Symposium on Computer Architecture (ISCA), 1--12.
- Peiyi Tang and Pen-Chung Yew. 1986. Processor self-scheduling for multiple nested parallel loops. In Proceedings of the International Conference on Parallel Processing (ICPP), 528--535.
- William Thies, Vikram Chandrasekhar, and Saman Amarasinghe. 2007. A practical approach to exploiting coarse-grained pipeline parallelism in C programs. In Proceedings of the International Symposium on Microarchitecture (MICRO), 356--369.
- Chen Tian, Min Feng, and Rajiv Gupta. 2010. Supporting speculative parallelization in the presence of dynamic data structures. In Proceedings of the International Conference on Programming Language Design and Implementation (PLDI), 62--73.
- Chen Tian, Min Feng, Vijay Nagarajan, and Rajiv Gupta. 2008. Copy or discard execution model for speculative parallelization on multicores. In Proceedings of the 41st Annual International Symposium on Microarchitecture (MICRO), 330--341.
- Neil Vachharajani, Ram Rangan, Easwaran Raman, Matthew J. Bridges, Guilherme Ottoni, and David I. August. 2007. Speculative decoupled software pipelining. In Proceedings of the International Conference on Parallel Architecture and Compilation Techniques (PACT), 49--59.
- Amy Wang, Matthew Gaudet, Peng Wu, Jose Amaral, Martin Ohmacht, Christopher Barton, Raul Silvera, and Maged Michael. 2012. Evaluation of Blue Gene/Q hardware support for transactional memories. In Proceedings of the International Conference on Parallel Architecture and Compilation Techniques (PACT), 127--136.
- Paraskevas Yiapanis. 2013. High Performance Optimizations in Runtime Speculative Parallelization for Multicore Architectures. Ph.D. dissertation. School of Computer Science, University of Manchester.
- Paraskevas Yiapanis, Demian Rosas-Ham, Gavin Brown, and Mikel Luján. 2013. Optimizing software runtime systems for speculative parallelization. ACM Transactions on Architecture and Code Optimization 9, 4, Article 39 (2013), 39:1--39:27.
- Chenggang Zhang, Guodong Han, and Cho-Li Wang. 2013. GPU-TLS: An efficient runtime for speculative loop parallelization on GPUs. In Proceedings of the International Symposium on Cluster, Cloud, and Grid Computing (CCGRID), 120--127.
- Hongtao Zhong, Mojtaba Mehrara, Steven A. Lieberman, and Scott A. Mahlke. 2008. Uncovering hidden loop level parallelism in sequential applications. In Proceedings of the International Conference on High-Performance Computer Architecture (HPCA), 290--301.
- Chuan-Qi Zhu and Pen-Chung Yew. 1987. A scheme to enforce data dependence on large multiprocessor systems. IEEE Transactions on Software Engineering 13, 6 (1987), 726--739.
Index Terms
Compiler-Driven Software Speculation for Thread-Level Parallelism