Abstract
The demand for thread-level parallelism (TLP) on commodity processors is endless, as TLP is essential for gaining performance and saving energy. However, TLP in today's programs is limited by dependences that must be satisfied at run time. We have found that, for nondeterministic programs, some of these actual dependences can be satisfied with alternative data that can be generated in parallel, thus boosting the program's TLP. Satisfying these dependences with alternative data nonetheless produces final outputs that match those of the original nondeterministic program. To demonstrate the practicality of our technique, we describe the design, implementation, and evaluation of our compilers, autotuner, profiler, and runtime, which are enabled by our proposed C++ programming language extensions. The resulting system boosts the performance of six well-known nondeterministic, multi-threaded benchmarks by 158.2% (geometric mean) on a 28-core Intel-based platform.
Unconventional Parallelization of Nondeterministic Applications
ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems.