Abstract
We present Lerna, an end-to-end tool that automatically and transparently detects and extracts parallelism from data-dependent sequential loops. Lerna combines speculation with a set of techniques that includes code profiling, dependency analysis, instrumentation, and adaptive execution. Speculation lets Lerna avoid overly conservative decisions and detect only the conflicts that actually occur. Lerna targets applications that are hard to parallelize because of data dependencies. Our experimental study parallelizes 13 applications with data dependencies. On a 24-core machine, the results show an average speedup of 2.7× for micro-benchmarks and 2.5× for macro-benchmarks.
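To make the idea of speculating on a data-dependent loop concrete, the following is a minimal sketch and not Lerna's implementation. It assumes a simplified model in which each iteration runs as a transaction that buffers its reads and writes, iterations commit in program order, and an iteration is re-executed only when an earlier iteration in the same batch wrote something it read. The names `Txn`, `run_speculative`, `bump`, and `KEYS` are hypothetical, and the speculative phase is simulated sequentially here, whereas a real runtime would run it on parallel threads.

```python
from typing import Callable, Dict, List, Set

Memory = Dict[str, int]


class Txn:
    """Buffers one iteration's reads and writes (hypothetical helper)."""

    def __init__(self, source: Memory) -> None:
        self.source = source            # memory image this iteration reads from
        self.read_set: Set[str] = set()
        self.writes: Memory = {}

    def read(self, key: str) -> int:
        if key in self.writes:          # read-your-own-write is never a conflict
            return self.writes[key]
        self.read_set.add(key)
        return self.source[key]

    def write(self, key: str, value: int) -> None:
        self.writes[key] = value        # buffered until commit


def run_speculative(body: Callable[[int, Txn], None], n_iters: int,
                    memory: Memory, batch: int = 4) -> int:
    """Runs body(i, txn) for every i; returns the number of re-executions."""
    aborts = 0
    for start in range(0, n_iters, batch):
        iters = range(start, min(start + batch, n_iters))
        snapshot = dict(memory)
        # Speculative phase: every iteration in the batch sees the same
        # snapshot, so these runs are independent of one another.
        txns: List[Txn] = []
        for i in iters:
            txn = Txn(snapshot)
            body(i, txn)
            txns.append(txn)
        # Commit phase: program order preserves sequential semantics.
        dirty: Set[str] = set()         # keys written by earlier commits
        for i, txn in zip(iters, txns):
            if txn.read_set & dirty:    # actual conflict: a stale read
                aborts += 1
                txn = Txn(memory)       # re-execute on up-to-date memory
                body(i, txn)
            memory.update(txn.writes)
            dirty.update(txn.writes)
    return aborts


KEYS = [0, 1, 2, 3, 0, 0, 1, 2]         # illustrative bucket index per iteration


def bump(i: int, txn: Txn) -> None:
    key = f"b{KEYS[i]}"
    txn.write(key, txn.read(key) + 1)   # loop-carried dependency when keys repeat


memory = {f"b{k}": 0 for k in range(4)}
aborts = run_speculative(bump, len(KEYS), memory)
print(memory, aborts)
```

In this model only iterations that really collide pay for re-execution. A conservative analysis would have to serialize the whole loop, because any two iterations might touch the same bucket.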
Lerna: Parallelizing Dependent Loops Using Speculation