
Lerna: Parallelizing Dependent Loops Using Speculation

Published: 22 March 2019

Abstract

We present Lerna, an end-to-end tool that automatically and transparently detects and extracts parallelism from data-dependent sequential loops. Lerna combines speculation with a set of techniques that includes code profiling, dependency analysis, instrumentation, and adaptive execution. Speculation is needed to avoid overly conservative decisions and to detect the conflicts that actually occur at run time. Lerna targets applications that are hard to parallelize because of data dependencies. Our experimental study involves the parallelization of 13 applications with data dependencies. Results on a 24-core machine show an average speedup of 2.7× for the micro-benchmarks and 2.5× for the macro-benchmarks.
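To make the idea of speculation on a data-dependent loop concrete, the following is a minimal, single-threaded sketch of one common scheme: iterations in a batch run against a snapshot, each recording its reads and buffering its writes, and the batch then commits in iteration order, re-executing any iteration whose reads have become stale. This is an illustration under stated assumptions, not Lerna's implementation; the names `Tx`, `run_speculative`, `run_sequential`, and `make_histogram_body`, the value-based validation, and the batch size are all hypothetical choices made for the example.

```python
from typing import Callable, Dict, List

Memory = Dict[str, int]


class Tx:
    """Hypothetical transaction: records reads and buffers writes of one iteration."""

    def __init__(self, source: Memory) -> None:
        self.source = source              # memory image this iteration reads from
        self.reads: Dict[str, int] = {}   # key -> value observed on first read
        self.writes: Dict[str, int] = {}  # key -> buffered new value

    def read(self, key: str) -> int:
        if key in self.writes:            # an iteration sees its own writes
            return self.writes[key]
        value = self.source[key]
        self.reads.setdefault(key, value)
        return value

    def write(self, key: str, value: int) -> None:
        self.writes[key] = value


Body = Callable[[int, Tx], None]


def run_sequential(body: Body, n_iters: int, memory: Memory) -> None:
    """Reference execution: one iteration at a time, in program order."""
    for i in range(n_iters):
        tx = Tx(memory)
        body(i, tx)
        memory.update(tx.writes)


def run_speculative(body: Body, n_iters: int, memory: Memory, batch: int = 4) -> int:
    """Run iterations in speculative batches; return the number of aborts.

    The batch is simulated on one thread. In a real system the iterations of
    a batch would run concurrently on different cores.
    """
    aborts = 0
    for start in range(0, n_iters, batch):
        indices = range(start, min(start + batch, n_iters))
        snapshot = dict(memory)           # state all iterations of the batch start from
        txs: List[Tx] = []
        for i in indices:                 # speculative phase: no shared writes
            tx = Tx(snapshot)
            body(i, tx)
            txs.append(tx)
        for i, tx in zip(indices, txs):   # commit phase: strictly in iteration order
            stale = any(memory[k] != v for k, v in tx.reads.items())
            if stale:                     # an earlier iteration changed what we read
                aborts += 1
                tx = Tx(memory)           # re-execute against the committed state
                body(i, tx)
            memory.update(tx.writes)
    return aborts


def make_histogram_body(data: List[int]) -> Body:
    """Loop body with a data-dependent access: bins[data[i]] += 1."""
    def body(i: int, tx: Tx) -> None:
        key = f"bin{data[i]}"
        tx.write(key, tx.read(key) + 1)
    return body


if __name__ == "__main__":
    data = [0, 1, 2, 3, 0, 0, 1, 2]
    bins = {f"bin{b}": 0 for b in range(4)}
    aborts = run_speculative(make_histogram_body(data), len(data), bins, batch=4)
    print(bins, "aborts:", aborts)
```

A static analysis would have to treat every iteration of the histogram loop as potentially dependent, because the bin index is only known at run time. In this sketch only the iterations that actually touch the same bin within a batch are re-executed, and committing in iteration order keeps the final state identical to the sequential result.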

Published in

ACM Transactions on Storage, Volume 15, Issue 1
Special Issue on ACM International Systems and Storage Conference (SYSTOR) 2018
February 2019
194 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/3311821
Editor: Sam H. Noh

Copyright © 2019 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 22 March 2019
• Revised: 1 January 2019
• Accepted: 1 January 2019
• Received: 1 November 2018

Qualifiers

• research-article
• Research
• Refereed
