research-article

SWOOP: software-hardware co-design for non-speculative, execute-ahead, in-order cores

Published: 11 June 2018
Abstract

Increasing demands for energy efficiency constrain emerging hardware. These new hardware trends challenge the established assumptions in code generation and force us to rethink existing software optimization techniques. We propose a cross-layer redesign of the way compilers and the underlying microarchitecture are built and interact, to achieve both performance and high energy efficiency.

In this paper, we address one of the main performance bottlenecks, last-level cache misses, through a software-hardware co-design. Our approach hides memory latency and attains increased memory- and instruction-level parallelism by orchestrating a non-speculative, execute-ahead paradigm in software (SWOOP). While out-of-order (OoO) architectures attempt to hide memory latency by dynamically reordering instructions, they do so through expensive, power-hungry, speculative mechanisms. We aim to shift this complexity into software, and we build upon compilation techniques inherited from VLIW, software pipelining, modulo scheduling, decoupled access-execution, and software prefetching. In contrast to previous approaches, we do not rely on either software or hardware speculation, which can be detrimental to efficiency. Our SWOOP compiler is enhanced with lightweight architectural support, and is thus able to transform applications that include highly complex control flow and indirect memory accesses.
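As a rough illustration of the decoupled access-execute idea the abstract builds on, the C sketch below splits an indirect-access loop into an access phase that issues software prefetches and an execute phase that does the actual work. This is not the SWOOP compiler's output or its hardware mechanism: the function names, the `CHUNK` size, and the use of the GCC/Clang builtin `__builtin_prefetch` are assumptions made for this example only. The prefetch addresses are computed from real, in-bounds index loads, so the sketch involves no speculation.

```c
#include <stddef.h>

/* Hypothetical chunk size: how many iterations the access phase runs ahead. */
#define CHUNK 8

/* Baseline loop: each a[idx[i]] is an indirect access that may miss in the cache. */
static long sum_indirect(const long *a, const int *idx, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[idx[i]];
    return s;
}

/* Decoupled sketch: for each chunk, first prefetch every needed element
 * (access phase), then consume the elements (execute phase). */
static long sum_decoupled(const long *a, const int *idx, size_t n) {
    long s = 0;
    for (size_t base = 0; base < n; base += CHUNK) {
        size_t end = (n - base < CHUNK) ? n : base + CHUNK;

        /* Access phase: compute addresses and request the data early. */
        for (size_t i = base; i < end; i++)
            __builtin_prefetch(&a[idx[i]], 0 /* read */, 1 /* low temporal locality */);

        /* Execute phase: the loads below should now hit more often. */
        for (size_t i = base; i < end; i++)
            s += a[idx[i]];
    }
    return s;
}
```

Both functions return the same result; only the order in which memory requests are issued differs. Whether the decoupled form is faster depends on the cache hierarchy and on how much independent work the access phase can overlap.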

Supplemental Material

p328-tran.webm

Published in

ACM SIGPLAN Notices, Volume 53, Issue 4 (PLDI '18), April 2018, 834 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/3296979

PLDI 2018: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2018, 825 pages
ISBN: 9781450356985
DOI: 10.1145/3192366

Copyright © 2018 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
