Abstract
Increasing demands for energy efficiency constrain emerging hardware. These new hardware trends challenge the established assumptions in code generation and force us to rethink existing software optimization techniques. We propose a cross-layer redesign of the way compilers and the underlying microarchitecture are built and interact, to achieve both performance and high energy efficiency.
In this paper, we address one of the main performance bottlenecks, last-level cache misses, through a software-hardware co-design. Our approach hides memory latency and attains increased memory- and instruction-level parallelism by orchestrating a non-speculative, execute-ahead paradigm in software (SWOOP). While out-of-order (OoO) architectures attempt to hide memory latency by dynamically reordering instructions, they do so through expensive, power-hungry, speculative mechanisms. We aim to shift this complexity into software, and we build upon compilation techniques inherited from VLIW, software pipelining, modulo scheduling, decoupled access-execution, and software prefetching. In contrast to previous approaches, we do not rely on either software or hardware speculation, which can be detrimental to efficiency. Our SWOOP compiler is enhanced with lightweight architectural support and can therefore transform applications that include highly complex control flow and indirect memory accesses.
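To make the decoupled access-execution idea concrete, the following is a minimal sketch in C and not the SWOOP compiler's actual output. It splits a loop with indirect accesses into an access phase, which only issues prefetches for addresses the loop will definitely use, and an execute phase, which performs the computation. The names `sum_baseline`, `sum_decoupled`, and `CHUNK` are hypothetical, the chunk size is an arbitrary assumption, and `__builtin_prefetch` is the GCC/Clang builtin rather than anything defined by the paper.

```c
#include <stddef.h>

/* Hypothetical chunk size: iterations covered by one access phase. */
enum { CHUNK = 64 };

/* Baseline: every iteration may stall on the indirect load data[idx[i]]. */
long sum_baseline(const long *data, const size_t *idx, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[idx[i]];
    return sum;
}

/* Decoupled sketch: for each chunk, an access phase prefetches the
 * addresses the chunk will read, then an execute phase does the work.
 * No speculation is involved: the prefetched addresses are exactly
 * those the execute phase goes on to load. */
long sum_decoupled(const long *data, const size_t *idx, size_t n) {
    long sum = 0;
    for (size_t base = 0; base < n; base += CHUNK) {
        size_t end = (n - base < CHUNK) ? n : base + CHUNK;

        /* Access phase: compute addresses and request the cache lines. */
        for (size_t i = base; i < end; i++)
            __builtin_prefetch(&data[idx[i]], 0, 3);

        /* Execute phase: the loads now have a better chance of hitting. */
        for (size_t i = base; i < end; i++)
            sum += data[idx[i]];
    }
    return sum;
}
```

Both functions return the same value for any valid input; only the order in which memory requests are issued differs. Whether the decoupled form is actually faster depends on the cache hierarchy and the chosen chunk size.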
SWOOP: software-hardware co-design for non-speculative, execute-ahead, in-order cores
PLDI 2018: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation