Abstract
Hardware specialization has become a promising paradigm for overcoming the inefficiencies of general-purpose microprocessors. Of significant interest are Behavioral Specialized Accelerators (BSAs), which are designed to efficiently execute only code with certain properties, but remain largely configurable or programmable. The most important strength of BSAs, their ability to target a wide variety of codes, also makes their interactions and analysis complex, raising the following questions: can multiple BSAs be composed synergistically, how do they interact with the general-purpose core, and which combinations favor which workloads? BSAs are also challenging from a methodological standpoint, as each requires ISA development, compiler and assembler extensions, and either a simulator or an RTL model.
To study the potential of BSAs, we propose a novel modeling technique called the Transformable Dependence Graph (TDG), a higher-level alternative to the time-consuming traditional compiler+simulator approach that still enables detailed microarchitectural models for both general-purpose cores and accelerators. We then propose a multi-BSA organization, called ExoCore, which we model and study using the TDG. A design space exploration reveals that an ExoCore organization can push designs beyond the established energy-performance frontiers for general-purpose cores. For example, a 2-wide OOO processor with three BSAs matches the performance of a conventional 6-wide OOO core, occupies 40% less area, and is 2.6x more energy efficient.
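The general idea of estimating performance from a dependence graph and then rewriting that graph to reflect an accelerator can be illustrated with a toy sketch. This is not the TDG as the authors built it: the node names, the latencies, the `transform` helper, and the rule that scales latencies inside an accelerated region are all hypothetical, chosen only to show how a graph rewrite changes the critical path.

```python
from collections import defaultdict

def critical_path(nodes, edges):
    """Length in cycles of the longest path through a DAG.

    nodes: node names listed in topological order.
    edges: (src, dst, latency) tuples.
    """
    finish = {n: 0 for n in nodes}
    succ = defaultdict(list)
    for src, dst, lat in edges:
        succ[src].append((dst, lat))
    # Relax edges in topological order to get longest-path distances.
    for n in nodes:
        for dst, lat in succ[n]:
            finish[dst] = max(finish[dst], finish[n] + lat)
    return max(finish.values())

def transform(edges, region, speedup):
    """Hypothetical rewrite: edges fully inside `region` get a shorter latency."""
    out = []
    for src, dst, lat in edges:
        if src in region and dst in region:
            lat = max(1, lat // speedup)  # assumed accelerator effect
        out.append((src, dst, lat))
    return out

# Toy graph: four dependent operations with made-up latencies.
NODES = ["load", "mul", "add", "store"]
EDGES = [
    ("load", "mul", 4),
    ("mul", "add", 3),
    ("add", "store", 1),
    ("load", "add", 4),
]

baseline = critical_path(NODES, EDGES)
accelerated = critical_path(NODES, transform(EDGES, {"mul", "add"}, 3))
print(baseline, accelerated)
```

In this sketch the same graph is evaluated twice, once unchanged and once after the rewrite, so the two design points can be compared without rebuilding a compiler or simulator for each.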
Analyzing Behavior Specialized Acceleration
ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems