Abstract
There has been recent interest in exploring the acceleration of nonvectorizable workloads with spatially programmed architectures that are designed to efficiently exploit pipeline parallelism. Such an architecture faces two main problems: how to efficiently control each processing element (PE) in the system, and how to facilitate inter-PE communication without the overheads of traditional coherent shared memory. In this article, we explore solving these problems using triggered instructions and latency-insensitive channels. Triggered instructions completely eliminate the program counter (PC) and allow programs to transition concisely between states without explicit branch instructions. Latency-insensitive channels allow efficient communication of inter-PE control information while simultaneously enabling flexible code placement and improving tolerance for variable-latency events such as cache accesses. Together, these approaches provide a unified mechanism to avoid overserialized execution, essentially achieving the effect of techniques such as dynamic instruction reordering and multithreading.
Our analysis shows that a spatial accelerator using triggered instructions and latency-insensitive channels can achieve 8× greater area-normalized performance than a traditional general-purpose processor. Further analysis shows that triggered control reduces the number of static and dynamic instructions in the critical paths by 62% and 64%, respectively, over a PC-style baseline, increasing the performance of the spatial programming approach by 2.0×.
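To make the control model in the abstract concrete, the following is a minimal Python sketch of a single PE that merges two sorted streams using guarded ("triggered") instructions and bounded FIFO channels. It is an illustration under stated assumptions, not the authors' instruction set or microarchitecture: the names `Channel`, `MERGE_PROGRAM`, `step`, and `run_merge`, the first-match priority scheduler, the default channel capacity of 2, and the use of infinity as an end-of-stream marker are all hypothetical choices made for this example.

```python
from collections import deque

EOS = float("inf")  # assumed end-of-stream marker; real inputs must be finite


class Channel:
    """Bounded FIFO standing in for a latency-insensitive channel."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.items = deque()

    def can_deq(self):
        return len(self.items) > 0

    def can_enq(self):
        return len(self.items) < self.capacity

    def head(self):
        return self.items[0]

    def deq(self):
        return self.items.popleft()

    def enq(self, value):
        self.items.append(value)


def compare(pe):
    # Writes predicates only; no data moves and no branch is taken.
    pe["a_first"] = pe["a"].head() <= pe["b"].head()
    pe["compared"] = True


def send_a(pe):
    pe["out"].enq(pe["a"].deq())
    pe["compared"] = False


def send_b(pe):
    pe["out"].enq(pe["b"].deq())
    pe["compared"] = False


# Each entry is (name, trigger, action). The trigger reads predicate
# state and channel status; there is no program counter anywhere.
MERGE_PROGRAM = [
    ("compare",
     lambda pe: not pe["compared"] and pe["a"].can_deq() and pe["b"].can_deq(),
     compare),
    ("send_a",
     lambda pe: pe["compared"] and pe["a_first"] and pe["out"].can_enq(),
     send_a),
    ("send_b",
     lambda pe: pe["compared"] and not pe["a_first"] and pe["out"].can_enq(),
     send_b),
]


def step(pe, program):
    """Fire the first instruction whose trigger holds (assumed priority rule)."""
    for name, trigger, action in program:
        if trigger(pe):
            action(pe)
            return name
    return None  # nothing ready this cycle: the PE simply waits


def run_merge(xs, ys, capacity=2):
    """Drive the PE with producers and a consumer that respect backpressure."""
    pe = {"a": Channel(capacity), "b": Channel(capacity),
          "out": Channel(capacity), "compared": False, "a_first": False}
    pending_a = deque(list(xs) + [EOS])
    pending_b = deque(list(ys) + [EOS])
    result = []
    while True:
        if pending_a and pe["a"].can_enq():
            pe["a"].enq(pending_a.popleft())
        if pending_b and pe["b"].can_enq():
            pe["b"].enq(pending_b.popleft())
        if pe["out"].can_deq():
            value = pe["out"].deq()
            if value == EOS:
                return result
            result.append(value)
        step(pe, MERGE_PROGRAM)
```

In this sketch, `run_merge([1, 4, 9], [2, 3, 10])` returns `[1, 2, 3, 4, 9, 10]`. The state transitions that a PC-based program would express with compare-and-branch sequences appear here only as predicate updates, and the PE stalls naturally whenever an input channel is empty or the output channel is full, regardless of how long the producer or consumer takes.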
Index Terms
Efficient Control and Communication Paradigms for Coarse-Grained Spatial Architectures