research-article

Efficient Control and Communication Paradigms for Coarse-Grained Spatial Architectures

Published: 11 September 2015

Abstract

There has been recent interest in exploring the acceleration of nonvectorizable workloads with spatially programmed architectures that are designed to efficiently exploit pipeline parallelism. Such an architecture faces two main problems: how to efficiently control each processing element (PE) in the system, and how to facilitate inter-PE communication without the overheads of traditional coherent shared memory. In this article, we explore solving these problems using triggered instructions and latency-insensitive channels. Triggered instructions completely eliminate the program counter (PC) and allow programs to transition concisely between states without explicit branch instructions. Latency-insensitive channels allow efficient communication of inter-PE control information while simultaneously enabling flexible code placement and improving tolerance for variable events such as cache accesses. Together, these approaches provide a unified mechanism to avoid overserialized execution, essentially achieving the effect of techniques such as dynamic instruction reordering and multithreading.

Our analysis shows that a spatial accelerator using triggered instructions and latency-insensitive channels can achieve 8× greater area-normalized performance than a traditional general-purpose processor. Further analysis shows that triggered control reduces the number of static and dynamic instructions in the critical paths by 62% and 64%, respectively, over a PC-style baseline, increasing the performance of the spatial programming approach by 2.0×.
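
The two mechanisms named in the abstract can be illustrated with a small software model. The sketch below is not the authors' instruction set or simulator. It is a minimal illustration, assuming a hypothetical `Channel` class (a bounded FIFO standing in for a latency-insensitive channel), a hypothetical `TriggeredInstruction` record (a guard plus an action), and a hypothetical `ProcessingElement` that fires at most one ready instruction per cycle. The `EOS` sentinel, the fixed priority order among triggers, and the choice to deliver the second stream only every third cycle are all assumptions made for the example. The PE merges two sorted streams without a program counter or branch instructions: each instruction fires whenever its trigger holds.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, List

EOS = float("inf")  # assumed end-of-stream sentinel, larger than any data value


class Channel:
    """Bounded FIFO: a stand-in for a latency-insensitive channel."""

    def __init__(self, capacity: int = 2) -> None:
        self.capacity = capacity
        self.queue = deque()

    def can_read(self) -> bool:
        return len(self.queue) > 0

    def can_write(self) -> bool:
        return len(self.queue) < self.capacity

    def peek(self):
        return self.queue[0]

    def read(self):
        return self.queue.popleft()

    def write(self, value) -> None:
        self.queue.append(value)


@dataclass
class TriggeredInstruction:
    name: str
    trigger: Callable[[], bool]  # guard over channel status and PE state
    action: Callable[[], None]   # operation performed when the guard holds


class ProcessingElement:
    """Fires at most one ready instruction per cycle; there is no PC."""

    def __init__(self, instructions: List[TriggeredInstruction]) -> None:
        self.instructions = instructions

    def step(self) -> bool:
        # Assumption: list order acts as a fixed priority among ready triggers.
        for inst in self.instructions:
            if inst.trigger():
                inst.action()
                return True
        return False  # nothing ready: the PE simply waits this cycle


def merge_streams(xs, ys, capacity: int = 2):
    """Merge two sorted sequences using one triggered-instruction PE."""
    a, b, out = Channel(capacity), Channel(capacity), Channel(capacity)
    state = {"done": False}

    def both_ready() -> bool:
        return a.can_read() and b.can_read() and out.can_write()

    def finish() -> None:
        a.read()
        b.read()
        state["done"] = True

    pe = ProcessingElement([
        TriggeredInstruction(
            "take_a",
            lambda: both_ready() and a.peek() != EOS and a.peek() <= b.peek(),
            lambda: out.write(a.read())),
        TriggeredInstruction(
            "take_b",
            lambda: both_ready() and b.peek() < a.peek(),
            lambda: out.write(b.read())),
        TriggeredInstruction(
            "finish",
            lambda: (not state["done"] and a.can_read() and b.can_read()
                     and a.peek() == EOS and b.peek() == EOS),
            finish),
    ])

    pending_a = deque(list(xs) + [EOS])
    pending_b = deque(list(ys) + [EOS])
    result = []
    cycle = 0
    while not (state["done"] and not out.can_read()):
        if pending_a and a.can_write():
            a.write(pending_a.popleft())
        # Stream b arrives only every third cycle to mimic variable latency.
        if pending_b and b.can_write() and cycle % 3 == 0:
            b.write(pending_b.popleft())
        pe.step()
        if out.can_read():
            result.append(out.read())
        cycle += 1
    return result


if __name__ == "__main__":
    print(merge_streams([1, 4, 9], [2, 3, 10, 11]))
```

On cycles where the slower stream has not delivered a value, no trigger holds and the PE idles without any explicit polling loop or branch. The merged result does not depend on when values arrive, which is the property the channels are meant to provide.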

    • Published in

      ACM Transactions on Computer Systems  Volume 33, Issue 3
      September 2015
      140 pages
      ISSN: 0734-2071
      EISSN: 1557-7333
      DOI: 10.1145/2818727
      Copyright © 2015 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 11 September 2015
      • Accepted: 1 March 2015
      • Received: 1 December 2014

      Qualifiers

      • research-article
      • Research
      • Refereed