Abstract
Although custom (and reconfigurable) computing can provide orders-of-magnitude improvements in energy efficiency and performance for many numeric, data-parallel applications, performance on nonnumeric, sequential code is often worse than conventional superscalar processors. This work attempts to improve sequential performance in custom hardware by (a) switching from a statically scheduled to a dynamically scheduled (dataflow) execution model and (b) developing a new compiler IR for high-level synthesis—the value state flow graph (VSFG)—that enables aggressive exposition of ILP even in the presence of complex control flow. Compared to existing control-data flow graph (CDFG)-based IRs, the VSFG exposes more instruction-level parallelism from control-intensive sequential code by exploiting aggressive speculation, enabling control dependence analysis, as well as execution along multiple flows of control. This new IR is directly implemented as a static-dataflow graph in hardware by our prototype high-level synthesis tool chain and shows an average speedup of 1.13× over equivalent hardware generated using LegUp, an existing CDFG-based HLS tool. Furthermore, the VSFG allows us to further trade area and energy for performance through loop unrolling, increasing the average speedup to 1.55×, with a peak speedup of 4.05×. Our VSFG-based hardware approaches the sequential cycle counts of an Intel Nehalem Core i7 processor while consuming only 0.25× the energy of an in-order Altera Nios IIf processor.
- Volker Baumgarte, Gerd Ehlers, Frank May, Armin Nückel, Martin Vorbach, and Markus Weinhardt. 2003. PACT XPP—a self-reconfigurable data processing architecture. Journal of Supercomputing 26, 2, 167--184. Google Scholar
Digital Library
- Geoffrey Blake, Ronald G. Dreslinski, Trevor Mudge, and Krisztián Flautner. 2010. Evolution of thread-level parallelism in desktop applications. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA’10). 302--313. Google Scholar
Digital Library
- Mihai Budiu. 2003. Spatial Computation. Ph.D. Dissertation. Computer Science Department, Carnegie Mellon University, Pittsburgh, PA. http://www.cs.cmu.edu/∼mihaib/research/thesis.pdf Technical report CMU-CS-03-217 Google Scholar
Digital Library
- Mihai Budiu, Pedro V. Artigas, and Seth C. Goldstein. 2005. Dataflow: A complement to superscalar. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’05). 177--186. Google Scholar
Digital Library
- Mihai Budiu, Girish Venkataramani, Tiberiu Chelcea, and Seth Copen Goldstein. 2004. Spatial computation. In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XI). 14--26. Google Scholar
Digital Library
- Doug Burger, Stephen W. Keckler, Kathryn S. McKinley, Mike Dahlin, Lizy K. John, Calvin Lin, Charles R. Moore, James Burrill, Robert G. McDonald, William Yoder, and the TRIPS Team. 2004. Scaling to the end of silicon with edge architectures. Computer 37, 7, 44--55. Google Scholar
Digital Library
- Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona, Jason H. Anderson, Stephen Brown, and Tomasz Czajkowski. 2011. LegUp: High-level synthesis for FPGA-based processor/accelerator systems. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA’11). 33--36. Google Scholar
Digital Library
- Trevor E. Carlson, Wim Heirman, and Lieven Eeckhout. 2011. Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulations. In Proceedings of the International Conference on High Performance Computing, Networking, Storage, and Analysis (SC’11). Google Scholar
Digital Library
- Philippe Coussy and Adam Morawiec. 2008. High-Level Synthesis: From Algorithm to Digital Circuit. Springer. Google Scholar
Digital Library
- Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA’11). 365--376. Google Scholar
Digital Library
- Hagen Gädke and Andreas Koch. 2008. Accelerating speculative execution in high-level synthesis with cancel tokens. In Proceedings of the 4th International Workshop on Reconfigurable Computing: Architectures, Tools, and Applications (ARC’08). Springer-Verlag, Berlin, 185--195. Google Scholar
Digital Library
- Guang R. Gao. 1989. Algorithmic aspects of balancing techniques for pipelined data flow code generation. Journal of Parallel and Distributed Computing 6, 1, 39--61. Google Scholar
Digital Library
- Guang R. Gao. 1991. Algorithmic aspects of pipeline balancing. In A Code Mapping Scheme for Dataflow Software Pipelining. The Kluwer International Series in Engineering and Computer Science, Vol. 125. Springer, 41--59.Google Scholar
- Venkatraman Govindaraju, Chen-Han Ho, and Karthikeyan Sankaralingam. 2011. Dynamically specialized datapaths for energy efficient computing. In Proceedings of the 17th International Conference on High Performance Computer Architecture (HPCA’11). Google Scholar
Digital Library
- Ed Grochowski and Murali Annavaram. 2006. Energy Per Instruction Trends in Intel® Microprocessors. Retrieved November 10, 2015, from http://www.intel.com/pressroom/kits/core2duo/pdf/epi-trends-final2.pdf.Google Scholar
- Sumit Gupta, Nikil Dutt, Rajesh Gupta, and Alexandru Nicolau. 2004a. Loop shifting and compaction for the high-level synthesis of designs with complex control flow. In Proceedings of the Conference on Design, Automation, and Test in Europe—Volume 1 (DATE’04). 114--119. Google Scholar
Digital Library
- Sumit Gupta, Rajesh Kumar Gupta, Nikil D. Dutt, and Alexandru Nicolau. 2004b. Coordinated parallelizing compiler optimizations and high-level synthesis. ACM Transactions on Design Automation of Electronic Systems 9, 4, 441--470. Google Scholar
Digital Library
- Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. 2010. Understanding sources of inefficiency in general-purpose chips. In Proceedings of the 37th International Symposium on Computer Architecture (ISCA’10). 37--47. Google Scholar
Digital Library
- Yuko Hara, Hiroyuki Tomiyama, Shinya Honda, Hiroaki Takada, and Katsuya Ishii. 2008. CHStone: A benchmark program suite for practical C-based high-level synthesis. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS’08). 1192--1195.Google Scholar
- Chao Huang, Srivaths Ravi, Anand Raghunathan, and Niraj K. Jha. 2007. Generation of heterogeneous distributed architectures for memory-intensive applications through high-level synthesis. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 15, 11, 1191--1204. Google Scholar
Digital Library
- Xiangyang Liang, Minh Nguyen, and Hao Che. 2013. Wimpy or brawny cores: A throughput perspective. Journal of Parallel and Distributed Computing 73, 10, 1351--1361. Google Scholar
Digital Library
- Wen-Mei Hwu, Shane Ryoo, Sain-Zee Ueng, John H. Kelm, Isaac Gelado, Sam S. Stone, Robert E. Kidd, Sara S. Baghsorkhi, Aqeel A. Mahesri, Stephanie C. Tsao, Nacho Navarro, Steve S. Lumetta, Matthew I. Frank, and Sanjay J. Patel. 2007. Implicitly parallel programming models for thousand-core microprocessors. In Proceedings of the 44th Annual Design Automation Conference (DAC’07). 754--759. Google Scholar
Digital Library
- Neil Johnson and Alan Mycroft. 2003. Combined code motion and register allocation using the value state dependence graph. In Proceedings of the 12th International Conference on Compiler Construction (CC’03). 1--16. Google Scholar
Digital Library
- Changkyu Kim, Simha Sethumadhavan, Madhu S. Govindan, Nitya Ranganathan, Divya Gulati, Doug Burger, and Stephen W. Keckler. 2007. Composable lightweight processors. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 40). 381--394. Google Scholar
Digital Library
- Srikanth Kurra, Neeraj Kumar Singh, and Preeti Ranjan Panda. 2007. The impact of loop unrolling on controller delay in high level synthesis. In Proceedings of the Conference on Design, Automation, and Test in Europe (DATE’07). 391--396. Google Scholar
Digital Library
- Monica S. Lam and Robert P. Wilson. 1992. Limits of control flow on parallelism. In Proceedings of the 19th Annual International Symposium on Computer Architecture (ISCA’92). 46--57. Google Scholar
Digital Library
- Alan C. Lawrence. 2007. Optimizing Compilation with the Value State Dependence Graph. Technical Report UCAM-CL-TR-705. Computer Laboratory, University of Cambridge. http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-705.pdf.Google Scholar
- Edward A. Lee. 2006. The problem with threads. Computer 39, 5, 33--42. Google Scholar
Digital Library
- Scott A. Mahlke, David C. Lin, William Y. Chen, Richard E. Hank, and Roger A. Bringmann. 1992. Effective compiler support for predicated execution using the hyperblock. In Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO 25). 45--54. Google Scholar
Digital Library
- Jonathan Mak and Alan Mycroft. 2009. Limits of parallelism using dynamic dependence graphs. In Proceedings of the 7th International Workshop on Dynamic Analysis (WODA’09). 42--48. Google Scholar
Digital Library
- Daniel S. McFarlin, Charles Tucker, and Craig Zilles. 2013. Discerning the dominant out-of-order performance advantage: Is it speculation or dynamism? In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’13). 241--252. Google Scholar
Digital Library
- Mahim Mishra, Timothy J. Callahan, Tiberiu Chelcea, Girish Venkataramani, Seth C. Goldstein, and Mihai. Budiu. 2006. Tartan: Evaluating spatial computation for whole program execution. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York, NY, 163--174. Google Scholar
Digital Library
- Rishiyur Nikhil. 2004. Bluespec system verilog: Efficient, correct RTL from high level specifications. In Proceedings of the 2nd ACM and IEEE International Conference on Formal Methods and Models for Co-Design (MEMOCODE’04). 69--70. DOI:http://dx.doi.org/10.1109/MEMCOD.2004.1459818Google Scholar
Digital Library
- Andrew R. Putnam, Dave Bennett, Eric Dellinger, Jeff Mason, and Prasanna Sundararajan. 2008. CHiMPS: A high-level compilation flow for hybrid CPU-FPGA architectures. In Proceedings of the 16th International ACM/SIGDA Symposium on Field Programmable Gate Arrays (FPGA’08). 261. Google Scholar
Digital Library
- Jack Sampson, Ganesh Venkatesh, Nathan Goulding-Hotta, Saturnino Garcia, Steven Swanson, and Michael Bedford Taylor. 2011. Efficient complex operators for irregular codes. In Proceedings of the Conference on High Performance Computing Architecture (HPCA’11). Google Scholar
Digital Library
- Karthikeyan Sankaralingam, Ramadass Nagarajan, Haiming Liu, Changkyu Kim, Jaehyuk Huh, Nitya Ranganathan, Doug Burger, Stephen W. Keckler, Robert G. McDonald, and Charles R. Moore. 2004. TRIPS: A polymorphous architecture for exploiting ILP, TLP, and DLP. ACM Transactions on Architecture and Code Optimization 1, 1, 62--93. Google Scholar
Digital Library
- Gurindar S. Sohi, Scott E. Breach, and T. N. Vijaykumar. 1995. Multiscalar processors. ACM SIGARCH Computer Architecture News 23, 2, 414--425. Google Scholar
Digital Library
- Steven Swanson, Andrew Schwerin, Martha Mercaldi, Andrew Petersen, Andrew Putnam, Ken Michelson, Mark Oskin, and Susan J. Eggers. 2007. The wavescalar architecture. ACM Transactions on Computer Systems 25, 2, Article No. 4. Google Scholar
Digital Library
- Girish Venkataramani, Tobias Bjerregaard, Tiberiu Chelcea, and Seth C. Goldstein. 2006. Hardware compilation of application-specific memory-access interconnect. IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems 25, 5, 756--771. Google Scholar
Digital Library
- Girish Venkatesh, Jack Sampson, Nathan Goulding, Saturnino Garcia, Vladyslav Bryksin, Jose Lugo-Martinez, Steven Swanson, and Michael B. Taylor. 2010. Conservation cores: Reducing the energy of mature computations. ACM SIGARCH Computer Architecture News 38, 1, 205--218. Google Scholar
Digital Library
- Paraskevas Yiapanis, Demian Rosas-Ham, Gavin Brown, and Mikel Luján. 2013. Optimizing software runtime systems for speculative parallelization. ACM Transactions on Architecture and Code Optimization 9, 4, Article No. 39. Google Scholar
Digital Library
- Ali Mustafa Zaidi. 2015. Accelerating Control-Flow Intensive Code in Spatial Hardware. Technical Report UCAM-CL-TR-870. Computer Laboratory, University of Cambridge. http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-870.pdf.Google Scholar
- Ali Mustafa Zaidi and David J. Greaves. 2014. A new dataflow compiler IR for accelerating control-intensive code in spatial hardware. In Proceedings of the 2014 IEEE International Parallel and Distributed Symposium Workshops (IPDPSW’14). 122--131. Google Scholar
Digital Library
Index Terms
Value State Flow Graph: A Dataflow Compiler IR for Accelerating Control-Intensive Code in Spatial Hardware
Recommendations
Compiler-Assisted Multiple Instruction Word Retry for VLIW Architectures
Very Long Instruction Word (VLIW) architectures can enhance performance by exploiting fine-grained instruction level parallelism. In this paper, we describe a compiler assisted multiple instruction word retry scheme for VLIW architectures. A read buffer ...
Tuning the continual flow pipeline architecture with virtual register renaming
Continual Flow Pipelines (CFPs) allow a processor core to process hundreds of in-flight instructions without increasing cycle-critical pipeline resources. When a load misses the data cache, CFP checkpoints the processor register state and then moves all ...
Exposing ILP in custom hardware with a dataflow compiler IR
PACT '13: Proceedings of the 22nd international conference on Parallel architectures and compilation techniques






Comments