
Value State Flow Graph: A Dataflow Compiler IR for Accelerating Control-Intensive Code in Spatial Hardware

Published: 04 December 2015

Abstract

Although custom (and reconfigurable) computing can provide orders-of-magnitude improvements in energy efficiency and performance for many numeric, data-parallel applications, performance on nonnumeric, sequential code is often worse than conventional superscalar processors. This work attempts to improve sequential performance in custom hardware by (a) switching from a statically scheduled to a dynamically scheduled (dataflow) execution model and (b) developing a new compiler IR for high-level synthesis—the value state flow graph (VSFG)—that enables aggressive exposure of ILP even in the presence of complex control flow. Compared to existing control-data flow graph (CDFG)-based IRs, the VSFG exposes more instruction-level parallelism from control-intensive sequential code by exploiting aggressive speculation, enabling control dependence analysis, as well as execution along multiple flows of control. This new IR is directly implemented as a static-dataflow graph in hardware by our prototype high-level synthesis tool chain and shows an average speedup of 1.13× over equivalent hardware generated using LegUp, an existing CDFG-based HLS tool. Furthermore, the VSFG allows us to further trade area and energy for performance through loop unrolling, increasing the average speedup to 1.55×, with a peak speedup of 4.05×. Our VSFG-based hardware approaches the sequential cycle counts of an Intel Nehalem Core i7 processor while consuming only 0.25× the energy of an in-order Altera Nios II/f processor.

References

  1. Volker Baumgarte, Gerd Ehlers, Frank May, Armin Nückel, Martin Vorbach, and Markus Weinhardt. 2003. PACT XPP—a self-reconfigurable data processing architecture. Journal of Supercomputing 26, 2, 167--184.
  2. Geoffrey Blake, Ronald G. Dreslinski, Trevor Mudge, and Krisztián Flautner. 2010. Evolution of thread-level parallelism in desktop applications. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA’10). 302--313.
  3. Mihai Budiu. 2003. Spatial Computation. Ph.D. Dissertation. Computer Science Department, Carnegie Mellon University, Pittsburgh, PA. Technical Report CMU-CS-03-217. http://www.cs.cmu.edu/~mihaib/research/thesis.pdf
  4. Mihai Budiu, Pedro V. Artigas, and Seth C. Goldstein. 2005. Dataflow: A complement to superscalar. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’05). 177--186.
  5. Mihai Budiu, Girish Venkataramani, Tiberiu Chelcea, and Seth Copen Goldstein. 2004. Spatial computation. In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XI). 14--26.
  6. Doug Burger, Stephen W. Keckler, Kathryn S. McKinley, Mike Dahlin, Lizy K. John, Calvin Lin, Charles R. Moore, James Burrill, Robert G. McDonald, William Yoder, and the TRIPS Team. 2004. Scaling to the end of silicon with EDGE architectures. Computer 37, 7, 44--55.
  7. Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona, Jason H. Anderson, Stephen Brown, and Tomasz Czajkowski. 2011. LegUp: High-level synthesis for FPGA-based processor/accelerator systems. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA’11). 33--36.
  8. Trevor E. Carlson, Wim Heirman, and Lieven Eeckhout. 2011. Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulations. In Proceedings of the International Conference on High Performance Computing, Networking, Storage, and Analysis (SC’11).
  9. Philippe Coussy and Adam Morawiec. 2008. High-Level Synthesis: From Algorithm to Digital Circuit. Springer.
  10. Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA’11). 365--376.
  11. Hagen Gädke and Andreas Koch. 2008. Accelerating speculative execution in high-level synthesis with cancel tokens. In Proceedings of the 4th International Workshop on Reconfigurable Computing: Architectures, Tools, and Applications (ARC’08). Springer-Verlag, Berlin, 185--195.
  12. Guang R. Gao. 1989. Algorithmic aspects of balancing techniques for pipelined data flow code generation. Journal of Parallel and Distributed Computing 6, 1, 39--61.
  13. Guang R. Gao. 1991. Algorithmic aspects of pipeline balancing. In A Code Mapping Scheme for Dataflow Software Pipelining. The Kluwer International Series in Engineering and Computer Science, Vol. 125. Springer, 41--59.
  14. Venkatraman Govindaraju, Chen-Han Ho, and Karthikeyan Sankaralingam. 2011. Dynamically specialized datapaths for energy efficient computing. In Proceedings of the 17th International Conference on High Performance Computer Architecture (HPCA’11).
  15. Ed Grochowski and Murali Annavaram. 2006. Energy Per Instruction Trends in Intel® Microprocessors. Retrieved November 10, 2015, from http://www.intel.com/pressroom/kits/core2duo/pdf/epi-trends-final2.pdf.
  16. Sumit Gupta, Nikil Dutt, Rajesh Gupta, and Alexandru Nicolau. 2004a. Loop shifting and compaction for the high-level synthesis of designs with complex control flow. In Proceedings of the Conference on Design, Automation, and Test in Europe—Volume 1 (DATE’04). 114--119.
  17. Sumit Gupta, Rajesh Kumar Gupta, Nikil D. Dutt, and Alexandru Nicolau. 2004b. Coordinated parallelizing compiler optimizations and high-level synthesis. ACM Transactions on Design Automation of Electronic Systems 9, 4, 441--470.
  18. Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. 2010. Understanding sources of inefficiency in general-purpose chips. In Proceedings of the 37th International Symposium on Computer Architecture (ISCA’10). 37--47.
  19. Yuko Hara, Hiroyuki Tomiyama, Shinya Honda, Hiroaki Takada, and Katsuya Ishii. 2008. CHStone: A benchmark program suite for practical C-based high-level synthesis. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS’08). 1192--1195.
  20. Chao Huang, Srivaths Ravi, Anand Raghunathan, and Niraj K. Jha. 2007. Generation of heterogeneous distributed architectures for memory-intensive applications through high-level synthesis. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 15, 11, 1191--1204.
  21. Xiangyang Liang, Minh Nguyen, and Hao Che. 2013. Wimpy or brawny cores: A throughput perspective. Journal of Parallel and Distributed Computing 73, 10, 1351--1361.
  22. Wen-Mei Hwu, Shane Ryoo, Sain-Zee Ueng, John H. Kelm, Isaac Gelado, Sam S. Stone, Robert E. Kidd, Sara S. Baghsorkhi, Aqeel A. Mahesri, Stephanie C. Tsao, Nacho Navarro, Steve S. Lumetta, Matthew I. Frank, and Sanjay J. Patel. 2007. Implicitly parallel programming models for thousand-core microprocessors. In Proceedings of the 44th Annual Design Automation Conference (DAC’07). 754--759.
  23. Neil Johnson and Alan Mycroft. 2003. Combined code motion and register allocation using the value state dependence graph. In Proceedings of the 12th International Conference on Compiler Construction (CC’03). 1--16.
  24. Changkyu Kim, Simha Sethumadhavan, Madhu S. Govindan, Nitya Ranganathan, Divya Gulati, Doug Burger, and Stephen W. Keckler. 2007. Composable lightweight processors. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 40). 381--394.
  25. Srikanth Kurra, Neeraj Kumar Singh, and Preeti Ranjan Panda. 2007. The impact of loop unrolling on controller delay in high level synthesis. In Proceedings of the Conference on Design, Automation, and Test in Europe (DATE’07). 391--396.
  26. Monica S. Lam and Robert P. Wilson. 1992. Limits of control flow on parallelism. In Proceedings of the 19th Annual International Symposium on Computer Architecture (ISCA’92). 46--57.
  27. Alan C. Lawrence. 2007. Optimizing Compilation with the Value State Dependence Graph. Technical Report UCAM-CL-TR-705. Computer Laboratory, University of Cambridge. http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-705.pdf
  28. Edward A. Lee. 2006. The problem with threads. Computer 39, 5, 33--42.
  29. Scott A. Mahlke, David C. Lin, William Y. Chen, Richard E. Hank, and Roger A. Bringmann. 1992. Effective compiler support for predicated execution using the hyperblock. In Proceedings of the 25th Annual International Symposium on Microarchitecture (MICRO 25). 45--54.
  30. Jonathan Mak and Alan Mycroft. 2009. Limits of parallelism using dynamic dependence graphs. In Proceedings of the 7th International Workshop on Dynamic Analysis (WODA’09). 42--48.
  31. Daniel S. McFarlin, Charles Tucker, and Craig Zilles. 2013. Discerning the dominant out-of-order performance advantage: Is it speculation or dynamism? In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’13). 241--252.
  32. Mahim Mishra, Timothy J. Callahan, Tiberiu Chelcea, Girish Venkataramani, Seth C. Goldstein, and Mihai Budiu. 2006. Tartan: Evaluating spatial computation for whole program execution. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York, NY, 163--174.
  33. Rishiyur Nikhil. 2004. Bluespec System Verilog: Efficient, correct RTL from high-level specifications. In Proceedings of the 2nd ACM and IEEE International Conference on Formal Methods and Models for Co-Design (MEMOCODE’04). 69--70. DOI:http://dx.doi.org/10.1109/MEMCOD.2004.1459818
  34. Andrew R. Putnam, Dave Bennett, Eric Dellinger, Jeff Mason, and Prasanna Sundararajan. 2008. CHiMPS: A high-level compilation flow for hybrid CPU-FPGA architectures. In Proceedings of the 16th International ACM/SIGDA Symposium on Field Programmable Gate Arrays (FPGA’08). 261.
  35. Jack Sampson, Ganesh Venkatesh, Nathan Goulding-Hotta, Saturnino Garcia, Steven Swanson, and Michael Bedford Taylor. 2011. Efficient complex operators for irregular codes. In Proceedings of the 17th International Conference on High Performance Computer Architecture (HPCA’11).
  36. Karthikeyan Sankaralingam, Ramadass Nagarajan, Haiming Liu, Changkyu Kim, Jaehyuk Huh, Nitya Ranganathan, Doug Burger, Stephen W. Keckler, Robert G. McDonald, and Charles R. Moore. 2004. TRIPS: A polymorphous architecture for exploiting ILP, TLP, and DLP. ACM Transactions on Architecture and Code Optimization 1, 1, 62--93.
  37. Gurindar S. Sohi, Scott E. Breach, and T. N. Vijaykumar. 1995. Multiscalar processors. ACM SIGARCH Computer Architecture News 23, 2, 414--425.
  38. Steven Swanson, Andrew Schwerin, Martha Mercaldi, Andrew Petersen, Andrew Putnam, Ken Michelson, Mark Oskin, and Susan J. Eggers. 2007. The WaveScalar architecture. ACM Transactions on Computer Systems 25, 2, Article No. 4.
  39. Girish Venkataramani, Tobias Bjerregaard, Tiberiu Chelcea, and Seth C. Goldstein. 2006. Hardware compilation of application-specific memory-access interconnect. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 25, 5, 756--771.
  40. Girish Venkatesh, Jack Sampson, Nathan Goulding, Saturnino Garcia, Vladyslav Bryksin, Jose Lugo-Martinez, Steven Swanson, and Michael B. Taylor. 2010. Conservation cores: Reducing the energy of mature computations. ACM SIGARCH Computer Architecture News 38, 1, 205--218.
  41. Paraskevas Yiapanis, Demian Rosas-Ham, Gavin Brown, and Mikel Luján. 2013. Optimizing software runtime systems for speculative parallelization. ACM Transactions on Architecture and Code Optimization 9, 4, Article No. 39.
  42. Ali Mustafa Zaidi. 2015. Accelerating Control-Flow Intensive Code in Spatial Hardware. Technical Report UCAM-CL-TR-870. Computer Laboratory, University of Cambridge. http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-870.pdf
  43. Ali Mustafa Zaidi and David J. Greaves. 2014. A new dataflow compiler IR for accelerating control-intensive code in spatial hardware. In Proceedings of the 2014 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW’14). 122--131.
