COSMOS: Coordination of High-Level Synthesis and Memory Optimization for Hardware Accelerators

Published: 27 September 2017

Abstract

Hardware accelerators are key to the efficiency and performance of system-on-chip (SoC) architectures. With high-level synthesis (HLS), designers can easily obtain several implementations of each component of a complex hardware accelerator, each with a different performance-cost trade-off. However, navigating this design space in search of the Pareto-optimal implementations at the system level is a hard optimization task. We present COSMOS, an automatic methodology for the design-space exploration (DSE) of complex accelerators that coordinates both HLS and memory optimization tools in a compositional way. First, thanks to the co-design of datapath and memory, COSMOS produces a large set of Pareto-optimal implementations for each component of the accelerator. Then, COSMOS leverages compositional design techniques to quickly converge to the desired trade-off point between cost and performance at the system level. When applied to the system-level design (SLD) of an accelerator for wide-area motion imagery (WAMI), COSMOS explores the design space as completely as an exhaustive search while reducing the number of invocations of the HLS tool by up to 14.6×.
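
To make the notion of a Pareto-optimal implementation concrete, here is a minimal sketch in Python. It is not the COSMOS algorithm, which the abstract does not detail. It only illustrates the general idea of keeping the implementations of a single component that no other implementation beats on both cost and performance. The function name `pareto_front`, the `(latency, area)` representation, and the sample numbers are all assumptions made up for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: (latency, area), both to be minimized.
Point = Tuple[float, float]


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, sorted by increasing latency."""
    front: List[Point] = []
    best_area = float("inf")
    # After sorting by latency (then area), a point is non-dominated
    # only if its area is strictly smaller than every area seen so far.
    for latency, area in sorted(set(points)):
        if area < best_area:
            front.append((latency, area))
            best_area = area
    return front


# Made-up candidate implementations of one component.
candidates = [(10.0, 5.0), (8.0, 7.0), (12.0, 4.0), (9.0, 9.0), (10.0, 6.0)]
front = pareto_front(candidates)
print(front)
```

In this sketch, `(9.0, 9.0)` is discarded because `(8.0, 7.0)` is both faster and smaller, and `(10.0, 6.0)` is discarded because `(10.0, 5.0)` has the same latency at lower area. A system-level exploration would then have to combine such per-component fronts, and that is the step the abstract says COSMOS handles compositionally.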

