Abstract
Architectural heterogeneity has proven to be an effective design paradigm for meeting the ever-increasing demand for computational power within tight energy budgets in virtually every computing domain. Programmable manycore accelerators are now widely used not only in high-performance computing systems but also in embedded devices, where they operate as coprocessors under the control of a general-purpose CPU (the host processor). Such powerful hardware architectures are paired with sophisticated software ecosystems composed of operating systems, programming models with their associated runtime engines, and increasingly complex user applications and related libraries. System modeling has always played a key role in early architectural exploration and in software development when the real hardware is not available. Efficiently exploring the huge HW/SW design space of these heterogeneous Systems on Chip (SoCs) calls for advanced full-system simulation methodologies and tools capable of assessing various metrics for the functional and nonfunctional properties of the target system. In this article, we describe VirtualSoC, a simulation tool targeting the full-system simulation of massively parallel heterogeneous SoCs. We also describe how VirtualSoC has been successfully adopted in several research projects.
Index Terms
VirtualSoC: A Research Tool for Modern MPSoCs