Abstract
Communication frequency rises with the growing complexity of emerging embedded applications and with the number of processors in multiprocessor SoC architectures. In this article, we address communication cost reduction during multithreaded code generation from partitioned Simulink models, helping designers optimize code and improve system performance. We first propose a technique that combines message aggregation with communication pipelining: it groups communications that share the same source and destination, and it runs communication and computation tasks in parallel. We then present a method that applies static analysis and dynamic emulation to allocate communication buffers efficiently, further reducing synchronization cost and increasing processor utilization. Cyclic dependencies in the mapped model may hinder the effectiveness of these two techniques. We therefore propose a set of optimizations: repartitioning with strongly connected threads to maximize the degree of communication reduction, and preprocessing strategies that use available delays in the model to reduce the number of communication channels that cannot be optimized. Experimental results show that the proposed optimizations improve throughput by 11% to 143%.
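As a rough illustration of the message aggregation idea, the following minimal Python sketch merges all transfers that share the same source and destination thread into a single message. It is not the article's implementation: the `Channel` record, the `aggregate` function, the thread names, and the byte sizes are all hypothetical, and the sketch ignores the data dependencies and cyclic dependencies that, according to the abstract, can limit how far real channels may be merged.

```python
from collections import defaultdict
from typing import NamedTuple


class Channel(NamedTuple):
    """One point-to-point transfer between two threads (hypothetical model)."""
    src: str      # sending thread
    dst: str      # receiving thread
    signal: str   # name of the signal carried
    size: int     # payload size in bytes


def aggregate(channels):
    """Merge channels that share the same (src, dst) pair into one message each."""
    groups = defaultdict(list)
    for ch in channels:
        groups[(ch.src, ch.dst)].append(ch)
    # Each aggregated message lists its signals and its total payload size.
    return {
        pair: {
            "signals": [c.signal for c in chans],
            "size": sum(c.size for c in chans),
        }
        for pair, chans in groups.items()
    }


# Illustrative data only: four transfers between three threads.
channels = [
    Channel("T0", "T1", "a", 4),
    Channel("T0", "T1", "b", 8),
    Channel("T1", "T2", "c", 4),
    Channel("T0", "T1", "d", 2),
]
messages = aggregate(channels)
print(len(channels), "transfers ->", len(messages), "aggregated messages")
```

In this toy input, the three transfers from `T0` to `T1` collapse into one message, so each thread pair pays the per-message synchronization cost once instead of once per signal.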
Index Terms
Communication Optimizations for Multithreaded Code Generation from Simulink Models