Abstract
Scaling the memory hierarchy is a major challenge as the number of cores in a multicore processor grows. Software Managed Multicore (SMM) architectures have emerged as one of the promising solutions. In an SMM architecture, there are no caches, and each core has only a local scratchpad memory [Banakar et al. 2002]. Because the local memory is usually small, large applications cannot be executed on it directly; the code and data of the task mapped to each core must be managed between global memory and local memory. This article addresses the problem of efficiently managing code on an SMM architecture. The primary requirement for generating efficient code assignments is a correct management cost model, which we provide by proposing a cost calculation graph. In addition, we develop two heuristics, CMSM (Code Mapping for Software Managed multicores) and CMSM_advanced, that yield efficient code management on the local scratchpad memory. Experimental results from executing applications from the MiBench suite [Guthaus et al. 2001] demonstrate that merely adopting the correct management cost calculation, even with previous code assignment schemes, improves performance by an average of 12%. Combining the correct management cost model with a more optimized code mapping algorithm, our heuristics reduce runtime in more than 80% of the cases, and by up to 20% on our set of benchmarks, compared to the state-of-the-art code assignment approach [Jung et al. 2010]. Compared with Integer Linear Programming (ILP) results, CMSM_advanced performs an average of 5% worse. We also simulate the benchmarks on a cache-based system and find that the code management overhead on an SMM core with our code management is much lower than the memory latency of a cache-based system.
References
- Federico Angiolini et al. 2004. A post-compiler approach to scratchpad mapping of code. In Proceedings of the 2004 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES’04). 259--267.
- Todd Austin, Eric Larson, and Dan Ernst. 2002. SimpleScalar: An infrastructure for computer system modeling. Computer 35, 2 (Feb. 2002), 59--67.
- Ke Bai, Di Lu, and Aviral Shrivastava. 2011a. Vector class on limited local memory (LLM) multi-core processors. In Proceedings of the 14th International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES’11). 215--224.
- Ke Bai, Aviral Shrivastava, and Saleel Kudchadker. 2011b. Stack data management for limited local memory (LLM) multi-core processors. In Proceedings of the 2011 IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP). 231--234.
- Ke Bai, Jing Lu, Aviral Shrivastava, and Bryce Holton. 2013. CMSM: An efficient and effective code management for software managed multicores. In 2013 IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS). 1--9.
- Ke Bai and Aviral Shrivastava. 2010. Heap data management for limited local memory (LLM) multi-core processors. In Proceedings of the 8th IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS’10). 317--326.
- Ke Bai and Aviral Shrivastava. 2013a. A software-only scheme for managing heap data on limited local memory (LLM) multicore processors. ACM Transactions on Embedded Computing Systems (TECS) 13, 1 (2013), 5.
- Ke Bai and Aviral Shrivastava. 2013b. Automatic and efficient heap data management for limited local memory multicore architectures. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE’13). 593--598.
- Michael A. Baker, Amrit Panda, Nikhil Ghadge, Aniruddha Kadne, and Karam S. Chatha. 2010. A performance model and code overlay generator for scratchpad enhanced embedded processors. In Proceedings of the 2010 IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS’10). 287--296.
- Rajeshwari Banakar, Stefan Steinke, Bo-Sik Lee, Mahesh Balakrishnan, and Peter Marwedel. 2002. Scratchpad memory: Design alternative for cache on-chip memory in embedded systems. In Proceedings of the 10th International Symposium on Hardware/Software Codesign. 73--78.
- Garo Bournoutian and Alex Orailoglu. 2011. Dynamic, multi-core cache coherence architecture for power-sensitive mobile processors. In Proceedings of CODES+ISSS. 89--98.
- Byn Choi, Rakesh Komuravelli, Hyojin Sung, Robert Smolinski, Nima Honarmand, Sarita V. Adve, Vikram S. Adve, Nicholas P. Carter, and Ching-Tsun Chou. 2011. DeNovo: Rethinking the memory hierarchy for disciplined parallelism. In Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques (PACT’11). 155--166.
- Benoît Dupont de Dinechin, Pierre Guironnet de Massas, Guillaume Lager, Clément Léger, Benjamin Orgogozo, Jérôme Reybert, and Thierry Strudel. 2013. A distributed run-time environment for the Kalray MPPA®-256 integrated manycore processor. Procedia Computer Science 18 (2013), 1654--1663.
- Bernhard Egger, Seungkyun Kim, Choonki Jang, Jaejin Lee, Sang Lyul Min, and Heonshik Shin. 2010. Scratchpad memory management techniques for code in embedded systems without an MMU. IEEE Transactions on Computers 59, 8 (2010).
- Bernhard Egger, Jaejin Lee, and Heonshik Shin. 2006. Scratchpad memory management for portable systems with a memory management unit. In Proceedings of the 6th ACM & IEEE International Conference on Embedded Software (EMSOFT’06). 321--330.
- Brian Flachs, Shigehiro Asano, Sang Dhong, Peter Hofstee, Gilles Gervais, Roy Kim, Tien Le, Peichun Liu, Jens Leenstra, John Liberty, Brad Michael, Hwa-Joon Oh, Silvia Melitta Mueller, Osamu Takahashi, Akiyuki Hatakeyama, Yukio Watanabe, Naoka Yano, Daniel A. Brokenshire, Mohammad Peyravian, VanDung To, and Eiji Iwata. 2006. The microarchitecture of the synergistic processor for a cell processor. IEEE Solid-State Circuits 41, 1 (2006), 63--70.
- Antonio García-Guirado, Ricardo Fernández-Pascual, Alberto Ros, and José M. García. 2011. Energy-efficient cache coherence protocols in chip-multiprocessors for server consolidation. In Proceedings of the 2011 International Conference on Parallel Processing (ICPP’11). 51--62.
- Matthew R. Guthaus, Jeffrey S. Ringenberg, Dan Ernst, Todd M. Austin, Trevor Mudge, and Richard B. Brown. 2001. MiBench: A free, commercially representative embedded benchmark suite. In Proceedings of Workload Characterization. 3--14.
- Bryce Holton, Ke Bai, Aviral Shrivastava, and Harini Ramaprasad. 2014. Construction of GCCFG for inter-procedural optimizations in software managed manycore (SMM) architectures. In 2014 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES’14). 1--10.
- IBM. 2006. Programmer’s Guide: Software Development Kit for Multicore Acceleration Version 3.1. Technical Report.
- Intel. 2010. Intel Core i7 processor extreme edition and Intel Core i7 processor datasheet, volume 1. In White paper. Intel.
- Intel. 2012. The SCC Programmer’s Guide. https://communities.intel.com/servlet/JiveServlet/previewBody/5684-102-8-22523/SCCProgrammersGuide.pdf. (2012).
- Andhi Janapsatya, Aleksandar Ignjatović, and Sri Parameswaran. 2006. A novel instruction scratchpad memory optimization method based on concomitance metric. In Proceedings of the Asia and South Pacific Conference on Design Automation (ASP-DAC). 612--617.
- Choonki Jang, Jaejin Lee, Bernhard Egger, and Soojung Ryu. 2012. Automatic code overlay generation and partially redundant code fetch elimination. ACM Transactions on Architecture and Code Optimization 9, 2 (June 2012), 10:1--10:32.
- Seung Chul Jung, Aviral Shrivastava, and Ke Bai. 2010. Dynamic code mapping for limited local memory systems. In Proceedings of the 21st IEEE International Conference on Application-Specific Systems Architectures and Processors (ASAP’10). 13--20.
- Michael Kistler, Michael Perrone, and Fabrizio Petrini. 2006. Cell multiprocessor communication network: Built for speed. IEEE Micro 26, 3 (May 2006), 10--23.
- Lian Li, Hui Feng, and Jingling Xue. 2009. Compiler-directed scratchpad memory management via graph coloring. ACM Transactions on Architecture and Code Optimization 6, 3, Article 9 (Oct. 2009), 17 pages.
- Lian Li, Lin Gao, and Jingling Xue. 2005. Memory coloring: A compiler approach for scratchpad memory management. In Proceedings of 14th International Conference on Parallel Architectures and Compilation Techniques (PACT’05). 329--338.
- Jing Lu, Ke Bai, and Aviral Shrivastava. 2013. SSDM: Smart stack data management for software managed multicores (SMMs). In Proceedings of the 50th Annual Design Automation Conference (DAC’13). 149--156.
- Stefan Metzlaff, Irakli Guliashvili, Sascha Uhrig, and Theo Ungerer. 2011. A dynamic instruction scratchpad memory for embedded processors managed by hardware. Architecture of Computing Systems 6566 (2011), 122--134.
- Pierre Michaud, André Seznec, Damien Fetis, Yiannakis Sazeides, and Theofanis Constantinou. 2007. A study of thread migration in temperature-constrained multicores. ACM Transactions on Architecture and Code Optimization 4, 2, Article 9 (2007).
- Amit Pabalkar, Aviral Shrivastava, Arun Kannan, and Jongeun Lee. 2008. SDRM: Simultaneous determination of regions and function-to-region mapping for scratchpad memories. In Proceedings of 15th International Conference on High Performance Computing (HPC’08). 569--582.
- Martin Schoeberl. 2009. Time-predictable cache organization. In Software Technologies for Future Dependable Distributed Systems. 11--16.
- James E. Smith. 1981. A study of branch prediction strategies. In Proceedings of 8th Annual Symposium on Computer Architecture (ISCA’81). 135--148.
- Stefan Steinke, Nils Grunwald, Lars Wehmeyer, Rajeshwari Banakar, Mahesh Balakrishnan, and Peter Marwedel. 2002. Reducing energy consumption by dynamic copying of instructions onto on-chip memory. In Proceedings of 15th International Symposium on System Synthesis (ISSS’02). 213--218.
- Tom’s Hardware. 2010. Raw performance: SiSoftware Sandra 2010 Pro (GFLOPS).
- Loc Truong. 2009. Low Power Consumption and a Competitive Price Tag Make the Six-Core TMS320C6472 Ideal for High-Performance Applications. Technical Report. Texas Instruments.
- Sumesh Udayakumaran, Angel Dominguez, and Rajeev Barua. 2006. Dynamic allocation for scratch-pad memory using compile-time decisions. Transactions on Embedded Computing Systems 5, 2 (2006), 472--511.
- Kaushik Vaidyanathan, Qiuling Zhu, Lars Liebmann, Kafai Lai, Stephen Wu, Renzhi Liu, Yandong Liu, Andrzej Strojwas, and Larry Pileggi. 2015. Exploiting sub-20-nm complementary metal-oxide semiconductor technology challenges to design affordable systems-on-chip. Journal of Micro/Nanolithography, MEMS, and MOEMS 14, 1 (2015), 011007--011007.
- Manish Verma and Peter Marwedel. 2006. Overlay techniques for scratchpad memories in low power embedded processors. IEEE VLSI 14, 8 (2006), 802--815.
- Yi Xu, Yu Du, Youtao Zhang, and Jun Yang. 2011. A composite and scalable cache coherence protocol for large scale CMPs. In Proceedings of the International Conference on Supercomputing (ICS’11). 285--294.
Index Terms
Efficient Code Assignment Techniques for Local Memory on Software Managed Multicores