Efficient Code Assignment Techniques for Local Memory on Software Managed Multicores

Published: 08 December 2015

Abstract

Scaling the memory hierarchy is a major challenge as the number of cores in a multicore processor grows. Software Managed Multicore (SMM) architectures are one promising solution. An SMM architecture has no caches; each core has only a local scratchpad memory [Banakar et al. 2002]. Because the local memory is usually small, large applications cannot execute directly from it, and the code and data of the task mapped to each core must be managed between global memory and local memory. This article addresses the problem of efficiently managing code on an SMM architecture. The primary requirement for generating efficient code assignments is a correct management cost model, which we provide by proposing a cost calculation graph. In addition, we develop two heuristics, CMSM (Code Mapping for Software Managed multicores) and CMSM_advanced, that yield efficient code management on the local scratchpad memory. Experimental results from executing applications from the MiBench suite [Guthaus et al. 2001] demonstrate that merely adopting the correct management cost calculation, even with previous code assignment schemes, improves performance by an average of 12%. Combining the correct management cost model with a more optimized code mapping algorithm, our heuristics reduce runtime in more than 80% of the cases, and by up to 20% on our set of benchmarks, compared to the state-of-the-art code assignment approach [Jung et al. 2010]. When compared with Instruction-level Parallelism (ILP) results, CMSM_advanced performs 5% worse on average. We also simulate the benchmarks on a cache-based system and find that the code management overhead on an SMM core with our code management is much less than the memory latency of a cache-based system.

References

  1. Federico Angiolini et al. 2004. A post-compiler approach to scratchpad mapping of code. In Proceedings of the 2004 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES’04). 259--267.
  2. Todd Austin, Eric Larson, and Dan Ernst. 2002. SimpleScalar: An infrastructure for computer system modeling. Computer 35, 2 (Feb. 2002), 59--67.
  3. Ke Bai, Di Lu, and Aviral Shrivastava. 2011a. Vector class on limited local memory (LLM) multi-core processors. In Proceedings of the 14th International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES’11). 215--224.
  4. Ke Bai, Aviral Shrivastava, and Saleel Kudchadker. 2011b. Stack data management for limited local memory (LLM) multi-core processors. In Proceedings of the 2011 IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP). 231--234.
  5. Ke Bai, Jing Lu, Aviral Shrivastava, and Bryce Holton. 2013. CMSM: An efficient and effective code management for software managed multicores. In 2013 IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS). 1--9.
  6. Ke Bai and Aviral Shrivastava. 2010. Heap data management for limited local memory (LLM) multi-core processors. In Proceedings of the 8th IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS’10). 317--326.
  7. Ke Bai and Aviral Shrivastava. 2013a. A software-only scheme for managing heap data on limited local memory (LLM) multicore processors. ACM Transactions on Embedded Computing Systems (TECS) 13, 1 (2013), 5.
  8. Ke Bai and Aviral Shrivastava. 2013b. Automatic and efficient heap data management for limited local memory multicore architectures. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE’13). 593--598.
  9. Michael A. Baker, Amrit Panda, Nikhil Ghadge, Aniruddha Kadne, and Karam S. Chatha. 2010. A performance model and code overlay generator for scratchpad enhanced embedded processors. In Proceedings of the 2010 IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS’10). 287--296.
  10. Rajeshwari Banakar, Stefan Steinke, Bo-Sik Lee, Mahesh Balakrishnan, and Peter Marwedel. 2002. Scratchpad memory: Design alternative for cache on-chip memory in embedded systems. In Proceedings of the 10th International Symposium on Hardware/Software Codesign. 73--78.
  11. Garo Bournoutian and Alex Orailoglu. 2011. Dynamic, multi-core cache coherence architecture for power-sensitive mobile processors. In Proceedings of CODES+ISSS. 89--98.
  12. Byn Choi, Rakesh Komuravelli, Hyojin Sung, Robert Smolinski, Nima Honarmand, Sarita V. Adve, Vikram S. Adve, Nicholas P. Carter, and Ching-Tsun Chou. 2011. DeNovo: Rethinking the memory hierarchy for disciplined parallelism. In Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques (PACT’11). 155--166.
  13. Benoît Dupont de Dinechin, Pierre Guironnet de Massas, Guillaume Lager, Clément Léger, Benjamin Orgogozo, Jérôme Reybert, and Thierry Strudel. 2013. A distributed run-time environment for the Kalray MPPA®-256 integrated manycore processor. Procedia Computer Science 18 (2013), 1654--1663.
  14. Bernhard Egger, Seungkyun Kim, Choonki Jang, Jaejin Lee, Sang Lyul Min, and Heonshik Shin. 2010. Scratchpad memory management techniques for code in embedded systems without an MMU. IEEE Transactions on Computers 59, 8 (2010).
  15. Bernhard Egger, Jaejin Lee, and Heonshik Shin. 2006. Scratchpad memory management for portable systems with a memory management unit. In Proceedings of the 6th ACM & IEEE International Conference on Embedded Software (EMSOFT’06). 321--330.
  16. Brian Flachs, Shigehiro Asano, Sang Dhong, Peter Hofstee, Gilles Gervais, Roy Kim, Tien Le, Peichun Liu, Jens Leenstra, John Liberty, Brad Michael, Hwa-Joon Oh, Silvia Melitta Mueller, Osamu Takahashi, Akiyuki Hatakeyama, Yukio Watanabe, Naoka Yano, Daniel A. Brokenshire, Mohammad Peyravian, VanDung To, and Eiji Iwata. 2006. The microarchitecture of the synergistic processor for a cell processor. IEEE Solid-State Circuits 41, 1 (2006), 63--70.
  17. Antonio García-Guirado, Ricardo Fernández-Pascual, Alberto Ros, and José M. García. 2011. Energy-efficient cache coherence protocols in chip-multiprocessors for server consolidation. In Proceedings of the 2011 International Conference on Parallel Processing (ICPP’11). 51--62.
  18. Matthew R. Guthaus, Jeffrey S. Ringenberg, Dan Ernst, Todd M. Austin, Trevor Mudge, and Richard B. Brown. 2001. MiBench: A free, commercially representative embedded benchmark suite. In Proceedings of Workload Characterization. 3--14.
  19. Bryce Holton, Ke Bai, Aviral Shrivastava, and Harini Ramaprasad. 2014. Construction of GCCFG for inter-procedural optimizations in software managed manycore (SMM) architectures. In 2014 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES’14). 1--10.
  20. IBM. 2006. Programmer’s Guide: Software Development Kit for Multicore Acceleration Version 3.1. Technical Report.
  21. Intel. 2010. Intel core i7 processor extreme edition and intel core i7 processor datasheet, volume 1. In White paper. Intel.
  22. Intel. 2012. The SCC Programmer’s Guide. https://communities.intel.com/servlet/JiveServlet/previewBody/5684-102-8-22523/SCCProgrammersGuide.pdf. (2012).
  23. Andhi Janapsatya, Aleksandar Ignjatović, and Sri Parameswaran. 2006. A novel instruction scratchpad memory optimization method based on concomitance metric. In Proceedings of the Asia and South Pacific Conference on Design Automation (ASP-DAC). 612--617.
  24. Choonki Jang, Jaejin Lee, Bernhard Egger, and Soojung Ryu. 2012. Automatic code overlay generation and partially redundant code fetch elimination. ACM Transactions on Architecture and Code Optimization 9, 2 (June 2012), 10:1--10:32.
  25. Seung Chul Jung, Aviral Shrivastava, and Ke Bai. 2010. Dynamic code mapping for limited local memory systems. In Proceedings of the 21st IEEE International Conference on Application-Specific Systems Architectures and Processors (ASAP’10). 13--20.
  26. Michael Kistler, Michael Perrone, and Fabrizio Petrini. 2006. Cell multiprocessor communication network: Built for speed. IEEE Micro 26, 3 (May 2006), 10--23.
  27. Lian Li, Hui Feng, and Jingling Xue. 2009. Compiler-directed scratchpad memory management via graph coloring. ACM Transactions on Architecture and Code Optimization 6, 3, Article 9 (Oct. 2009), 17 pages.
  28. Lian Li, Lin Gao, and Jingling Xue. 2005. Memory coloring: A compiler approach for scratchpad memory management. In Proceedings of 14th International Conference on Parallel Architectures and Compilation Techniques (PACT’05). 329--338.
  29. Jing Lu, Ke Bai, and Aviral Shrivastava. 2013. SSDM: Smart stack data management for software managed multicores (SMMs). In Proceedings of the 50th Annual Design Automation Conference (DAC’13). 149--156.
  30. Stefan Metzlaff, Irakli Guliashvili, Sascha Uhrig, and Theo Ungerer. 2011. A dynamic instruction scratchpad memory for embedded processors managed by hardware. Architecture of Computing Systems 6566 (2011), 122--134.
  31. Pierre Michaud, André Seznec, Damien Fetis, Yiannakis Sazeides, and Theofanis Constantinou. 2007. A study of thread migration in temperature-constrained multicores. ACM Transactions on Architecture and Code Optimization 4, 2, Article 9 (2007).
  32. Amit Pabalkar, Aviral Shrivastava, Arun Kannan, and Jongeun Lee. 2008. SDRM: Simultaneous determination of regions and function-to-region mapping for scratchpad memories. In Proceedings of 15th International Conference on High Performance Computing (HPC’08). 569--582.
  33. Martin Schoeberl. 2009. Time-predictable cache organization. In Software Technologies for Future Dependable Distributed Systems. 11--16.
  34. James E. Smith. 1981. A study of branch prediction strategies. In Proceedings of 8th Annual Symposium on Computer Architecture (ISCA’81). 135--148.
  35. Stefan Steinke, Nils Grunwald, Lars Wehmeyer, Rajeshwari Banakar, Mahesh Balakrishnan, and Peter Marwedel. 2002. Reducing energy consumption by dynamic copying of instructions onto on-chip memory. In Proceedings of 15th International Symposium on System Synthesis (ISSS’02). 213--218.
  36. Tom’s Hardware. 2010. Raw performance: SiSoftware sandra 2010 pro (GFLOPS).
  37. Loc Truong. 2009. Low Power Consumption and a Competitive Price Tag Make the Six-Core TMS320C6472 Ideal for High-Performance Applications. Technical Report. Texas Instruments.
  38. Sumesh Udayakumaran, Angel Dominguez, and Rajeev Barua. 2006. Dynamic allocation for scratch-pad memory using compile-time decisions. Transactions on Embedded Computing Systems 5, 2 (2006), 472--511.
  39. Kaushik Vaidyanathan, Qiuling Zhu, Lars Liebmann, Kafai Lai, Stephen Wu, Renzhi Liu, Yandong Liu, Andrzej Strojwas, and Larry Pileggi. 2015. Exploiting sub-20-nm complementary metal-oxide semiconductor technology challenges to design affordable systems-on-chip. Journal of Micro/Nanolithography, MEMS, and MOEMS 14, 1 (2015), 011007--011007.
  40. Manish Verma and Peter Marwedel. 2006. Overlay techniques for scratchpad memories in low power embedded processors. IEEE VLSI 14, 8 (2006), 802--815.
  41. Yi Xu, Yu Du, Youtao Zhang, and Jun Yang. 2011. A composite and scalable cache coherence protocol for large scale CMPs. In Proceedings of the International Conference on Supercomputing (ICS’11). 285--294.