
Scratchpad-Memory Management for Multi-Threaded Applications on Many-Core Architectures

Published: 05 February 2019

Abstract

Contemporary many-core architectures, such as Adapteva Epiphany and Sunway TaihuLight, employ per-core software-controlled Scratchpad Memory (SPM) rather than caches for better performance-per-watt and predictability. In these architectures, a core is allowed to access its own SPM as well as remote SPMs through the Network-On-Chip (NoC). However, the compiler/programmer is required to explicitly manage the movement of data between SPMs and off-chip memory. Utilizing SPMs for multi-threaded applications is even more challenging, as the shared variables across the threads need to be placed appropriately. Accessing variables from remote SPMs with higher access latency further complicates this problem as certain links in the NoC may be heavily contended by multiple threads. Therefore, certain variables may need to be replicated in multiple SPMs to reduce the contention delay and/or the overall access time. We present Coordinated Data Management (CDM), a compile-time framework that automatically identifies shared/private variables and places them with replication (if necessary) to suitable on-chip or off-chip memory, taking NoC contention into consideration. We develop both an exact Integer Linear Programming (ILP) formulation as well as an iterative, scalable algorithm for placing the data variables in multi-threaded applications on many-core SPMs. Experimental evaluation on the Parallella hardware platform confirms that our allocation strategy reduces the overall execution time and energy consumption by 1.84× and 1.83×, respectively, when compared to the existing approaches.
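The placement problem the abstract describes, choosing local SPM, remote SPM, or off-chip memory for each variable, with optional replication, can be illustrated with a toy cost model. This is a minimal sketch, not the paper's actual CDM/ILP formulation: the latency constants, the per-core capacity expressed in "slots," and the greedy ranking are all invented for illustration, and NoC contention is ignored.

```python
# Toy model of SPM data placement (illustrative, not the paper's CDM/ILP).
# Hypothetical cycle costs for a local SPM hit, a remote SPM access over
# the NoC, and an off-chip DRAM access.
LOCAL_SPM, REMOTE_SPM, OFF_CHIP = 1, 8, 50

def access_cost(accesses, placement):
    """accesses: {(thread, var): count}; placement: {var: set of cores
    holding a copy} (no entry / empty set = off-chip only).
    Each thread reads its cheapest available copy."""
    total = 0
    for (thread, var), count in accesses.items():
        copies = placement.get(var, set())
        if thread in copies:
            total += count * LOCAL_SPM      # copy in the thread's own SPM
        elif copies:
            total += count * REMOTE_SPM     # copy in some remote SPM
        else:
            total += count * OFF_CHIP       # fall back to off-chip memory
    return total

def greedy_place(accesses, spm_slots):
    """Greedily give each core copies of the variables it accesses most,
    replicating a variable across cores while capacity (in variables,
    `spm_slots` per core) allows."""
    placement, used = {}, {}
    ranked = sorted(accesses.items(), key=lambda kv: -kv[1])
    for (thread, var), count in ranked:
        if used.get(thread, 0) < spm_slots:
            placement.setdefault(var, set()).add(thread)
            used[thread] = used.get(thread, 0) + 1
    return placement
```

With two threads both hammering a shared variable, the greedy pass replicates it into both SPMs, which is the intuition behind the paper's replication decision; the real framework additionally weighs NoC contention and copy overhead.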



    Reviews

    Joseph M. Arul

This paper focuses on improving many-core architectures via software-programmable memory, or scratchpad memory (SPM). An SPM contains an array of [static random-access memory, SRAM] cells, and a portion of the memory address space is dedicated to it; any address that falls within this dedicated address space directly indexes into the SPM to access the corresponding data. By maintaining a dedicated address range, enforcing "coherency among multiple SPMs" in hardware can be eliminated: software-level control of data placement "thereby eliminat[es] the hardware area/power required for cache coherence," as well as cache access. In a many-core environment, data accesses spread across many cores can drastically reduce performance due to coherency traffic and long access delays. In a many-core, multi-threaded setting with both on-chip and off-chip memory, data accesses become nonuniform, long-latency, and irregular. To overcome these difficulties, the paper proposes "a compile-time, coordinated data management framework called CDM, for many-core SPMs." On the evaluation platform, "the 16-core Epiphany SoC consists of an array of simple RISC processors (eCores) programmable in C connected together in a 2D-mesh NoC and supporting a single shared address space." Because the eCores are hosted alongside a Xilinx Zynq system on chip (SoC) on the same development board, the platform is more energy efficient than a traditional cache-based design. The eCores can access not only their local memory but also remote memory. Several kernels from embedded, multi-threaded benchmarks are used in the evaluation, including two benchmarks related to the decryption and encryption of data (AESD and AESE) and three long-term evolution (LTE) benchmarks (PHY_ACI, PHY_DEMAP, and PHY_MICF).
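The address check described above, an address inside the dedicated window indexes the local SRAM directly, everything else goes out over the NoC or to off-chip memory, can be sketched as follows. The base address and SPM size are made-up placeholders, not the Epiphany's actual memory map.

```python
# Sketch of the SPM address-window check the review describes.
# Base and size are hypothetical, not the Epiphany's real memory map.
SPM_BASE = 0x0000   # assumed start of the local SPM window
SPM_SIZE = 0x8000   # assumed 32 KB of local SPM

def spm_offset(addr):
    """Return the byte offset into the local SPM if `addr` falls inside
    the dedicated SPM window, or None if the access must be routed to a
    remote SPM or off-chip memory instead."""
    if SPM_BASE <= addr < SPM_BASE + SPM_SIZE:
        return addr - SPM_BASE
    return None
```

Because the check is a simple range comparison on a fixed window, no tag lookup or coherence traffic is needed, which is exactly the hardware saving the review quotes.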
The authors use a GREEDY approach as their baseline; SNAP-S allows only one copy of each datum, while SNAP-M uses replication. As a result, "the SNAP-M approach provides an average speed-up of 1.84x and an energy reduction of 1.83x when compared to the GREEDY strategy," while SNAP-S "provides an average speed-up and energy reduction of 1.09x." Both approaches thus speed up execution and reduce energy use, since no cache-like memory consumes extra power on each data access. The authors exploit bringing off-chip data into on-chip memory rather than relying on caches, and the SoC design further reduces energy consumption. A new class of memory is emerging that can drastically reduce power consumption and is faster than DRAM and cache; once such memory comes into use, this work may become obsolete. The overhead of bringing off-chip data into on-chip memory must also be considered. Moreover, the SNAP-S speed-up over the GREEDY strategy is not significant; substantial improvement is observed only when data is replicated. One would expect a significant gain from SNAP-S alone, since even remote memory accesses are converted into local ones; however, this is not borne out by the experimental results.
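The reviewer's caveat about copy overhead can be phrased as a break-even inequality: replication pays off only when the cycles saved by turning remote accesses into local ones exceed the one-time cost of copying the data into the extra SPM. The function and all its example numbers below are illustrative, not measurements from the paper.

```python
# Back-of-the-envelope test of the reviewer's point about replication
# overhead. All cycle costs here are illustrative, not measured values.

def replication_pays_off(n_accesses, remote_cost, local_cost, copy_cost):
    """True if converting n remote accesses into local ones saves more
    cycles than the one-time copy into the replica costs."""
    return n_accesses * (remote_cost - local_cost) > copy_cost
```

Under these assumed costs, a frequently accessed variable (say 100 accesses saving 7 cycles each against a 200-cycle copy) justifies a replica, while a rarely accessed one does not, which matches the observation that SNAP-M's replication, not SNAP-S's single-copy placement, is where the significant gains appear.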
