Abstract
Contemporary many-core architectures, such as Adapteva Epiphany and Sunway TaihuLight, employ per-core software-controlled Scratchpad Memory (SPM) rather than caches for better performance-per-watt and predictability. In these architectures, a core can access its own SPM as well as remote SPMs through the Network-on-Chip (NoC). However, the compiler or programmer must explicitly manage the movement of data between SPMs and off-chip memory. Utilizing SPMs for multi-threaded applications is even more challenging, as variables shared across threads need to be placed appropriately. Accessing variables in remote SPMs, which have higher access latency, further complicates the problem, as certain NoC links may be heavily contended by multiple threads. Therefore, certain variables may need to be replicated in multiple SPMs to reduce the contention delay and/or the overall access time. We present Coordinated Data Management (CDM), a compile-time framework that automatically identifies shared and private variables and places them, with replication if necessary, in suitable on-chip or off-chip memory, taking NoC contention into consideration. We develop both an exact Integer Linear Programming (ILP) formulation and an iterative, scalable algorithm for placing the data variables of multi-threaded applications on many-core SPMs. Experimental evaluation on the Parallella hardware platform confirms that our allocation strategy reduces the overall execution time and energy consumption by 1.84× and 1.83×, respectively, compared to existing approaches.
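To make the placement-with-replication idea concrete, the following is a minimal illustrative sketch, not the CDM algorithm or its ILP formulation. It assumes a hypothetical 2x2 mesh, made-up latencies (`LOCAL`, `HOP`, `OFFCHIP`), read-only variables (so replicas carry no consistency cost), and it ignores NoC contention entirely. All names, including `place` and `total_cost`, are invented for this example. It greedily adds the copy that most reduces total access time until no copy helps or the SPMs are full.

```python
from itertools import product

MESH = 2                                  # assumed 2x2 mesh of cores
CORES = list(product(range(MESH), range(MESH)))
LOCAL, HOP, OFFCHIP = 1, 2, 50            # assumed per-access latencies (cycles)


def hops(a, b):
    """Manhattan distance between two cores on the mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def access_cost(reader, copies):
    """Latency of one access by `reader`, given the cores that hold a copy."""
    if not copies:
        return OFFCHIP                    # no on-chip copy: go off-chip
    return min(LOCAL + HOP * hops(reader, c) for c in copies)


def total_cost(accesses, copies):
    """Sum of access latencies over all threads (one thread per core)."""
    return sum(n * access_cost(core, copies) for core, n in accesses.items())


def place(variables, capacity):
    """Greedy placement; variables maps name -> (size, {core: access_count})."""
    free = {c: capacity for c in CORES}
    placement = {name: set() for name in variables}
    while True:
        best = None                       # (gain, variable, core)
        for name, (size, acc) in variables.items():
            base = total_cost(acc, placement[name])
            for c in CORES:
                if c in placement[name] or free[c] < size:
                    continue
                gain = base - total_cost(acc, placement[name] | {c})
                if gain > 0 and (best is None or gain > best[0]):
                    best = (gain, name, c)
        if best is None:                  # no copy reduces the cost further
            return placement
        _, name, c = best
        placement[name].add(c)
        free[c] -= size


variables = {
    "a": (4, {(0, 0): 100}),              # private to core (0, 0)
    "b": (4, {(0, 0): 10, (1, 1): 10}),   # shared by two distant cores
}
placement = place(variables, capacity=4)
```

In this toy input, each SPM holds one variable. The private variable `a` takes the SPM of its only reader, and the shared variable `b` ends up replicated because a second copy next to its other reader still lowers the total latency. A contention-aware model, as the abstract describes, would additionally penalize placements that route many accesses over the same NoC links.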
Index Terms
Scratchpad-Memory Management for Multi-Threaded Applications on Many-Core Architectures