Abstract
A common multi-core pattern consists of processors communicating through shared, multi-banked on-chip memory. Two approaches exist: interleaved address mapping, which spreads consecutive data over all banks, and contiguous address mapping, which stores consecutive data on a single bank.
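The difference between the two mappings can be illustrated with a minimal sketch. The bank count, bank size, and interleaving granularity below are hypothetical values chosen for illustration only; they are not taken from the paper.

```python
# Hypothetical parameters, chosen only for illustration.
NUM_BANKS = 16            # number of memory banks (assumed)
BANK_SIZE = 128 * 1024    # bytes per bank (assumed)
WORD_SIZE = 8             # interleaving granularity in bytes (assumed)


def interleaved_bank(addr: int) -> int:
    """Interleaved mapping: consecutive words rotate through all banks."""
    return (addr // WORD_SIZE) % NUM_BANKS


def contiguous_bank(addr: int) -> int:
    """Contiguous mapping: consecutive words stay in the same bank."""
    return addr // BANK_SIZE


# Bank index of four consecutive words under each mapping.
addrs = [i * WORD_SIZE for i in range(4)]
interleaved = [interleaved_bank(a) for a in addrs]
contiguous = [contiguous_bank(a) for a in addrs]
```

Under these assumptions, four consecutive words land in four different banks with interleaved mapping and in one bank with contiguous mapping, which is why contiguous mapping leaves the placement of each data block as a decision to be optimised.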
In this work, we compare both approaches on the Kalray MPPA-256 platform. For contiguous mapping, we propose an algorithm based on graph colouring techniques that automatically assigns data blocks to memory banks, with the goal of minimising access collisions and delays. Experiments with representative, parallel real-world benchmarks show that 69% of the tested configurations, when optimised for contiguous mapping by our algorithm, run up to 86% faster on average than with interleaved mapping.
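The general idea of treating bank assignment as graph colouring can be sketched as follows. This is a generic greedy heuristic, not the algorithm proposed in the paper: data blocks are nodes, banks are colours, and an edge weight is an assumed estimate of how often two blocks are accessed at the same time. The function name `assign_banks`, the block names, and the weights are hypothetical, and bank capacity is ignored.

```python
from collections import defaultdict


def assign_banks(blocks, conflicts, num_banks):
    """Greedy weighted colouring: blocks are nodes, banks are colours.

    conflicts maps a pair (a, b) to an assumed weight estimating how
    often blocks a and b are accessed concurrently.
    """
    weight = defaultdict(dict)
    for (a, b), w in conflicts.items():
        weight[a][b] = w
        weight[b][a] = w

    # Place the most heavily conflicting blocks first.
    order = sorted(blocks, key=lambda b: -sum(weight[b].values()))

    bank_of = {}
    for block in order:
        # Conflict weight this block would incur in each bank.
        cost = [0] * num_banks
        for other, w in weight[block].items():
            if other in bank_of:
                cost[bank_of[other]] += w
        # Pick the bank with the lowest cost (lowest index on ties).
        bank_of[block] = cost.index(min(cost))
    return bank_of


# Hypothetical example: four blocks, two banks.
blocks = ["A", "B", "C", "D"]
conflicts = {("A", "B"): 5, ("B", "C"): 3, ("A", "C"): 1}
banks = assign_banks(blocks, conflicts, num_banks=2)
```

In this example the two heaviest conflicts (A with B, B with C) are separated across banks, and only the lightest one (A with C) remains within a single bank. A practical version would also have to respect bank capacities and derive the weights from access traces or a schedule.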
Minimising Access Conflicts on Shared Multi-Bank Memory