Abstract
The current trend in embedded computing is to increase the number of processing resources on a chip. Following this paradigm, cluster-based many-core accelerators with a shared hierarchical memory have emerged. Handling synchronization on these architectures is critical, because the speed-up of parallel implementations of embedded applications depends strongly on the ability to exploit the largest possible number of cores while limiting task management overhead. This article presents HARS (Hardware-Assisted Runtime Software), which combines a complete, low-overhead runtime software with a flexible hardware accelerator for synchronization. Experiments on a multicore test chip show that the hardware accelerator has an area overhead of less than 1% compared to a cluster of the chip, while reducing contention and synchronization latencies (by up to 2.8 times compared to a test-and-set implementation). The runtime software provides basic features such as memory management, as well as optimized execution engines that allow parallelism to be extracted easily and efficiently from applications written with multiple programming models. By combining hardware acceleration with a very low-overhead task scheduling technique, we show that HARS outperforms an optimized state-of-the-art task scheduler by 13% for the execution of a parallel application.
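For readers unfamiliar with the baseline named in the abstract, the following is a minimal software model of a test-and-set spinlock. It is an illustrative sketch only, not the implementation evaluated in the article: the names `TasSpinLock` and `run_workers` are hypothetical, and the atomic test-and-set instruction is emulated with a non-blocking `threading.Lock.acquire`, since Python exposes no such instruction directly.

```python
import threading


class TasSpinLock:
    """Spinlock modeled on a test-and-set primitive (illustrative only)."""

    def __init__(self):
        # Assumption: a non-blocking acquire atomically reads the flag and
        # sets it, standing in for a hardware test-and-set instruction.
        self._flag = threading.Lock()
        # Rough contention indicator; updated without synchronization,
        # so the value is approximate.
        self.failed_attempts = 0

    def _test_and_set(self):
        # Returns True if the flag was already set (lock held elsewhere).
        return not self._flag.acquire(blocking=False)

    def acquire(self):
        # Busy-wait: every retry is another access to the shared flag.
        while self._test_and_set():
            self.failed_attempts += 1

    def release(self):
        self._flag.release()


def run_workers(num_workers=4, increments=2000):
    """Increment a shared counter from several threads under the spinlock."""
    lock = TasSpinLock()
    counter = [0]

    def worker():
        for _ in range(increments):
            lock.acquire()
            counter[0] += 1  # critical section
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0], lock.failed_attempts


if __name__ == "__main__":
    total, retries = run_workers()
    print("counter:", total, "failed test-and-set attempts:", retries)
```

Each failed attempt in the spin loop is a repeated access to the same shared flag, which is the kind of contention that grows with the number of waiting cores.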
HARS: A hardware-assisted runtime software for embedded many-core architectures