HARS: A hardware-assisted runtime software for embedded many-core architectures

Published: 28 March 2014

Abstract

The current trend in embedded computing is to increase the number of processing resources on a chip. Following this trend, cluster-based many-core accelerators with a shared hierarchical memory have emerged. Handling synchronization on these architectures is critical, since the speed-up of parallel implementations of embedded applications depends strongly on the ability to exploit the largest possible number of cores while limiting task-management overhead. This article presents HARS (Hardware-Assisted Runtime Software), the combination of a complete, low-overhead runtime software and a flexible hardware accelerator for synchronization. Experiments on a multicore test chip showed that the hardware accelerator for synchronization has less than 1% area overhead compared to a cluster of the chip, while reducing contention and synchronization latencies (by up to 2.8 times compared to a test-and-set implementation). The runtime software offers basic features such as memory management, as well as optimized execution engines that allow easy and efficient extraction of parallelism in applications with multiple programming models. By using the hardware acceleration together with a very low-overhead software task-scheduling technique, we show that HARS outperforms an optimized state-of-the-art task scheduler by 13% for the execution of a parallel application.
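
The abstract takes a test-and-set implementation as the software baseline for synchronization latency. For readers unfamiliar with the technique, a minimal sketch of a test-and-set spinlock in C11 follows. The names `tas_lock_t`, `tas_lock_init`, `tas_try_lock`, `tas_lock`, and `tas_unlock` are hypothetical and chosen for illustration only; this is not the implementation measured on the test chip, and it assumes a toolchain that provides `<stdatomic.h>`.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: a single atomic flag shared by all cores. */
typedef struct {
    atomic_flag held;
} tas_lock_t;

/* Put the lock in the released state. */
static void tas_lock_init(tas_lock_t *l)
{
    atomic_flag_clear(&l->held);
}

/* One test-and-set attempt: atomically set the flag and read its
 * previous value. The lock is acquired only if the flag was clear. */
static bool tas_try_lock(tas_lock_t *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held,
                                              memory_order_acquire);
}

/* Spin until an attempt succeeds. Every failed attempt is still an
 * atomic read-modify-write on the shared location, which is one source
 * of the contention that hardware support aims to reduce. */
static void tas_lock(tas_lock_t *l)
{
    while (!tas_try_lock(l)) {
        /* busy-wait */
    }
}

/* Release the lock so that another core's test-and-set can succeed. */
static void tas_unlock(tas_lock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

In this sketch all waiting cores repeatedly access the same memory location, so both latency and interconnect traffic tend to grow with the number of contenders; that is the general motivation for moving synchronization into a dedicated hardware unit.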
