Abstract
Dataflow programming languages facilitate the design of data-intensive programs, such as the streaming applications commonly found in embedded systems. They also expose parallelism that can be exploited by multicore processors, which are now part of the mobile landscape. In recent years, a shift has occurred towards heterogeneity (e.g., ARM big.LITTLE) and reconfigurability. Dynamic Multicore Processors (DMPs) bridge the gap between fully reconfigurable processors and homogeneous multicore systems. They can reallocate their resources at runtime to create larger, more powerful logical processors fine-tuned to the workload. Unfortunately, no accurate method exists for determining how to partition the cores of a DMP among application threads. Programmers often rely on analyzing the application manually and applying a set of hand-picked heuristics. This leads to sub-optimal performance and reduces the potential of DMPs. What is needed is a way to determine the optimal partitioning and grouping of resources to maximize performance. As a first step, this paper studies the effect of thread partitioning and hardware resource allocation on a set of StreamIt applications. We show that the resulting space is non-trivial and exhibits large performance variation depending on the combination of parameters. We introduce a machine-learning-based methodology to tackle the complexity of this space. Our machine-learning model directly predicts the best combination of parameters using static code features. The predicted set of parameters leads to performance on par with the best performance found in a space of more than 32,000 configurations per application.
A machine learning approach to mapping streaming workloads to dynamic multicore processors
LCTES 2016: Proceedings of the 17th ACM SIGPLAN/SIGBED Conference on Languages, Compilers, Tools, and Theory for Embedded Systems