Research Article · Open Access · Artifacts Available · Artifacts Evaluated & Functional

Static placement of computation on heterogeneous devices

Published: 12 October 2017

Abstract

Heterogeneous architectures characterize today's hardware, ranging from supercomputers to smartphones. In spite of this importance, programming such systems remains challenging. In particular, it is hard to map computations onto the different processors of a heterogeneous device. In this paper, we provide a static analysis that mitigates this problem. Our contributions are twofold: first, we provide a semi-context-sensitive algorithm that analyzes the program's call graph to determine the best processor for each calling context. This algorithm is parameterized by a cost model that takes into account each processor's characteristics and data-transfer time. Second, we show how to use simulated annealing to calibrate this cost model for a given heterogeneous architecture. We have used these ideas to build Etino, a tool that annotates C programs with OpenACC or OpenMP 4.0 directives. Etino generates code for a CPU-GPU architecture without user intervention. Experiments on classic benchmarks reveal speedups of up to 75x. Moreover, our calibration process avoids slowdowns of up to 720x that trivial parallelization approaches would incur.
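The abstract describes two components: a placement decision driven by a cost model that weighs processor characteristics against data-transfer time, and a simulated-annealing step that calibrates that model for a concrete machine. The sketch below illustrates both ideas in Python under loud assumptions: the linear cost model, the feature set, the squared-error loss, and every function name here are hypothetical illustrations, not Etino's actual model or API.

```python
import math
import random

def predicted_cost(weights, features):
    # Hypothetical linear cost model: estimated runtime of a kernel on a
    # device is a weighted sum of simple features (e.g. trip count, footprint).
    # The weights are what calibration must learn per architecture.
    return sum(w * f for w, f in zip(weights, features))

def best_device(weights_by_device, features, transfer_bytes, bandwidth):
    # Placement decision for one calling context: the GPU pays a
    # data-transfer penalty on top of its predicted compute cost.
    costs = {
        "cpu": predicted_cost(weights_by_device["cpu"], features),
        "gpu": predicted_cost(weights_by_device["gpu"], features)
               + transfer_bytes / bandwidth,
    }
    return min(costs, key=costs.get)

def calibrate(initial, loss, steps=5000, temp0=1.0, seed=0):
    """Simulated annealing over cost-model weights: perturb one weight at a
    time, always accept improvements, and accept worse states with
    probability exp(-delta / temp) under a linearly cooling temperature."""
    rng = random.Random(seed)
    current = list(initial)
    current_loss = loss(current)
    best, best_loss = list(current), current_loss
    for step in range(steps):
        temp = temp0 * (1.0 - step / steps) + 1e-9
        candidate = list(current)
        i = rng.randrange(len(candidate))
        candidate[i] += rng.gauss(0.0, 0.1)
        candidate_loss = loss(candidate)
        delta = candidate_loss - current_loss
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            current, current_loss = candidate, candidate_loss
        if current_loss < best_loss:
            best, best_loss = list(current), current_loss
    return best, best_loss
```

In the paper's setting, `loss` would compare the model's predictions against runtimes measured on the target architecture; annealing lets the search escape local minima that a greedy fit would get stuck in, which matters when the cost surface is noisy.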


