Abstract
We propose a parallel program representation for heterogeneous systems, designed to enable performance portability across a wide range of popular parallel hardware, including GPUs, vector instruction sets, multicore CPUs and potentially FPGAs. Our representation, which we call HPVM, is a hierarchical dataflow graph with shared memory and vector instructions. HPVM supports three important capabilities for programming heterogeneous systems: a compiler intermediate representation (IR), a virtual instruction set (ISA), and a basis for runtime scheduling; previous systems focus on only one of these capabilities. As a compiler IR, HPVM aims to enable effective code generation and optimization for heterogeneous systems. As a virtual ISA, it can be used to ship executable programs, in order to achieve both functional portability and performance portability across such systems. At runtime, HPVM enables flexible scheduling policies, both through the graph structure and the ability to compile individual nodes in a program to any of the target devices on a system. We have implemented a prototype HPVM system, defining the HPVM IR as an extension of the LLVM compiler IR, compiler optimizations that operate directly on HPVM graphs, and code generators that translate the virtual ISA to NVIDIA GPUs, Intel's AVX vector units, and to multicore X86-64 processors. Experimental results show that HPVM optimizations achieve significant performance improvements, HPVM translators achieve performance competitive with manually developed OpenCL code for both GPUs and vector hardware, and that runtime scheduling policies can make use of both program and runtime information to exploit the flexible compilation capabilities. Overall, we conclude that the HPVM representation is a promising basis for achieving performance portability and for implementing parallelizing compilers for heterogeneous parallel systems.
- R. Allen and K. Kennedy. 2002. Optimizing Compilers for Modern Architectures. Morgan Kaufmann Publishers, Inc., San Francisco, CA. Google Scholar
Digital Library
- Jason Ansel, Cy Chan, Yee Lok Wong, Marek Olszewski, Qin Zhao, Alan Edelman, and Saman Amarasinghe. 2009. PetaBricks: A Language and Compiler for Algorithmic Choice (PLDI). Google Scholar
Digital Library
- E. A. Ashcroft and W. W. Wadge. 1977. Lucid, a Nonprocedural Language with Iteration. Commun. ACM (1977). Google Scholar
Digital Library
- CÃl'dric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-AndrÃl' Wacrenier. 2011. StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures. Concurrency and Computation: Practice and Experience (2011). Google Scholar
Digital Library
- Riyadh Baghdadi, Ulysse Beaugnon, Albert Cohen, Tobias Grosser, Michael Kruse, Chandan Reddy, Sven Verdoolaege, Adam Betts, Alas-tair F. Donaldson, Jeroen Ketema, Javed Absar, Sven van Haastregt, Alexey Kravets, Anton Lokhmotov, Robert David, and Elnar Hajiyev. 2015. PENCIL: A Platform-Neutral Compute Intermediate Language for Accelerator Programming. In Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT) (PACT '15). IEEE Computer Society, Washington, DC, USA, 138--149. Google Scholar
Digital Library
- Michael Bauer, Sean Treichler, Elliot Slaughter, and Alex Aiken. 2012. Legion: Expressing Locality and Independence with Logical Regions (SC). Google Scholar
Digital Library
- Tal Ben-Nun, Michael Sutton, Sreepathi Pai, and Keshav Pingali. 2017. Groute: An Asynchronous Multi-GPU Programming Model for Irregular Computations. In Proceedings of the 22Nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17). ACM, New York, NY, USA, 235--248. Google Scholar
Digital Library
- Nicolas Benoit and Stéphane Louise. 2010. Extending GCC with a Multi-grain Parallelism Adaptation Framework for MPSoCs. In 2nd Int'l Workshop on GCC Research Opportunities.Google Scholar
- Nicolas Benoit and Stéphane Louise. 2016. Using an Intermediate Representation to Map Workloads on Heterogeneous Parallel Systems. In 24th Euromicro Conference.Google Scholar
Cross Ref
- Zoran Budimlic, Michael Burke, Vincent CavÃl', Kathleen Knobe, Geoff Lowney, Ryan Newton, Jens Palsberg, David Peixotto, Vivek Sarkar, Frank Schlimbach, and Sagnak Tasirlar. 2010. Concurrent Collections. Scientific Programming 18, 3--4 (2010), 203--217. Google Scholar
Digital Library
- Li-wen Chang, Abdul Dakkak, Christopher I. Rodrigues, and Wen mei Hwu. 2015. Tangram: a High-level Language for Performance Portable Code Synthesis (MULTIPROG 2015).Google Scholar
- D.E. Culler, S.C. Goldstein, K.E. Schauser, and T. Voneicken. 1993. TAM - A Compiler Controlled Threaded Abstract Machine. Parallel and Distributed Computing. Google Scholar
Digital Library
- Kayvon Fatahalian, Daniel Reiter Horn, Timothy J. Knight, Larkhoon Leem, Mike Houston, Ji Young Park, Mattan Erez, Manman Ren, Alex Aiken, William J. Dally, and Pat Hanrahan. 2006. Sequoia: Programming the Memory Hierarchy (SC). Google Scholar
Digital Library
- HSA Foundation. 2015. HSAIL. (2015). Retrieved January 17, 2018from http://www.hsafoundation.com/standards/Google Scholar
- Vladimir Gajinov, Srdjan Stipic, Osman S. Unsal, Tim Harris, Eduard Ayguadé, and Adrián Cristal. 2012. Integrating Dataflow Abstractions into the Shared Memory Model. In 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing. 243--251. Google Scholar
Digital Library
- Vladimir Gajinov, Srdjan Stipic, Osman S. Unsal, Tim Harris, Eduard Ayguadé, and Adrián Cristal. 2012. Supporting Stateful Tasks in a Dataflow Graph. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT '12). ACM, New York, NY, USA, 435--436. Google Scholar
Digital Library
- Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Robert Manchek, and Vaidyalingam S. Sunderam. 1994. PVM: A Users' Guide and Tutorial for Networked Parallel Computing. MIT press. Google Scholar
Digital Library
- Google. 2013. Google Cloud Dataflow. (2013). Retrieved January 17, 2018 from https://cloud.google.com/dataflow/Google Scholar
- Dounia Khaldi, Pierre Jouvelot, Francois Irigoin, and Corinne Ancourt. 2012. SPIRE: A Methodology for Sequential to Parallel Intermediate Representation Extension (CPC).Google Scholar
- Khronos Group. 2012. SPIR 1.2 Specification. https://www.khronos.org/registry/spir/specs/spir_spec-1.2.pdf. (2012).Google Scholar
- Chris Lattner and Vikram Adve. 2004. LLVM: A Compilation Framework for Lifelong Program Analysis and Transformation. In Proc. Conf. on Code Generation and Optimization. San Jose, CA, USA, 75--88. Google Scholar
Digital Library
- Li-wen Chang. 2015. Personal Communication. (2015).Google Scholar
- D. Majeti and V. Sarkar. 2015. Heterogeneous Habanero-C (H2C): A Portable Programming Model for Heterogeneous Processors (IPDPS Workshop). Google Scholar
Digital Library
- Tim Mattson, Romain Cledat, Zoran Budimlic, Vincent Cave, Sanjay Chatterjee, Bala Seshasayee, Wijngaart Rob van der, and Vivek Sarkar. 2015. OCR: The Open Community Runtime Interface. Technical Report.Google Scholar
- Takamichi Miyamoto, Saori Asaka, Hiroki Mikami, Masayoshi Mase, Yasutaka Wada, Hirofumi Nakano, Keiji Kimura, and Hironori Kasahara. 2008. Parallelization with Automatic Parallelizing Compiler Generating Consumer Electronics Multicore API. In 2008 IEEE International Symposium on Parallel and Distributed Processing with Applications. IEEE. Google Scholar
Digital Library
- Rishiyur S. Nikhil. 1993. The Parallel Programming Language Id and its compilation for parallel machines (IJHSC).Google Scholar
- NVIDIA. 2009. PTX: Parallel Thread Execution ISA. http://docs.nvidia.com/cuda/parallel-thread-execution/index.html. (2009).Google Scholar
- NVIDIA. 2013. NVVM IR. http://docs.nvidia.com/cuda/nvvm-ir-spec. (2013).Google Scholar
- M. Okamoto, K. Yamashita, H. Kasahara, and S. Narita. 1995. Hierarchical macro-dataflow computation scheme. In IEEE Pacific Rim Conference on Communications, Computers, and Signal Processing. Proceedings. IEEE.Google Scholar
- LLVM Project. 2003. LLVM Language Reference Manual. (2003). Retrieved January 17, 2018from http://llvm.org/docs/LangRef.htmlGoogle Scholar
- Qualcomm Technologies, Inc. 2014. MARE: Enabling Applications for Heterogeneous Mobile Devices. Technical Report.Google Scholar
- Tao B. Schardl, William S. Moses, and Charles E. Leiserson. 2017. Tapir: Embedding Fork-Join Parallelism into LLVM's Intermediate Representation (PPoPP). Google Scholar
Digital Library
- John A Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniel Liu, and Wen-Mei W Hwu. 2012. Parboil: A revised benchmark suite for scientific and commercial throughput computing. Technical Report.Google Scholar
- Arvind K. Sujeeth, Kevin J. Brown, Hyoukjoong Lee, Tiark Rompf, Hassan Chafi, Martin Odersky, and Kunle Olukotun. 2014. Delite: A Compiler Architecture for Performance-Oriented Embedded Domain-Specific Languages (ACM TECS). Google Scholar
Digital Library
- William Thies, Michal Karczmarek, and Saman Amarasinghe. 2002. StreamIt: A Language for Streaming Applications (International Conference on Compiler Construction). Google Scholar
Digital Library
- Yasutaka Wada, Akihiro Hayashi, Takeshi Masuura, Jun Shirako, Hirofumi Nakano, Hiroaki Shikano, Keiji Kimura, and Hironori Kasahara. 2011. A Parallelizing Compiler Cooperative Heterogeneous Multicore Processor Architecture. Springer Berlin Heidelberg, Berlin, Heidelberg.Google Scholar
- Yonghong Yan, Jisheng Zhao, Yi Guo, and Vivek Sarkar. 2009. Hierarchical Place Trees: A Portable Abstraction for Task Parallelism and Data Movement. In Proceedings of the 22Nd International Conference on Languages and Compilers for Parallel Computing (LCPC'09). Springer-Verlag, Berlin, Heidelberg, 172--187. Google Scholar
Digital Library
- Jin Zhou and Brian Demsky. 2010. Bamboo: A Data-centric, Object-oriented Approach to Many-core Software. In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '10). ACM, New York, NY, USA, 388--399. Google Scholar
Digital Library
Index Terms
HPVM: heterogeneous parallel virtual machine
Recommendations
ApproxHPVM: a portable compiler IR for accuracy-aware optimizations
We propose ApproxHPVM, a compiler IR and system designed to enable accuracy-aware performance and energy tuning on heterogeneous systems with multiple compute units and approximation methods. ApproxHPVM automatically translates end-to-end application-...
HPVM: heterogeneous parallel virtual machine
PPoPP '18: Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel ProgrammingWe propose a parallel program representation for heterogeneous systems, designed to enable performance portability across a wide range of popular parallel hardware, including GPUs, vector instruction sets, multicore CPUs and potentially FPGAs. Our ...
Leveraging GPUs using cooperative loop speculation
Graphics processing units, or GPUs, provide TFLOPs of additional performance potential in commodity computer systems that frequently go unused by most applications. Even with the emergence of languages such as CUDA and OpenCL, programming GPUs remains a ...







Comments