Abstract
We present Tangram, a programming system for writing performance-portable programs. The language enables programmers to write computation and composition codelets, supported by tuning knobs and primitives for expressing data parallelism and work decomposition. The compiler and runtime use a set of techniques such as hierarchical composition, coarsening, data placement, tuning, and runtime selection based on input characteristics and micro-profiling. The resulting performance is competitive with optimized vendor libraries.
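As a rough illustration of the runtime selection idea mentioned above, the following Python sketch times several interchangeable implementations of a sum reduction on a small sample of the input and then runs the fastest one on the full input. This is not Tangram syntax or the authors' implementation. The names `sum_serial`, `make_sum_tiled`, and `select_variant`, the tile sizes, and the sampling policy are all hypothetical, chosen only to show the general shape of a tuning knob plus micro-profiling.

```python
import time
from typing import Callable, Dict, Optional, Sequence

Variant = Callable[[Sequence[float]], float]


def sum_serial(data: Sequence[float]) -> float:
    """Hypothetical stand-in for a computation codelet: sequential reduction."""
    total = 0.0
    for x in data:
        total += x
    return total


def make_sum_tiled(tile: int) -> Variant:
    """Hypothetical stand-in for a composition codelet.

    `tile` plays the role of a tuning knob: the input is split into chunks
    of `tile` elements, each chunk is reduced, and the partial sums are
    reduced again.
    """
    def sum_tiled(data: Sequence[float]) -> float:
        partials = [sum_serial(data[i:i + tile])
                    for i in range(0, len(data), tile)]
        return sum_serial(partials)
    return sum_tiled


def select_variant(variants: Dict[str, Variant],
                   data: Sequence[float],
                   sample_size: int = 1024,
                   repeats: int = 3) -> Optional[str]:
    """Micro-profile each variant on a prefix of the input; return the fastest.

    Assumed policy: take the best of `repeats` timings per variant.
    """
    sample = data[:sample_size]
    best_name, best_time = None, float("inf")
    for name, fn in variants.items():
        elapsed = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fn(sample)
            elapsed = min(elapsed, time.perf_counter() - start)
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name


# Candidate variants: one serial version and two settings of the tile knob.
variants: Dict[str, Variant] = {
    "serial": sum_serial,
    "tiled_64": make_sum_tiled(64),
    "tiled_256": make_sum_tiled(256),
}

data = [float(i) for i in range(10000)]
chosen = select_variant(variants, data)   # winner depends on the machine
result = variants[chosen](data)           # every variant computes the same sum
```

Which variant wins depends on the machine and the input, which is the point of deferring the choice to run time. A real system would also have to keep the profiling cost small relative to the work being tuned.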
A programming system for future proofing performance critical libraries
PPoPP '16: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming