ABSTRACT
Many image processing tasks are naturally expressed as a pipeline of small computational kernels known as stencils. Halide is a popular domain-specific language and compiler for implementing image processing algorithms. Halide uses simple language constructs to express what to compute and a separate scheduling co-language to express when and where to perform the computation. This approach has demonstrated performance comparable to or better than hand-optimized code. Until now, however, Halide has been restricted to shared-memory parallel execution, limiting its performance on memory-bandwidth-bound pipelines and large-scale image processing tasks.
We present an extension to Halide that supports distributed-memory parallel execution of complex stencil pipelines. The new directives compose with Halide's existing scheduling constructs, allowing complex computation and communication strategies to be expressed. Existing Halide applications can be distributed with minimal changes, letting programmers explore the tradeoff between recomputation and communication with little effort: approximately 10 new lines of code suffice even for a 200-line, 99-stage application. On nine image processing benchmarks, our extensions give up to a 1.4× speedup on a single node over regular multithreaded execution with the same number of cores, by mitigating the effects of non-uniform memory access. The distributed benchmarks achieve up to an 18× speedup on a 16-node test machine and up to a 57× speedup on 64 nodes of the NERSC Cori supercomputer.
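To make the algorithm/schedule separation and its composition with the distributed extension concrete, here is a minimal sketch of a two-stage separable blur. The shared-memory scheduling calls (parallel, vectorize, compute_at) are stock Halide; the distribute() directive follows the extension described in the paper, and its exact spelling and semantics here are an assumption rather than mainline Halide API:

```cpp
// Minimal sketch of a distributed Halide pipeline. distribute() is the
// paper's extension (an assumption here); it is not part of mainline Halide.
#include "Halide.h"
using namespace Halide;

int main() {
    Var x("x"), y("y");
    Func blur_x("blur_x"), blur_y("blur_y");
    Buffer<uint16_t> input(1024, 1024);

    // Algorithm: *what* to compute -- a 3x3 separable box blur.
    Func in = BoundaryConditions::repeat_edge(input);
    blur_x(x, y) = (in(x - 1, y) + in(x, y) + in(x + 1, y)) / 3;
    blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;

    // Schedule: *when and where* to compute it. The first two lines are
    // ordinary shared-memory scheduling; the third splits the output rows
    // across MPI ranks, with the needed ghost regions exchanged by the
    // generated code (per the paper's description).
    blur_y.parallel(y).vectorize(x, 8);
    blur_x.compute_at(blur_y, y);
    blur_y.distribute(y);  // assumed directive from the distributed extension

    Buffer<uint16_t> out = blur_y.realize({1024, 1024});
    return 0;
}
```

Because distribution is just another scheduling directive, the recomputation-versus-communication tradeoff the abstract mentions reduces to moving compute_at and distribute() calls, leaving the algorithm text untouched.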