
Distributed Halide

Published: 27 February 2016

ABSTRACT

Many image processing tasks are naturally expressed as a pipeline of small computational kernels known as stencils. Halide is a popular domain-specific language and compiler for implementing image processing algorithms. Halide uses simple language constructs to express what to compute, and a separate scheduling co-language to express when and where to perform the computation. This approach has demonstrated performance comparable to or better than hand-optimized code. Until now, however, Halide has been restricted to parallel shared-memory execution, limiting its performance for memory-bandwidth-bound pipelines or large-scale image processing tasks.
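
This algorithm/schedule separation is easiest to see in code. The following minimal sketch (not taken from the paper; an ordinary blur written against Halide's public C++ API) expresses a 3-point horizontal blur once as an algorithm, then layers vectorization and parallelism on top purely through scheduling calls. The vector width of 8 is an illustrative choice:

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    // Algorithm: *what* to compute -- a 3-point horizontal blur.
    ImageParam input(Int(32), 2);
    Var x("x"), y("y");
    Func blur_x("blur_x");
    blur_x(x, y) = (input(x, y) + input(x + 1, y) + input(x + 2, y)) / 3;

    // Schedule: *when and where* to compute it -- vectorize the inner
    // loop and parallelize across rows, without touching the algorithm.
    blur_x.vectorize(x, 8).parallel(y);

    blur_x.compile_jit();
    return 0;
}
```

Changing the schedule (say, tiling instead of row parallelism) leaves the algorithm definition untouched, which is what makes exploring optimization strategies cheap.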

We present an extension to Halide to support distributed-memory parallel execution of complex stencil pipelines. These extensions compose with the existing scheduling constructs in Halide, allowing expression of complex computation and communication strategies. Existing Halide applications can be distributed with minimal changes, allowing programmers to explore the tradeoff between recomputation and communication with little effort. Approximately 10 new lines of code are needed even for a 200-line, 99-stage application. On nine image processing benchmarks, our extensions give up to a 1.4× speedup on a single node over regular multithreaded execution with the same number of cores, by mitigating the effects of non-uniform memory access. The distributed benchmarks achieve up to an 18× speedup on a 16-node testing machine and up to a 57× speedup on 64 nodes of the NERSC Cori supercomputer.
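
To make the "minimal changes" claim concrete, the sketch below distributes a two-stage blur across MPI ranks. The `DistributedImage` buffer type and the `distribute()` scheduling directive are the constructs the paper introduces; the surrounding details (global extents, MPI setup, the exact allocation calls) are a best-effort reconstruction rather than a verbatim excerpt, and exact signatures may differ in the released implementation:

```cpp
#include "Halide.h"
#include <mpi.h>
using namespace Halide;

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    // Each rank declares the *global* extents; the runtime computes the
    // local slice, plus ghost zones inferred from the stencil footprint.
    DistributedImage<int> input(4096, 4096), output(4096, 4096);

    Var x("x"), y("y");
    Func blur_x("blur_x"), blur_y("blur_y");
    blur_x(x, y) = (input(x - 1, y) + input(x, y) + input(x + 1, y)) / 3;
    blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;

    // The existing shared-memory schedule is unchanged...
    blur_y.parallel(y).vectorize(x, 8);
    // ...and composes with the new directive: split rows across ranks.
    blur_x.compute_root().distribute(y);
    blur_y.distribute(y);

    // Describe how input and output buffers are laid out across ranks.
    input.set_domain(x, y);
    input.placement().distribute(y);
    input.allocate();
    output.set_domain(x, y);
    output.placement().distribute(y);
    output.allocate();

    blur_y.realize(output);

    MPI_Finalize();
    return 0;
}
```

Distributing a different dimension, or computing an intermediate stage redundantly per rank instead of communicating it, is again a schedule-only change, which is how the recomputation-versus-communication tradeoff is explored.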


Published in

PPoPP '16: Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2016, 420 pages
ISBN: 9781450340922
DOI: 10.1145/2851141

Copyright © 2016 Owner/Author. This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher: Association for Computing Machinery, New York, NY, United States
