DOI: 10.1145/1346281.1346318 (ACM Conferences, ASPLOS Conference Proceedings)
research-article

Merge: a programming model for heterogeneous multi-core systems

Published: 01 March 2008

ABSTRACT

In this paper we propose the Merge framework, a general-purpose programming model for heterogeneous multi-core systems. The Merge framework replaces current ad hoc approaches to parallel programming on heterogeneous platforms with a rigorous, library-based methodology that can automatically distribute computation across heterogeneous cores to achieve increased energy and performance efficiency. The Merge framework provides (1) a predicate dispatch-based library system for managing and invoking function variants for multiple architectures; (2) a high-level, library-oriented parallel language based on map-reduce; and (3) a compiler and runtime that implement the map-reduce language pattern by dynamically selecting the best available function implementations for a given input and machine configuration. Using a generic sequencer architecture interface for heterogeneous accelerators, the Merge framework can integrate function variants for specialized accelerators, offering the potential for to-the-metal performance on a wide range of heterogeneous architectures, all transparent to the user. The Merge framework has been prototyped on two platforms: a heterogeneous platform consisting of an Intel Core 2 Duo CPU and an 8-core, 32-thread Intel Graphics and Media Accelerator X3000, and a homogeneous 32-way Unisys SMP system with Intel Xeon processors. We implemented a set of benchmarks using the Merge framework and enhanced the library with X3000-specific implementations, achieving speedups of 3.6x to 8.5x using the X3000 and 5.2x to 22x using the 32-way system, relative to the straight C reference implementation on a single IA32 core.
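
The combination of predicate dispatch and map-reduce described in the abstract can be illustrated with a small sketch. This is not the Merge API or its implementation. Every name here (`Variant`, `VariantLibrary`, `make_library`, `map_reduce`, the `"cpu"` and `"accel"` target tags, and the numeric priorities) is a hypothetical stand-in chosen for illustration. The sketch assumes that each function variant carries a target architecture and a predicate over its input, and that the runtime picks the highest-priority variant whose target is present and whose predicate holds.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Dict, Iterable, List, Sequence


@dataclass
class Variant:
    """One implementation of a library function (hypothetical structure)."""
    name: str
    target: str                                    # assumed architecture tag
    predicate: Callable[[Sequence[float]], bool]   # constraint on the input
    impl: Callable[[Sequence[float]], float]
    priority: int                                  # higher wins when several apply


class VariantLibrary:
    """Registry that dispatches on machine configuration and input predicates."""

    def __init__(self, available_targets: Iterable[str]) -> None:
        self.available = set(available_targets)
        self.variants: Dict[str, List[Variant]] = {}

    def register(self, func_name: str, variant: Variant) -> None:
        self.variants.setdefault(func_name, []).append(variant)

    def select(self, func_name: str, data: Sequence[float]) -> Variant:
        # Keep only variants whose target exists on this machine and whose
        # predicate accepts this particular input.
        candidates = [
            v for v in self.variants.get(func_name, [])
            if v.target in self.available and v.predicate(data)
        ]
        if not candidates:
            raise LookupError(f"no applicable variant for {func_name!r}")
        return max(candidates, key=lambda v: v.priority)

    def call(self, func_name: str, data: Sequence[float]) -> float:
        return self.select(func_name, data).impl(data)


def sum_squares_generic(chunk: Sequence[float]) -> float:
    return sum(x * x for x in chunk)


def sum_squares_accel(chunk: Sequence[float]) -> float:
    # Stand-in for an accelerator-specific variant. It runs on the host here
    # and returns the same value as the generic variant.
    return sum(x * x for x in chunk)


def make_library(available_targets: Iterable[str]) -> VariantLibrary:
    lib = VariantLibrary(available_targets)
    lib.register("sum_squares", Variant(
        "generic", "cpu", lambda chunk: True, sum_squares_generic, priority=0))
    # Assumed rule: the accelerator variant only pays off for chunks of 4+ items.
    lib.register("sum_squares", Variant(
        "accel", "accel", lambda chunk: len(chunk) >= 4, sum_squares_accel,
        priority=10))
    return lib


def map_reduce(lib: VariantLibrary, func_name: str,
               data: Sequence[float], chunk_size: int) -> float:
    # Map: dispatch each chunk to the best applicable variant.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    partials = [lib.call(func_name, chunk) for chunk in chunks]
    # Reduce: combine the partial results.
    return reduce(lambda a, b: a + b, partials, 0.0)
```

With both targets available, `map_reduce(make_library(["cpu", "accel"]), "sum_squares", [1, 2, 3, 4, 5, 6], 4)` sends the first chunk of four elements to the `accel` variant and the remaining two elements to the `generic` variant, then sums the partial results. On a machine configured with only `"cpu"`, every chunk falls back to `generic` and the caller's code is unchanged, which is the transparency the abstract describes.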
