Landing OpenMP on Cyclops-64: An Efficient Mapping of OpenMP to a Many-Core System-on-a-Chip

Published: 03 May 2006

ABSTRACT

This paper presents our experience mapping the OpenMP parallel programming model to the IBM Cyclops-64 (C64) architecture. The C64 employs a many-core-on-a-chip design that integrates processing logic (160 thread units), embedded memory (5 MB), and communication hardware on the same die. Such a unique architecture presents new opportunities for optimization. Specifically, we consider the following three areas: (1) a memory-aware runtime library that places frequently used data structures in scratchpad memory; (2) a unique spin lock algorithm for shared-memory synchronization based on in-memory atomic instructions and native support for thread-level execution; (3) a fast barrier that directly uses C64 hardware support for collective synchronization. Together, the three optimizations result in an 80% overhead reduction for language constructs in OpenMP. We believe that such a drastic reduction in the cost of managing parallelism makes OpenMP more amenable to writing parallel programs on the C64 platform.
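The abstract does not spell out the spin lock algorithm, so the following is only a generic illustration of a spin lock built on an atomic fetch-and-add, the kind of primitive that in-memory atomic instructions provide. It is a standard ticket lock written with portable C11 atomics, not the C64 implementation; the type and function names (`ticket_lock_t`, `ticket_lock_acquire`, and so on) are hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical ticket lock. This is a textbook technique used here as a
 * stand-in; it is NOT the algorithm from the paper, and C11 atomics are
 * assumed in place of the C64 in-memory atomic instructions. */
typedef struct {
    atomic_uint next_ticket;  /* next ticket number to hand out */
    atomic_uint now_serving;  /* ticket currently allowed to hold the lock */
} ticket_lock_t;

void ticket_lock_init(ticket_lock_t *l) {
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

void ticket_lock_acquire(ticket_lock_t *l) {
    /* One atomic fetch-and-add takes a ticket; no retry loop is needed here. */
    unsigned my_ticket =
        atomic_fetch_add_explicit(&l->next_ticket, 1, memory_order_relaxed);

    /* Spin with plain loads until our ticket comes up. */
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire)
           != my_ticket) {
        /* busy-wait */
    }
}

void ticket_lock_release(ticket_lock_t *l) {
    /* Only the lock holder writes now_serving, so load-then-store is safe. */
    unsigned cur =
        atomic_load_explicit(&l->now_serving, memory_order_relaxed);
    atomic_store_explicit(&l->now_serving, cur + 1, memory_order_release);
}
```

In this sketch each thread performs exactly one atomic read-modify-write per acquisition and then waits on ordinary loads, which also makes the lock first-come, first-served. Whether the paper's lock shares either property is not stated in the abstract.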
