ABSTRACT
This paper presents our experience mapping the OpenMP parallel programming model to the IBM Cyclops-64 (C64) architecture. The C64 employs a many-core-on-a-chip design that integrates processing logic (160 thread units), embedded memory (5 MB), and communication hardware on the same die. This unique architecture presents new opportunities for optimization. Specifically, we consider three areas: (1) a memory-aware runtime library that places frequently used data structures in scratchpad memory; (2) a unique spin lock algorithm for shared-memory synchronization based on in-memory atomic instructions and native support for thread-level execution; (3) a fast barrier that directly uses C64 hardware support for collective synchronization. Together, the three optimizations reduce the overhead of OpenMP language constructs by 80%. We believe that such a drastic reduction in the cost of managing parallelism makes OpenMP more amenable to writing parallel programs on the C64 platform.
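The abstract does not spell out the spin lock algorithm, so the following is only a generic illustration of the idea in item (2): a lock built on an atomic fetch-and-add. It is a standard ticket lock written with C11 atomics and POSIX threads, not the C64 algorithm. All names (`ticket_lock_t`, `run_counter_demo`, and so on) are hypothetical, and the `sched_yield()` call is an assumption made for preemptive host systems.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/* Generic ticket lock (illustrative only, NOT the paper's C64 algorithm). */
typedef struct {
    atomic_uint next_ticket;  /* next ticket to hand out */
    atomic_uint now_serving;  /* ticket currently allowed to enter */
} ticket_lock_t;

static void ticket_lock_init(ticket_lock_t *l) {
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

static void ticket_lock_acquire(ticket_lock_t *l) {
    /* One atomic fetch-and-add per acquisition assigns a unique ticket. */
    unsigned my = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                            memory_order_relaxed);
    /* Spin with plain loads until our ticket is being served. */
    while (atomic_load_explicit(&l->now_serving,
                                memory_order_acquire) != my) {
        sched_yield();  /* assumption: be polite on a preemptive host OS */
    }
}

static void ticket_lock_release(ticket_lock_t *l) {
    /* Only the lock holder writes now_serving, so load + store suffices. */
    unsigned cur = atomic_load_explicit(&l->now_serving,
                                        memory_order_relaxed);
    atomic_store_explicit(&l->now_serving, cur + 1, memory_order_release);
}

/* Hypothetical demo: threads increment a shared counter under the lock. */
typedef struct {
    ticket_lock_t *lock;
    long *counter;
    long iters;
} worker_arg_t;

static void *worker(void *p) {
    worker_arg_t *a = p;
    for (long i = 0; i < a->iters; i++) {
        ticket_lock_acquire(a->lock);
        (*a->counter)++;
        ticket_lock_release(a->lock);
    }
    return NULL;
}

/* Runs nthreads workers and returns the final counter value. */
long run_counter_demo(int nthreads, long iters) {
    enum { MAX_THREADS = 64 };
    assert(nthreads > 0 && nthreads <= MAX_THREADS);

    pthread_t tid[MAX_THREADS];
    ticket_lock_t lock;
    long counter = 0;
    worker_arg_t arg = { &lock, &counter, iters };

    ticket_lock_init(&lock);
    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&tid[i], NULL, worker, &arg);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tid[i], NULL);
    }
    return counter;
}
```

If the lock provides mutual exclusion, `run_counter_demo(n, k)` returns `n * k`. A ticket lock grants the lock in arrival order and performs a single atomic read-modify-write per acquisition; whether the C64 design shares either property is not stated in the abstract.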
Landing OpenMP on Cyclops-64: an efficient mapping of OpenMP to a many-core system-on-a-chip