Abstract
This paper investigates the synchronization power of coalesced memory accesses, a family of memory access mechanisms introduced in recent large multicore architectures such as NVIDIA's CUDA graphics processors. We first design three memory access models that capture the fundamental features of these new memory access mechanisms. We then prove the exact synchronization power of each model in terms of its consensus number. These tight results show that coalesced memory access mechanisms can facilitate strong synchronization between the threads of multicore processors, without the need for synchronization primitives other than reads and writes.
Moreover, based on the intrinsic features of recent GPU architectures, we construct strong synchronization objects, such as wait-free and t-resilient read-modify-write objects, for a general model of recent GPU architectures that lack strong hardware synchronization primitives like test-and-set and compare-and-swap. Accesses to the wait-free objects have time complexity O(N), where N is the number of processes. Our result demonstrates that it is possible to construct wait-free synchronization mechanisms for GPUs without the need for strong synchronization primitives in hardware, and that wait-free programming is possible for GPUs.
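To give intuition for why atomic batched writes enable consensus using only reads and writes, the following is a minimal, hypothetical Python sketch. It is not the paper's actual memory access models or protocols: it simply models a "coalesced" memory segment in which each thread's write lands in its own slot and the segment records which write arrived first, and shows that such an object lets every thread decide on a single proposed value. The class and function names are invented for illustration.

```python
import threading

class CoalescedSegment:
    """Toy model of an aligned memory segment: concurrent writes by a
    group of threads are applied atomically, one slot per thread,
    loosely mimicking a coalesced memory transaction. (Illustrative
    only; a lock simulates the hardware's atomicity.)"""
    def __init__(self, size):
        self._slots = [None] * size
        self._order = []               # arrival order of first writes
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def coalesced_write(self, tid, value):
        # Each thread writes once into its own slot of the segment.
        with self._lock:
            if self._slots[tid] is None:
                self._slots[tid] = value
                self._order.append(tid)

    def snapshot(self):
        # Atomically read the whole segment and the arrival order.
        with self._lock:
            return list(self._order), list(self._slots)

def consensus(seg, tid, proposal):
    """Write own proposal, then decide on the value written by the
    earliest-arriving thread; once set, that winner never changes,
    so all threads decide the same proposed value (wait-free: each
    thread finishes in a bounded number of its own steps)."""
    seg.coalesced_write(tid, proposal)
    order, slots = seg.snapshot()
    return slots[order[0]]
```

Running `consensus` from N threads with distinct proposals yields the same decision value at every thread, and that value is one of the proposals, which is the agreement and validity needed for consensus.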
Non-blocking programming on multi-core graphics processors (extended abstract)