ABSTRACT
To improve the performance of a single application on Chip Multiprocessors (CMPs), the application must be split into threads that execute concurrently on multiple cores. In multi-threaded applications, critical sections are used to ensure that only one thread accesses shared data at any given time. Because threads must wait their turn to enter a critical section, critical sections can serialize the execution of threads, which significantly reduces performance and scalability.
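As a minimal illustration of the critical sections described above, the following Python sketch guards a shared counter with a lock. The function name `run_counter` and its parameters are hypothetical and chosen only for this example; the pattern is ordinary lock-based mutual exclusion, not anything specific to the paper.

```python
import threading


def run_counter(num_threads: int = 4, increments: int = 10_000) -> int:
    """Increment a shared counter from several threads under one lock."""
    counter = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal counter
        for _ in range(increments):
            # Critical section: only one thread at a time may update the counter.
            # Every other thread that reaches this point has to wait.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print(run_counter())
```

While one thread holds the lock, the others that want it are idle, which is the serialization effect the abstract refers to.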
This paper proposes Accelerated Critical Sections (ACS), a technique that leverages the high-performance core(s) of an Asymmetric Chip Multiprocessor (ACMP) to accelerate the execution of critical sections. In ACS, selected critical sections are executed by a high-performance core, which can execute the critical section faster than the other, smaller cores. As a result, ACS reduces serialization: it lowers the likelihood of threads waiting for a critical section to finish. Our evaluation on a set of 12 critical-section-intensive workloads shows that ACS reduces the average execution time by 34% compared to an equal-area 32T-core symmetric CMP and by 23% compared to an equal-area ACMP. Moreover, for 7 out of the 12 workloads, ACS improves scalability by increasing the number of threads at which performance saturates.
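The control flow of handing a critical section to a designated executor can be sketched in software as follows. This is only an analogy under stated assumptions, not the paper's hardware mechanism: the class `CriticalSectionServer` and the function `run_delegated` are hypothetical names, a single server thread stands in for the high-performance core, and in Python the server thread is no faster than the workers, so the sketch shows the hand-off pattern rather than any speedup.

```python
import queue
import threading
from concurrent.futures import Future


class CriticalSectionServer:
    """Hypothetical stand-in for the large core: runs submitted
    critical sections one at a time on a single dedicated thread."""

    def __init__(self) -> None:
        self._requests: queue.Queue = queue.Queue()
        self._thread = threading.Thread(target=self._serve, daemon=True)
        self._thread.start()

    def _serve(self) -> None:
        while True:
            item = self._requests.get()
            if item is None:  # shutdown sentinel
                break
            func, args, done = item
            try:
                done.set_result(func(*args))
            except Exception as exc:
                done.set_exception(exc)

    def execute(self, func, *args):
        """Ship a critical section to the server and block until it is done."""
        done: Future = Future()
        self._requests.put((func, args, done))
        return done.result()

    def shutdown(self) -> None:
        self._requests.put(None)
        self._thread.join()


def run_delegated(num_threads: int = 4, increments: int = 1_000) -> int:
    """Increment a shared counter by delegating each update to the server."""
    state = {"counter": 0}
    server = CriticalSectionServer()

    def bump() -> int:
        # In this sketch only the server thread runs this function,
        # so the update needs no lock of its own.
        state["counter"] += 1
        return state["counter"]

    def worker() -> None:
        for _ in range(increments):
            server.execute(bump)  # request the critical section, then wait

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    server.shutdown()
    return state["counter"]


if __name__ == "__main__":
    print(run_delegated())
```

The benefit claimed in the abstract comes from the executor being a faster core, so each critical section occupies the shared resource for less time and waiting threads are released sooner.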
Accelerating critical section execution with asymmetric multi-core architectures
ASPLOS 2009