ABSTRACT
The major chip manufacturers have all introduced chip multiprocessing (CMP) and simultaneous multithreading (SMT) technology into their processing units. As a result, even low-end computing systems and game consoles have become shared memory multiprocessors with L1 and L2 cache sharing within a chip. Mid- and large-scale systems will have multiple processing chips and hence consist of an SMP-CMP-SMT configuration with non-uniform data sharing overheads. Current operating system schedulers are not aware of these new cache organizations, and as a result, distribute threads across processors in a way that causes many unnecessary, long-latency cross-chip cache accesses.
In this paper we describe the design and implementation of a scheme that schedules threads based on sharing patterns detected online using features of the standard performance monitoring units (PMUs) available in today's processors. The primary advantage of using the PMU infrastructure is that it is fine-grained (down to the level of a cache line) and has relatively low overhead. We have implemented our scheme in Linux running on an 8-way IBM POWER5 SMP-CMP-SMT multiprocessor. For commercial multithreaded server workloads (VolanoMark, SPECjbb, and RUBiS), we demonstrate reductions in cross-chip cache accesses of up to 70%, which translate into application-reported performance improvements of up to 7%.
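The core idea the abstract describes can be sketched as follows: each thread accumulates a "sharing signature" (a sparse vector of sampled remote-cache-access counts keyed by memory region, as a PMU could provide), and threads with overlapping signatures are clustered so the scheduler can place each cluster on the same chip. This is an illustrative sketch only; the function names, the dot-product similarity measure, the greedy clustering strategy, and the threshold are assumptions, not the paper's actual algorithm.

```python
# Hypothetical sketch of sharing-aware thread clustering. A sharing
# signature is a dict mapping a sampled memory region to an access count.

def similarity(sig_a, sig_b):
    """Dot product of two sparse sharing signatures (dict: region -> count)."""
    return sum(count * sig_b.get(region, 0) for region, count in sig_a.items())

def cluster_threads(signatures, threshold=1):
    """Greedy clustering: a thread joins the first cluster whose merged
    signature it shares regions with; otherwise it starts a new cluster.
    Returns a list of thread-id lists, one list per cluster."""
    clusters = []  # each entry: (merged_signature, [thread_ids])
    for tid, sig in signatures.items():
        for merged, members in clusters:
            if similarity(sig, merged) >= threshold:
                # Fold this thread's samples into the cluster signature.
                for region, count in sig.items():
                    merged[region] = merged.get(region, 0) + count
                members.append(tid)
                break
        else:
            clusters.append((dict(sig), [tid]))
    return [members for _, members in clusters]

# Threads 1 and 2 sample the same cache lines; thread 3 is independent.
sigs = {
    1: {"0xa000": 5, "0xa040": 3},
    2: {"0xa000": 2},
    3: {"0xf000": 7},
}
print(cluster_threads(sigs))  # -> [[1, 2], [3]]
```

A real scheduler would then pin each cluster to one chip (e.g. via CPU affinity), turning cross-chip cache accesses into cheaper intra-chip ones.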
Thread clustering: sharing-aware scheduling on SMP-CMP-SMT multiprocessors. In EuroSys'07 Conference Proceedings.