
ABSTRACT

Multiprocessor operating systems (OSs) pose several unique and conflicting challenges to System Virtual Machines (System VMs). For example, most existing system VMs resort to gang scheduling a guest OS's virtual processors (VCPUs) to avoid OS synchronization overhead. However, gang scheduling is infeasible for some application domains, and is inflexible in other domains.

In an overcommitted environment, an individual guest OS has more VCPUs than available physical processors (PCPUs), precluding the use of gang scheduling. In such an environment, we demonstrate a more than two-fold increase in runtime when transparently virtualizing a chip-multiprocessor's cores. To combat this problem, we propose a hardware technique to detect several cases when a VCPU is not performing useful work, and suggest preempting that VCPU to run a different, more productive VCPU. Our technique can dramatically reduce cycles wasted on OS synchronization, without requiring any semantic information from the software.

We then present a case study, typical of server consolidation, to demonstrate the potential of more flexible scheduling policies enabled by our technique. We propose one such policy that logically partitions the CMP cores between guest VMs. This policy increases throughput by 10-25% for consolidated server workloads due to improved cache locality and core utilization, and substantially improves performance isolation in private caches.
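The preemption idea in the abstract can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the paper's mechanism: the abstract does not say how the hardware detects an unproductive VCPU, so the `spinning` flag below simply stands in for that signal, and the `VCPU` class, `schedule_slice` function, and round-robin queue are hypothetical names chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class VCPU:
    name: str
    # Hypothetical stand-in for the hardware signal that a VCPU is
    # not performing useful work (for example, spinning on a lock).
    spinning: bool = False


def schedule_slice(run_queue: deque, num_pcpus: int) -> list:
    """Pick up to num_pcpus VCPUs for one time slice.

    VCPUs flagged as spinning are passed over in favor of productive
    ones; they run only if a PCPU would otherwise sit idle.
    """
    chosen, skipped = [], []
    while run_queue and len(chosen) < num_pcpus:
        vcpu = run_queue.popleft()
        (skipped if vcpu.spinning else chosen).append(vcpu)
    # Fall back to spinning VCPUs rather than leaving a PCPU idle.
    while skipped and len(chosen) < num_pcpus:
        chosen.append(skipped.pop(0))
    # Requeue: VCPUs that were passed over go ahead of those that just ran.
    run_queue.extend(skipped)
    run_queue.extend(chosen)
    return chosen


# Overcommitted guest: four VCPUs, two PCPUs, one VCPU currently spinning.
queue = deque([VCPU("v0", spinning=True), VCPU("v1"), VCPU("v2"), VCPU("v3")])
running = schedule_slice(queue, num_pcpus=2)
print([v.name for v in running])
```

In this model the spinning VCPU `v0` is skipped for the slice and placed near the front of the queue, so it is reconsidered as soon as its flag clears. A real hypervisor would also have to account for fairness and for how quickly the flag becomes stale, which this sketch ignores.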

AUTHORS



Philip M. Wells

Bibliometrics: publication history
Publication years: 1983-2012
Publication count: 10
Citation count: 129
Available for download: 7
Downloads (6 weeks): 23
Downloads (12 months): 280
Downloads (cumulative): 6,536
Average downloads per article: 933.71
Average citations per article: 12.90


Koushik Chakraborty

Bibliometrics: publication history
Publication years: 2006-2016
Publication count: 36
Citation count: 126
Available for download: 26
Downloads (6 weeks): 75
Downloads (12 months): 874
Downloads (cumulative): 7,244
Average downloads per article: 278.62
Average citations per article: 3.50


Gurindar S. Sohi

Bibliometrics: publication history
Publication years: 1985-2014
Publication count: 97
Citation count: 2,793
Available for download: 68
Downloads (6 weeks): 91
Downloads (12 months): 1,482
Downloads (cumulative): 36,792
Average downloads per article: 541.06
Average citations per article: 28.79

CITED BY

14 Citations

PUBLICATION

Title: PACT '06 Proceedings of the 15th international conference on Parallel architectures and compilation techniques
General Chairs: Erik Altman, IBM Research, USA
Program Chairs: Kevin Skadron, University of Virginia, USA
                Ben Zorn, Microsoft Research, USA
Pages: 124-133
Publication Date: 2006-09-16 (yyyy-mm-dd)
Sponsor: ACM (Association for Computing Machinery)
Publisher: ACM, New York, NY, USA ©2006
ISBN: 1-59593-264-X
Order Number: 415062
doi>10.1145/1152154.1152176
Conference: PACT (Parallel Architectures and Compilation Techniques)

Overall Acceptance Rate: 244 of 1,069 submissions, 23%

Year        Submitted   Accepted   Rate
PACT '95           85         39    46%
PACT '08          159         30    19%
PACT '10          266         46    17%
PACT '12          207         39    19%
PACT '13          208         36    17%
PACT '14          144         54    38%
Overall         1,069        244    23%

Table of Contents

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Experiences with MapReduce, an abstraction for large-scale computation
Jeffrey Dean
Pages: 1-1
doi>10.1145/1152154.1152155

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a Map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a Reduce function that ...
SESSION: Multi-core design I
Architectural support for operating system-driven CMP cache management
Nauman Rafique, Won-Taek Lim, Mithuna Thottethodi
Pages: 2-12
doi>10.1145/1152154.1152160

The role of the operating system (OS) in managing shared resources such as CPU time, memory, peripherals, and even energy is well motivated and understood [23]. Unfortunately, one key resource—lower-level shared cache in chip multi-processors—is ...
Communist, utilitarian, and capitalist cache policies on CMPs: caches as a shared resource
Lisa R. Hsu, Steven K. Reinhardt, Ravishankar Iyer, Srihari Makineni
Pages: 13-22
doi>10.1145/1152154.1152161

As chip multiprocessors (CMPs) become increasingly mainstream, architects have likewise become more interested in how best to share a cache hierarchy among multiple simultaneous threads of execution. The complexity of this problem is exacerbated as the ...
Core architecture optimization for heterogeneous chip multiprocessors
Rakesh Kumar, Dean M. Tullsen, Norman P. Jouppi
Pages: 23-32
doi>10.1145/1152154.1152162

Previous studies have demonstrated the advantages of single-ISA heterogeneous multi-core architectures for power and performance. However, none of those studies examined how to design such a processor; instead, they started with an assumed combination ...
SESSION: Program analysis and optimization
Compiling for stream processing
Abhishek Das, William J. Dally, Peter Mattson
Pages: 33-42
doi>10.1145/1152154.1152164

This paper describes a compiler for stream programs that efficiently schedules computational kernels and stream memory operations, and allocates on-chip storage. Our compiler uses information about the program structure and estimates of kernel and memory ...
Region array SSA
Silvius Rus, Guobin He, Christophe Alias, Lawrence Rauchwerger
Pages: 43-52
doi>10.1145/1152154.1152165

Static Single Assignment (SSA) has become the intermediate program representation of choice in most modern compilers because it enables efficient data flow analysis of scalars and thus leads to better scalar optimizations. Unfortunately not much progress ...
A two-phase escape analysis for parallel java programs
Kyungwoo Lee, Samuel P. Midkiff
Pages: 53-62
doi>10.1145/1152154.1152166

Thread escape analysis conservatively determines which objects may be accessed in more than one thread. Thread escape analysis is useful for a variety of purposes—finding races in multi-threaded programs, removing useless synchronization, allocating ...
Challenges and opportunities in the post single-thread-processor era
Steve Scott
Pages: 63-63
doi>10.1145/1152154.1152156

The age of the single thread juggernaut has ended, due to a variety of factors. Multi-core processors are coming on strong, and scaling is being stressed more than ever. This presents a number of architectural, hardware and software challenges. This ...
SESSION: Security and correctness
Self-checking instructions: reducing instruction redundancy for concurrent error detection
Sumeet Kumar, Aneesh Aggarwal
Pages: 64-73
doi>10.1145/1152154.1152168

With reducing feature size, increasing chip capacity, and increasing clock speed, microprocessors are becoming increasingly susceptible to transient (soft) errors. Redundant multi-threading (RMT) is an attractive approach for concurrent error ...
A low-cost memory remapping scheme for address bus protection
Lan Gao, Jun Yang, Marek Chrobak, Youtao Zhang, San Nguyen, Hsien-Hsin S. Lee
Pages: 74-83
doi>10.1145/1152154.1152169

The address sequence on the processor-memory bus can reveal abundant information about the control flow of a program. This can lead to critical information leakage such as encryption keys or proprietary algorithms. Addresses can be observed by attaching ...
Efficient data protection for distributed shared memory multiprocessors
Brian Rogers, Milos Prvulovic, Yan Solihin
Pages: 84-94
doi>10.1145/1152154.1152170

Data security in computer systems has recently become an increasing concern, and hardware-based attacks have emerged. As a result, researchers have investigated hardware encryption and authentication mechanisms as a means of addressing this security ...
SESSION: Characterizing program behavior
Wavelet-based phase classification
Ted Huffmire, Tim Sherwood
Pages: 95-104
doi>10.1145/1152154.1152172

Phase analysis has proven to be a useful method of summarizing the time-varying behavior of programs, with uses ranging from reducing simulation time to guiding run-time optimizations. Although phase classification techniques based on basic block vectors ...
Complexity-based program phase analysis and classification
Chang-Burm Cho, Tao Li
Pages: 105-113
doi>10.1145/1152154.1152173

Modeling and analysis of program behavior are at the foundation of computer system design and optimization. As computer systems become more adaptive, their efficiency increasingly depends on program dynamic characteristics. Previous studies have revealed ...
Performance prediction based on inherent program similarity
Kenneth Hoste, Aashish Phansalkar, Lieven Eeckhout, Andy Georges, Lizy K. John, Koen De Bosschere
Pages: 114-122
doi>10.1145/1152154.1152174

A key challenge in benchmarking is to predict the performance of an application of interest on a number of platforms in order to determine which platform yields the best performance. This paper proposes an approach for doing this. We measure a number ...
Deep computing in biology: challenges and progress
Ajay Royyuru
Pages: 123-123
doi>10.1145/1152154.1152157

The Computational Biology Center at IBM Research pursues basic and exploratory research at the interface of information technology and biology. Information technology plays a vital role in enabling new science and discovery in biology. Advances in high ...
SESSION: Multi-core design II
Hardware support for spin management in overcommitted virtual machines
Philip M. Wells, Koushik Chakraborty, Gurindar S. Sohi
Pages: 124-133
doi>10.1145/1152154.1152176

Multiprocessor operating systems (OSs) pose several unique and conflicting challenges to System Virtual Machines (System VMs). For example, most existing system VMs resort to gang scheduling a guest OS's virtual processors (VCPUs) to avoid OS synchronization ...
Testing implementations of transactional memory
Chaiyasit Manovit, Sudheendra Hangal, Hassan Chafi, Austen McDonald, Christos Kozyrakis, Kunle Olukotun
Pages: 134-143
doi>10.1145/1152154.1152177

Transactional memory is an attractive design concept for scalable multiprocessors because it offers efficient lock-free synchronization and greatly simplifies parallel software. Given the subtle issues involved with concurrency and atomicity, ...
Efficient emulation of hardware prefetchers via event-driven helper threading
Ilya Ganusov, Martin Burtscher
Pages: 144-153
doi>10.1145/1152154.1152178

The advance of multi-core architectures provides significant benefits for parallel and throughput-oriented computing, but the performance of individual computation threads does not improve and may even suffer a penalty because of the increased contention ...
SESSION: Performance profiling and tuning
DEP: detailed execution profile
Qin Zhao, Joon Edward Sim, Weng-Fai Wong, Larry Rudolph
Pages: 154-163
doi>10.1145/1152154.1152180

In many areas of computer architecture design and program development, the knowledge of dynamic program behavior can be very handy. Several challenges beset the accurate and complete collection of dynamic control flow and memory reference information. ...
Whole-program optimization of global variable layout
Nathaniel McIntosh, Sandya Mannarswamy, Robert Hundt
Pages: 164-172
doi>10.1145/1152154.1152181

On machines with high-performance processors, the memory system continues to be a performance bottleneck. Compilers insert prefetch operations and reorder data accesses to improve locality, but increasingly seek to modify an application's data layout ...
Fast, automatic, procedure-level performance tuning
Zhelong Pan, Rudolf Eigenmann
Pages: 173-181
doi>10.1145/1152154.1152182

This paper presents an automated performance tuning solution, which partitions a program into a number of tuning sections and finds the best combination of compiler options for each section. Our solution builds on prior work on feedback-driven ...
SESSION: Instruction fetch and control flow
Reducing control overhead in dataflow architectures
Andrew Petersen, Andrew Putnam, Martha Mercaldi, Andrew Schwerin, Susan Eggers, Steve Swanson, Mark Oskin
Pages: 182-191
doi>10.1145/1152154.1152184

In recent years, computer architects have proposed tiled architectures in response to several emerging problems in processor design, such as design complexity, wire delay, and fabrication reliability. One of these architectures, WaveScalar, uses a dynamic, ...
Power-efficient instruction delivery through trace reuse
Chengmo Yang, Alex Orailoglu
Pages: 192-201
doi>10.1145/1152154.1152185

As power dissipation inexorably becomes the major bottleneck in system integration and reliability, the front-end instruction delivery path in a traditional out-of-order superscalar processor needs to deliver high application performance in an energy-effective ...
Branch predictor guided instruction decoding
Oliverio J. Santana, Ayose Falcón, Alex Ramirez, Mateo Valero
Pages: 202-211
doi>10.1145/1152154.1152186

Fast instruction decoding is a challenge for the design of CISC microprocessors. A well-known solution to overcome this problem is using a trace cache. It stores and fetches already decoded instructions, avoiding the need for decoding them again. However, ...
SESSION: Application-specific optimizations
Two-level mapping based cache index selection for packet forwarding engines
Kaushik Rajan, R. Govindarajan
Pages: 212-221
doi>10.1145/1152154.1152188

Packet forwarding is a memory-intensive application requiring multiple accesses through a trie structure. The efficiency of a cache for this application critically depends on the placement function to reduce conflict misses. Traditional placement functions ...
Program generation for the all-pairs shortest path problem
Sung-Chul Han, Franz Franchetti, Markus Püschel
Pages: 222-232
doi>10.1145/1152154.1152189

A recent trend in computing are domain-specific program generators, designed to alleviate the effort of porting and reoptimizing libraries for fast-changing and increasingly complex computing platforms. Examples include ATLAS, SPIRAL, and the codelet ...
Combining analytical and empirical approaches in tuning matrix transposition
Qingda Lu, Sriram Krishnamoorthy, P. Sadayappan
Pages: 233-242
doi>10.1145/1152154.1152190

Matrix transposition is an important kernel used in many applications. Even though its optimization has been the subject of many studies, an optimization procedure that targets the characteristics of current processor architectures has not been developed. ...
Processor architecture: too much parallelism?
David B. Kirk
Pages: 243-243
doi>10.1145/1152154.1152158

CPUs and GPUs have evolved considerably in the past few years, and the pace of change and evolution in processor architecture is likely to increase. Constraints of excess heat dissipation and power consumption have forced a radical rethinking of microprocessor ...
SESSION: Out-of-order microarchitecture
Adaptive reorder buffers for SMT processors
Joseph Sharkey, Deniz Balkan, Dmitry Ponomarev
Pages: 244-253
doi>10.1145/1152154.1152192

In SMT processors, the complex interplay between private and shared datapath resources needs to be considered in order to realize the full performance potential. In this paper, we show that blindly increasing the size of the per-thread reorder buffers ...
SEED: scalable, efficient enforcement of dependences
Francisco J. Mesa-Martínez, Michael C. Huang, Jose Renau
Pages: 254-264
doi>10.1145/1152154.1152193

Instruction issue logic is a critical component in modern high-performance out-of-order processors. The ever increasing latencies found in modern processors, mostly associated with memory accesses and longer pipelines, can be attenuated using large issue ...
SPARTAN: speculative avoidance of register allocations to transient values for performance and energy efficiency
Deniz Balkan, Joseph Sharkey, Dmitry Ponomarev, Kanad Ghose
Pages: 265-274
doi>10.1145/1152154.1152194

High-performance microprocessors use large, heavily-ported physical register files (RFs) to increase the instruction throughput. The high complexity and power dissipation of such RFs mainly stem from the need to maintain each and every result for a large ...
SESSION: Dependences and register allocation
Overlapping dependent loads with addressless preload
Zhen Yang, Xudong Shi, Feiqi Su, Jih-Kwon Peir
Pages: 275-284
doi>10.1145/1152154.1152196

Modern out-of-order processors with non-blocking caches exploit Memory-Level Parallelism (MLP) by overlapping cache misses in a wide instruction window. The exploitation of MLP, however, can be limited due to long-latency operations in producing the ...
Prematerialization: reducing register pressure for free
Ivan D. Baev, Richard E. Hank, David H. Gross
Pages: 285-294
doi>10.1145/1152154.1152197

Modern compiler transformations that eliminate redundant computations or reorder instructions, such as partial redundancy elimination and instruction scheduling, are very effective in improving application performance but tend to create longer and potentially ...
An empirical evaluation of chains of recurrences for array dependence testing
J. Birch, R.A. van Engelen, K.A. Gallivan, Y. Shou
Pages: 295-304
doi>10.1145/1152154.1152198

Code restructuring compilers rely heavily on program analysis techniques to automatically detect data dependences between program statements. Dependences between statement instances in the iteration space of a loop nest impose ordering constraints that ...
