Saving the Planet with Systems Research: Conference Keynote
Luiz André Barroso
The computing industry has become so successful that almost every economic sector benefits from its advances, while also being impacted by its costs. The energy footprint of computers is one such cost. The dramatic growth of computer deployments, both in personal devices and servers, has made the energy usage of computers an increasing percentage of the world's electricity budget. In data center installations, energy-related costs are rising as a fraction of total expenses. Either of these trends left unchecked has the potential to hurt our industry, our economy or our planet. In this talk I will discuss these trends, focusing on large data centers, and argue that systems researchers have a pivotal role to play in solving these problems.

Luiz André Barroso is a Distinguished Engineer at Google, where he has worked across several engineering areas, ranging from applications and software infrastructure to the design of Google's computing platform. Prior to working at Google, he was a member of the research staff at Compaq and Digital Equipment Corporation, where his group did some of the pioneering work on computer architectures for commercial workloads. That work included the design of Piranha, a system based on aggressive chip multiprocessing, which helped inspire some of the multi-core CPUs that are now in the mainstream. Luiz has a Ph.D. degree in computer engineering from the University of Southern California and B.S. and M.S. degrees in electrical engineering from the Pontifícia Universidade Católica, Rio de Janeiro.

SESSION: Lessons learned and looking ahead

An evaluation of the TRIPS computer system
Mark Gebhart,
Bertrand A. Maher,
Katherine E. Coons,
Jeff Diamond,
Paul Gratz,
Mario Marino,
Nitya Ranganathan,
Behnam Robatmili,
Aaron Smith,
James Burrill,
Stephen W. Keckler,
Doug Burger,
Kathryn S. McKinley
Pages: 1-12
doi: 10.1145/1508244.1508246
The TRIPS system employs a new instruction set architecture (ISA) called Explicit Data Graph Execution (EDGE) that renegotiates the boundary between hardware and software to expose and exploit concurrency. EDGE ISAs use a block-atomic execution model in which blocks are composed of dataflow instructions. The goal of the TRIPS design is to mine concurrency for high performance while tolerating emerging technology scaling challenges, such as increasing wire delays and power consumption. This paper evaluates how well TRIPS meets this goal through a detailed ISA and performance analysis. We compare performance, using cycle counts, to commercial processors. On SPEC CPU2000, the Intel Core 2 outperforms compiled TRIPS code in most cases, although TRIPS matches a Pentium 4. On simple benchmarks, compiled TRIPS code outperforms the Core 2 by 10% and hand-optimized TRIPS code outperforms it by a factor of 3. Compared to conventional ISAs, the block-atomic model provides a larger instruction window, increases concurrency at a cost of more instructions executed, and replaces register and memory accesses with more efficient direct instruction-to-instruction communication. Our analysis suggests ISA, microarchitecture, and compiler enhancements for addressing weaknesses in TRIPS and indicates that EDGE architectures have the potential to exploit greater concurrency in future technologies.
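
To make the execution model concrete, here is a minimal sketch (the encoding and operations are mine for illustration, not the TRIPS ISA): a block of dataflow instructions in which each producer names its consumers, so values travel instruction-to-instruction instead of through a register file, and the block's results commit only when the whole block finishes.

```python
# Minimal sketch of block-atomic dataflow execution (assumptions mine).
from dataclasses import dataclass, field

@dataclass
class Inst:
    op: str                                       # "const", "add", "mul"
    value: int = 0                                # immediate for "const"
    targets: list = field(default_factory=list)   # indices of consumer insts
    operands: list = field(default_factory=list)  # operands received so far

def execute_block(block):
    """Fire instructions dataflow-style; results become visible only at the
    end, modeling block-atomic (all-or-nothing) commit."""
    ready = [i for i, ins in enumerate(block) if ins.op == "const"]
    results = {}
    while ready:
        i = ready.pop()
        ins = block[i]
        if ins.op == "const":
            out = ins.value
        elif ins.op == "add":
            out = ins.operands[0] + ins.operands[1]
        elif ins.op == "mul":
            out = ins.operands[0] * ins.operands[1]
        results[i] = out
        for t in ins.targets:                 # direct producer->consumer send
            block[t].operands.append(out)
            if len(block[t].operands) == 2:   # binary ops fire when full
                ready.append(t)
    return results                            # commit point for the block

# (2 + 3) * 4 expressed as one dataflow block:
block = [Inst("const", 2, [2]), Inst("const", 3, [2]),
         Inst("add", targets=[4]), Inst("const", 4, [4]), Inst("mul")]
print(execute_block(block)[4])  # -> 20
```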

Architectural implications of nanoscale integrated sensing and computing
Constantin Pistol,
Wutichai Chongchitmate,
Christopher Dwyer,
Alvin R. Lebeck
Pages: 13-24
doi: 10.1145/1508244.1508247
This paper explores the architectural implications of integrating computation and molecular probes to form nanoscale sensor processors (nSP). We show how nSPs may enable new computing domains and automate tasks that currently require expert scientific training and costly equipment. This new application domain severely constrains nSP size, which significantly impacts the architectural design space. In this context, we explore nSP architectures and present an nSP design that includes a simple accumulator-based ISA, sensors, limited memory and communication transceivers. To reduce the application memory footprint, we introduce the concept of instruction-fused sensing. We use simulation and analytical models to evaluate nSP designs executing a representative set of target applications. Furthermore, we propose a candidate nSP technology based on optical Resonance Energy Transfer (RET) logic that enables the small size required by the application domain; our smallest design is about the size of the largest known virus. We also show laboratory results that demonstrate initial steps towards a prototype.

SESSION: Reliable systems I

CTrigger: exposing atomicity violation bugs from their hiding places
Soyeon Park,
Shan Lu,
Yuanyuan Zhou
Pages: 25-36
doi: 10.1145/1508244.1508249
Multicore hardware is making concurrent programs pervasive. Unfortunately, concurrent programs are prone to bugs. Among different types of concurrency bugs, atomicity violation bugs are common and important. Existing techniques to detect atomicity violation bugs suffer from one limitation: requiring bugs to manifest during monitored runs, which is an open problem in concurrent program testing. This paper makes two contributions. First, it studies the interleaving characteristics of the common practice in concurrent program testing (i.e., running a program over and over) to understand why atomicity violation bugs are hard to expose. Second, it proposes CTrigger to effectively and efficiently expose atomicity violation bugs in large programs. CTrigger focuses on a special type of interleavings (i.e., unserializable interleavings) that are inherently correlated to atomicity violation bugs, and uses trace analysis to systematically identify (likely) feasible unserializable interleavings with low occurrence probability. CTrigger then uses minimum execution perturbation to exercise low-probability interleavings and expose difficult-to-catch atomicity violations. We evaluate CTrigger with real-world atomicity violation bugs from four server/desktop applications (Apache, MySQL, Mozilla, and PBZIP2) and three SPLASH-2 applications on 8-core machines. CTrigger efficiently exposes the tested bugs within 1-235 seconds, two to four orders of magnitude faster than stress testing. Without CTrigger, some of these bugs do not manifest even after 7 full days of stress testing. In addition, without deterministic replay support, once a bug is exposed, CTrigger can help programmers reliably reproduce it for diagnosis. Our tested bugs are reproduced by CTrigger mostly within 5 seconds, 300 to over 60000 times faster than stress testing.
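
For readers unfamiliar with the term, here is a minimal sketch of what makes an interleaving unserializable (a deliberate simplification of the paper's trace analysis; the pattern table is the standard single-variable serializability classification, not CTrigger's implementation):

```python
# Minimal sketch: flag a remote access that slips between two local accesses
# to the same variable such that no serial order could produce the result.
UNSERIALIZABLE = {("R", "W", "R"),  # two local reads see different values
                  ("W", "W", "R"),  # local read sees the remote write
                  ("W", "R", "W"),  # remote read sees an intermediate value
                  ("R", "W", "W")}  # local write may rely on a stale read

def find_unserializable(trace):
    """trace: list of (thread, op, var); returns suspicious index triples."""
    bugs = []
    for i, (t1, op1, v) in enumerate(trace):
        for j in range(i + 1, len(trace)):
            t2, op2, v2 = trace[j]
            if v2 != v or t2 == t1:
                continue
            # First remote access to v after i; find the next local access.
            for k in range(j + 1, len(trace)):
                t3, op3, v3 = trace[k]
                if v3 == v and t3 == t1:
                    if (op1, op2, op3) in UNSERIALIZABLE:
                        bugs.append((i, j, k))
                    break
            break
    return bugs

# Thread 1 checks a pointer, thread 2 nulls it, thread 1 dereferences it:
trace = [(1, "R", "p"), (2, "W", "p"), (1, "R", "p")]
print(find_unserializable(trace))  # -> [(0, 1, 2)]
```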

ASSURE: automatic software self-healing using rescue points
Stelios Sidiroglou,
Oren Laadan,
Carlos Perez,
Nicolas Viennot,
Jason Nieh,
Angelos D. Keromytis
Pages: 37-48
doi: 10.1145/1508244.1508250
Software failures in server applications are a significant problem for preserving system availability. We present ASSURE, a system that introduces rescue points that recover software from unknown faults while maintaining both system integrity and availability, by mimicking system behavior under known error conditions. Rescue points are locations in existing application code for handling a given set of programmer-anticipated failures, which are automatically repurposed and tested for safely enabling fault recovery from a larger class of (unanticipated) faults. When a fault occurs at an arbitrary location in the program, ASSURE restores execution to an appropriate rescue point and induces the program to recover execution by virtualizing the program's existing error-handling facilities. Rescue points are identified using fuzzing, implemented using a fast coordinated checkpoint-restart mechanism that handles multi-process and multi-threaded applications, and, after testing, are injected into production code using binary patching. We have implemented an ASSURE Linux prototype that operates without application source code and without base operating system kernel changes. Our experimental results on a set of real-world server applications and bugs show that ASSURE enabled recovery for all of the bugs tested with fast recovery times, has modest performance overhead, and provides automatic self-healing orders of magnitude faster than current human-driven patch deployment methods.

Recovery domains: an organizing principle for recoverable operating systems
Andrew Lenharth,
Vikram S. Adve,
Samuel T. King
Pages: 49-60
doi: 10.1145/1508244.1508251
We describe a strategy for enabling existing commodity operating systems to recover from unexpected run-time errors in nearly any part of the kernel, including core kernel components. Our approach is dynamic and request-oriented; it isolates the effects of a fault to the requests that caused the fault rather than to static kernel components. This approach is based on a notion of "recovery domains," an organizing principle to enable rollback of state affected by a request in a multithreaded system with minimal impact on other requests or threads. We have applied this approach to v2.4.22 and v2.6.27 of the Linux kernel, where it required 132 lines of changed or new code; the other changes are all performed by a simple instrumentation pass of a compiler. Our experiments show that the approach is able to recover from otherwise fatal faults with minimal collateral impact during a recovery event.

Anomaly-based bug prediction, isolation, and validation: an automated approach for software debugging
Martin Dimitrov,
Huiyang Zhou
Pages: 61-72
doi: 10.1145/1508244.1508252
Software defects, commonly known as bugs, present a serious challenge for system reliability and dependability. Once a program failure is observed, the debugging activities to locate the defects are typically nontrivial and time consuming. In this paper, we propose a novel automated approach to pinpoint the root causes of software failures. Our proposed approach consists of three steps. The first step is bug prediction, which leverages the existing work on anomaly-based bug detection, as exceptional behavior during program execution has been shown to frequently point to the root cause of a software failure. The second step is bug isolation, which eliminates false-positive bug predictions by checking whether the dynamic forward slices of bug predictions lead to the observed program failure. The last step is bug validation, in which the isolated anomalies are validated by dynamically nullifying their effects and observing if the program still fails. The whole bug prediction, isolation and validation process is fully automated and can be implemented with efficient architectural support. Our experiments with 6 programs and 7 bugs, including a real bug in the gcc 2.95.2 compiler, show that our approach is highly effective at isolating only the relevant anomalies. Compared to state-of-the-art debugging techniques, our proposed approach pinpoints the defect locations more accurately and presents the user with a much smaller code set to analyze.

SESSION: Deterministic multiprocessing

Capo: a software-hardware interface for practical deterministic multiprocessor replay
Pablo Montesinos,
Matthew Hicks,
Samuel T. King,
Josep Torrellas
Pages: 73-84
doi: 10.1145/1508244.1508254
While deterministic replay of parallel programs is a powerful technique, current proposals have shortcomings. Specifically, software-based replay systems have high overheads on multiprocessors, while hardware-based proposals focus only on basic hardware-level mechanisms, ignoring the overall replay system. To be practical, hardware-based replay systems need to support an environment with multiple parallel jobs running concurrently -- some being recorded, others being replayed, and still others running without recording or replay. Moreover, they need to manage limited-size log buffers. This paper addresses these shortcomings by introducing, for the first time, a set of abstractions and a software-hardware interface for practical hardware-assisted replay of multiprocessor systems. The approach, called Capo, introduces the novel abstraction of the Replay Sphere to separate the responsibilities of the hardware and software components of the replay system. In this paper, we also design and build CapoOne, a prototype of a deterministic multiprocessor replay system that implements Capo using Linux and simulated DeLorean hardware. Our evaluation of 4-processor executions shows that CapoOne largely records with the efficiency of hardware-based schemes and the flexibility of software-based schemes.

DMP: deterministic shared memory multiprocessing
Joseph Devietti,
Brandon Lucia,
Luis Ceze,
Mark Oskin
Pages: 85-96
doi: 10.1145/1508244.1508255
Current shared memory multicore and multiprocessor systems are nondeterministic. Each time these systems execute a multithreaded application, even if supplied with the same input, they can produce a different output. This frustrates debugging and limits the ability to properly test multithreaded code, becoming a major stumbling block to the much-needed widespread adoption of parallel programming. In this paper we make the case for fully deterministic shared memory multiprocessing (DMP). The behavior of an arbitrary multithreaded program on a DMP system is only a function of its inputs. The core idea is to make inter-thread communication fully deterministic. Previous approaches to coping with nondeterminism in multithreaded programs have focused on replay, a technique useful only for debugging. In contrast, while DMP systems are directly useful for debugging by offering repeatability by default, we argue that parallel programs should execute deterministically in the field as well. This has the potential to make testing more assuring and increase the reliability of deployed multithreaded software. We propose a range of approaches to enforcing determinism and discuss their implementation trade-offs. We show that determinism can be provided with little performance cost using our architecture proposals on future hardware, and that software-only approaches can be utilized on existing systems.
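
A minimal sketch of the simplest point in that design space (assumptions mine): serialize execution into fixed quanta taken in a fixed thread order, so the interleaving, and hence all inter-thread communication, depends only on the program and its input.

```python
# Minimal sketch of quantum-based deterministic serialization.
def dmp_serial(threads, quantum):
    """threads: list of iterators, each yielding one 'instruction' at a time.
    Returns the executed interleaving, identical on every run."""
    log, live = [], list(range(len(threads)))
    while live:
        for tid in list(live):            # deterministic round-robin order
            for _ in range(quantum):      # each thread runs a fixed quantum
                try:
                    log.append((tid, next(threads[tid])))
                except StopIteration:
                    live.remove(tid)      # thread finished; drop it
                    break
    return log

t0 = iter(["a1", "a2", "a3"])
t1 = iter(["b1", "b2"])
print(dmp_serial([t0, t1], quantum=2))
# -> [(0,'a1'), (0,'a2'), (1,'b1'), (1,'b2'), (0,'a3')]
```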

Kendo: efficient deterministic multithreading in software
Marek Olszewski,
Jason Ansel,
Saman Amarasinghe
Pages: 97-108
doi: 10.1145/1508244.1508256
Although chip-multiprocessors have become the industry standard, developing parallel applications that target them remains a daunting task. Non-determinism, inherent in threaded applications, causes significant challenges for parallel programmers by hindering their ability to create parallel applications with repeatable results. As a consequence, parallel applications are significantly harder to debug, test, and maintain than sequential programs. This paper introduces Kendo: a new software-only system that provides deterministic multithreading of parallel applications. Kendo enforces a deterministic interleaving of lock acquisitions and specially declared non-protected reads through a novel dynamically load-balanced deterministic scheduling algorithm. The algorithm tracks the progress of each thread using performance counters to construct a deterministic logical time that is used to compute an interleaving of shared data accesses that is both deterministic and provides good load balancing. Kendo can run on today's commodity hardware while incurring only a modest performance cost. Experimental results on the SPLASH-2 applications yield a geometric mean overhead of only 16% when running on 4 processors. This low overhead makes it possible to benefit from Kendo even after an application is deployed. Programmers can start using Kendo today to program parallel applications that are easier to develop, debug, and test.
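
A minimal sketch of the scheduling idea (assumptions mine; real Kendo pauses each requester until its logical clock is the global minimum, whereas this just computes the resulting grant order):

```python
# Minimal sketch: grant a contended lock in (clock, thread-id) order, so the
# acquisition order is a function of the deterministic logical clocks (e.g.,
# retired stores counted by a performance counter), not of physical timing.
def acquisition_order(clock_at_request):
    """clock_at_request[tid] = logical clock when thread tid requests."""
    return [tid for _, tid in
            sorted((c, t) for t, c in enumerate(clock_at_request))]

print(acquisition_order([5, 3, 3]))  # -> [1, 2, 0]: ties broken by thread id
```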

SESSION: Prediction and accounting

Complete information flow tracking from the gates up
Mohit Tiwari,
Hassan M.G. Wassel,
Bita Mazloom,
Shashidhar Mysore,
Frederic T. Chong,
Timothy Sherwood
Pages: 109-120
doi: 10.1145/1508244.1508258
For many mission-critical tasks, tight guarantees on the flow of information are desirable, for example, when handling important cryptographic keys or sensitive financial data. We present a novel architecture capable of tracking all information flow within the machine, including all explicit data transfers and all implicit flows (those subtly devious flows caused by not performing conditional operations). While the problem is impossible to solve in the general case, we have created a machine that avoids the general-purpose programmability that leads to this impossibility result, yet is still programmable enough to handle a variety of critical operations such as public-key encryption and authentication. Through the application of our novel gate-level information flow tracking method, we show how all flows of information can be precisely tracked. From this foundation, we then describe how a class of architectures can be constructed, from the gates up, to completely capture all information flows and we measure the impact of doing so on the hardware implementation, the ISA, and the programmer.
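
A minimal sketch of gate-level taint tracking for a single AND gate (the standard shadow-logic construction; the paper's actual hardware design is considerably richer):

```python
# Minimal sketch: each wire carries a taint bit alongside its data bit. For
# AND, a tainted input taints the output only when it can actually influence
# the output value, which is what makes gate-level tracking precise.
def and_glift(a, ta, b, tb):
    out = a & b
    tout = (ta & tb) | (ta & b) | (tb & a)   # shadow logic for AND
    return out, tout

# An untainted 0 forces the output to 0, so no secret information escapes:
print(and_glift(0, 0, 1, 1))  # -> (0, 0)
# An untainted 1 passes the secret input through, so the output is tainted:
print(and_glift(1, 0, 1, 1))  # -> (1, 1)
```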

RapidMRC: approximating L2 miss rate curves on commodity systems for online optimizations
David K. Tam,
Reza Azimi,
Livio B. Soares,
Michael Stumm
Pages: 121-132
doi: 10.1145/1508244.1508259
Miss rate curves (MRCs) are useful in a number of contexts. In our research, online L2 cache MRCs enable us to dynamically identify optimal cache sizes when cache-partitioning a shared-cache multicore processor. Obtaining L2 MRCs has generally been assumed to be expensive when done in software and consequently, their usage for online optimizations has been limited. To address these problems and opportunities, we have developed a low-overhead software technique to obtain L2 MRCs online on current processors, exploiting features available in their performance monitoring units so that no changes to the application source code or binaries are required. Our technique, called RapidMRC, requires a single probing period of roughly 221 million processor cycles (147 ms), and subsequently 124 million cycles (83 ms) to process the data. We demonstrate its accuracy by comparing the obtained MRCs to the actual L2 MRCs of 30 applications taken from SPECcpu2006, SPECcpu2000, and SPECjbb2000. We show that RapidMRC can be applied to sizing cache partitions, helping to achieve performance improvements of up to 27%.
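
For context, here is a minimal sketch of the classic single-pass stack-distance construction of an MRC; RapidMRC approximates this kind of computation online from sampled PMU address traces rather than from a full trace:

```python
# Minimal sketch: one pass over an access trace yields the miss rate for
# every LRU cache size at once, via reuse (stack) distances.
from collections import Counter

def miss_rate_curve(trace, max_lines):
    stack, hist, misses = [], Counter(), 0
    for addr in trace:
        if addr in stack:
            depth = stack.index(addr)      # reuse distance (0 = MRU)
            hist[depth] += 1
            stack.pop(depth)
        else:
            misses += 1                    # cold miss at every cache size
        stack.insert(0, addr)              # move to MRU position
    total, hits, mrc = len(trace), 0, []
    for size in range(1, max_lines + 1):
        hits += hist[size - 1]             # hits with reuse distance < size
        mrc.append((total - hits) / total)
    return mrc

trace = [1, 2, 3, 1, 2, 3, 4, 1]
print(miss_rate_curve(trace, 4))  # -> [1.0, 1.0, 0.625, 0.5]
```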

Per-thread cycle accounting in SMT processors
Stijn Eyerman,
Lieven Eeckhout
Pages: 133-144
doi: 10.1145/1508244.1508260
This paper proposes a cycle accounting architecture for Simultaneous Multithreading (SMT) processors that estimates the execution times for each of the threads had they been executed alone, while they are running simultaneously on the SMT processor. This is done by accounting each cycle to a base, miss-event, or waiting-cycle component during multi-threaded execution. Single-threaded alone execution time is then estimated as the sum of the base and miss-event components; the waiting-cycle component represents the lost cycle count due to SMT execution. The cycle accounting architecture incurs reasonable hardware cost (around 1KB of storage) and estimates single-threaded performance with average prediction errors around 7.2% for two-program workloads and 11.7% for four-program workloads. The cycle accounting architecture has several important applications to system software and its interaction with SMT hardware. First, the estimated single-threaded alone execution time gives system software an accurate picture of the processor cycles actually consumed per thread. Using the alone execution time instead of the total execution time (timeslice) may make system software scheduling policies more effective. Second, a new class of thread-progress-aware SMT fetch policies based on per-thread progress indicators enables system software level priorities to be enforced at the hardware level.
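
A minimal sketch of the accounting identity (numbers invented for illustration):

```python
# Minimal sketch: every SMT cycle a thread experiences is charged to a base,
# miss-event, or waiting component; alone time is estimated as base + miss,
# and the waiting component is the penalty attributable to SMT sharing.
def alone_estimate(base, miss, waiting):
    smt = base + miss + waiting      # cycles the thread spent under SMT
    alone = base + miss              # estimated single-threaded cycles
    return alone, smt / alone        # (alone time, SMT slowdown)

alone, slowdown = alone_estimate(base=60e6, miss=15e6, waiting=25e6)
print(f"alone: {alone:.0f} cycles, slowdown: {slowdown:.2f}x")
```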

SESSION: Transactional memories

Maximum benefit from a minimal HTM
Owen S. Hofmann,
Christopher J. Rossbach,
Emmett Witchel
Pages: 145-156
doi: 10.1145/1508244.1508262
A minimal, bounded hardware transactional memory implementation significantly improves synchronization performance when used in an operating system kernel. We add HTM to Linux 2.4, a kernel with a simple, coarse-grained synchronization structure. The transactional Linux 2.4 kernel can improve the performance of user programs by as much as 40% over the non-transactional 2.4 kernel. It closes 68% of the performance gap with the Linux 2.6 kernel, which has had significant engineering effort applied to improve scalability. We then extend our minimal HTM to a fast, unbounded transactional memory with a novel technique for coordinating hardware transactions and software synchronization. Overflowed transactions run in software, with only a minimal coupling between hardware and software systems. There is no performance penalty for overflow rates of less than 1%. In one instance, at 16 processors and an overflow rate of 4%, performance degrades from an ideal 4.3x to 3.6x.

Early experience with a commercial hardware transactional memory implementation
Dave Dice,
Yossi Lev,
Mark Moir,
Daniel Nussbaum
Pages: 157-168
doi: 10.1145/1508244.1508263
We report on our experience with the hardware transactional memory (HTM) feature of two pre-production revisions of a new commercial multicore processor. Our experience includes a number of promising results using HTM to improve performance in a variety of contexts, and also identifies some ways in which the feature could be improved to make it even better. We give detailed accounts of our experiences, sharing techniques we used to achieve the results we have, as well as describing challenges we faced in doing so.

SESSION: Reliable systems II

Mixed-mode multicore reliability
Philip M. Wells,
Koushik Chakraborty,
Gurindar S. Sohi
Pages: 169-180
doi: 10.1145/1508244.1508265
Future processors are expected to observe increasing rates of hardware faults. Using Dual-Modular Redundancy (DMR), two cores of a multicore can be loosely coupled to redundantly execute a single software thread, providing very high coverage from many different sources of faults. This reliability, however, comes at a high price in terms of per-thread IPC and overall system throughput. We make the observation that a user may want to run both applications requiring high reliability, such as financial software, and more fault-tolerant applications requiring high performance, such as media or web software, on the same machine at the same time. Yet a traditional DMR system must fully operate in redundant mode whenever any application requires high reliability. This paper proposes a Mixed-Mode Multicore (MMM), which enables most applications, including the system software, to run with high reliability in DMR mode, while applications that need high performance can avoid the penalty of DMR. Though conceptually simple, two key challenges arise: 1) care must be taken to protect reliable applications from any faults occurring to applications running in high-performance mode, and 2) the desire to execute additional independent software threads for a performance application complicates the scheduling of computation to cores. After solving these issues, an MMM is shown to improve overall system performance, compared to a traditional DMR system, by approximately 2X when one reliable and one performance application are concurrently executing.

ISOLATOR: dynamically ensuring isolation in concurrent programs
Sriram Rajamani,
G. Ramalingam,
Venkatesh Prasad Ranganath,
Kapil Vaswani
Pages: 181-192
doi: 10.1145/1508244.1508266
In this paper, we focus on concurrent programs that use locks to achieve isolation of data accessed by critical sections of code. We present ISOLATOR, an algorithm that guarantees isolation for well-behaved threads of a program that obey a locking discipline even in the presence of ill-behaved threads that disobey the locking discipline. ISOLATOR uses code instrumentation, data replication, and virtual memory protection to detect isolation violations and delays ill-behaved threads to ensure isolation. Our instrumentation scheme requires access only to the code of well-behaved threads. We have evaluated ISOLATOR on several benchmark programs and found that ISOLATOR can ensure isolation with reasonable runtime overheads. In addition, we present three general desiderata - safety, isolation, and permissiveness - for any scheme that attempts to ensure isolation, and formally prove that ISOLATOR satisfies all of these desiderata.

Efficient online validation with delta execution
Joseph Tucek,
Weiwei Xiong,
Yuanyuan Zhou
Pages: 193-204
doi: 10.1145/1508244.1508267
Software systems are constantly changing. Patches to fix bugs and patches to add features are all too common. Every change risks breaking a previously working system. Hence administrators loathe change, and are willing to delay even critical security patches until after fully validating their correctness. Compared to off-line validation, on-line validation has clear advantages since it tests against real-life workloads. Yet unfortunately it imposes restrictive overheads as it requires running the old and new versions side-by-side. Moreover, due to spurious differences (e.g. event timing, random number generation, and thread interleavings), it is difficult to compare the two for validation. To allow more effective on-line patch validation, we propose a new mechanism, called delta execution, that is based on the observation that most patches are small. Delta execution merges the two side-by-side executions for most of the time and splits only when necessary, such as when they access different data or execute different code. This allows us to perform on-line validation not only with lower overhead but also with greatly reduced spurious differences, allowing us to effectively validate changes. We first validate the feasibility of our idea by studying the characteristics of 240 patches from 4 server programs; our examination shows that 77% of the changes should not be expected to cause large changes and are thereby feasible for delta execution. We then implemented delta execution using dynamic instrumentation. Using real-world patches from 7 server applications and 3 other programs, we compared our implementation of delta execution against a traditional side-by-side on-line validation. Delta execution outperformed traditional validation by up to 128%; further, for 3 of the changes, spurious differences caused the traditional validation to fail completely while delta execution succeeded. This demonstrates that delta execution can allow administrators to use on-line validation to confidently ensure the correctness of the changes they apply.

SESSION: Power and storage in enterprise systems

PowerNap: eliminating server idle power
David Meisner,
Brian T. Gold,
Thomas F. Wenisch
Pages: 205-216
doi: 10.1145/1508244.1508269
Data center power consumption is growing to unprecedented levels: the EPA estimates U.S. data centers will consume 100 billion kilowatt hours annually by 2011. Much of this energy is wasted in idle systems: in typical deployments, server utilization is below 30%, but idle servers still consume 60% of their peak power draw. Typical idle periods, though frequent, last seconds or less, confounding simple energy-conservation approaches. In this paper, we propose PowerNap, an energy-conservation approach where the entire system transitions rapidly between a high-performance active state and a near-zero-power idle state in response to instantaneous load. Rather than requiring fine-grained power-performance states and complex load-proportional operation from each system component, PowerNap instead calls for minimizing idle power and transition time, which are simpler optimization goals. Based on the PowerNap concept, we develop requirements and outline mechanisms to eliminate idle power waste in enterprise blade servers. Because PowerNap operates in low-efficiency regions of current blade center power supplies, we introduce the Redundant Array for Inexpensive Load Sharing (RAILS), a power provisioning approach that provides high conversion efficiency across the entire range of PowerNap's power demands. Using utilization traces collected from enterprise-scale commercial deployments, we demonstrate that, together, PowerNap and RAILS reduce average server power consumption by 74%.
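
A minimal sketch of the underlying arithmetic (illustrative numbers chosen to match the utilization and idle-power figures quoted above, not the paper's measurements):

```python
# Minimal sketch: why minimizing idle power beats fine-grained scaling
# when utilization is low, ignoring transition costs for simplicity.
def avg_power(peak_w, idle_w, utilization):
    # Conventional server: high power draw even when idle.
    return utilization * peak_w + (1 - utilization) * idle_w

def avg_power_nap(peak_w, nap_w, utilization):
    # PowerNap-style server: near-zero nap power, fast transitions assumed.
    return utilization * peak_w + (1 - utilization) * nap_w

u = 0.3                                   # typical utilization
print(avg_power(450, 270, u))             # idle at 60% of peak -> 324.0 W
print(avg_power_nap(450, 10, u))          # napping when idle   -> 142.0 W
```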

Gordon: using flash memory to build fast, power-efficient clusters for data-intensive applications
Adrian M. Caulfield,
Laura M. Grupp,
Steven Swanson
Pages: 217-228
doi: 10.1145/1508244.1508270
As our society becomes more information-driven, we have begun to amass data at an astounding and accelerating rate. At the same time, power concerns have made it difficult to bring the necessary processing power to bear on querying, processing, and understanding this data. We describe Gordon, a system architecture for data-centric applications that combines low-power processors, flash memory, and data-centric programming systems to improve performance for data-centric applications while reducing power consumption. The paper presents an exhaustive analysis of the design space of Gordon systems, focusing on the trade-offs between power, energy, and performance that Gordon must make. It analyzes the impact of flash storage and the Gordon architecture on the performance and power efficiency of data-centric applications. It also describes a novel flash translation layer tailored to data-intensive workloads and large flash storage arrays. Our data show that, using technologies available in the near future, Gordon systems can out-perform disk-based clusters by 1.5× and deliver up to 2.5× more performance per Watt.

DFTL: a flash translation layer employing demand-based selective caching of page-level address mappings
Aayush Gupta,
Youngjae Kim,
Bhuvan Urgaonkar
Pages: 229-240
doi: 10.1145/1508244.1508271
Recent technological advances in the development of flash-memory based devices have consolidated their leadership position as the preferred storage media in the embedded systems market and opened new vistas for deployment in enterprise-scale storage systems. Unlike hard disks, flash devices are free from any mechanical moving parts, have no seek or rotational delays and consume lower power. However, the internal idiosyncrasies of flash technology make its performance highly dependent on workload characteristics. The poor performance of random writes has been a cause of major concern, which needs to be addressed to better utilize the potential of flash in enterprise-scale environments. We examine one of the important causes of this poor performance: the design of the Flash Translation Layer (FTL), which performs the virtual-to-physical address translations and hides the erase-before-write characteristics of flash. We propose a complete paradigm shift in the design of the core FTL engine from the existing techniques with our Demand-based Flash Translation Layer (DFTL), which selectively caches page-level address mappings. We develop a flash simulation framework called FlashSim. Our experimental evaluation with realistic enterprise-scale workloads endorses the utility of DFTL in enterprise-scale storage systems by demonstrating: (i) improved performance, (ii) reduced garbage collection overhead and (iii) better overload behavior compared to state-of-the-art FTL schemes. For example, a predominantly random-write I/O trace from an OLTP application running at a large financial institution shows a 78% improvement in average response time (due to a 3-fold reduction in operations of the garbage collector), compared to a state-of-the-art FTL scheme. Even for the well-known read-dominant TPC-H benchmark, for which DFTL introduces additional overheads, we improve system response time by 56%.
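
A minimal sketch of the core DFTL idea (structure and names mine): keep the full page-level map on flash and cache only hot translations in SRAM with LRU replacement, paying an extra flash read on a mapping miss.

```python
# Minimal sketch of a demand-based mapping cache for an FTL.
from collections import OrderedDict

class MappingCache:
    def __init__(self, capacity, full_map_on_flash):
        self.cache = OrderedDict()        # cached translations (SRAM)
        self.capacity = capacity
        self.flash_map = full_map_on_flash
        self.flash_reads = 0              # extra cost of demand misses

    def translate(self, lpn):
        if lpn in self.cache:
            self.cache.move_to_end(lpn)   # LRU hit
        else:
            self.flash_reads += 1         # fetch map entry from flash
            self.cache[lpn] = self.flash_map[lpn]
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict LRU entry
        return self.cache[lpn]

full_map = {lpn: 1000 + lpn for lpn in range(100)}   # toy translation table
ftl = MappingCache(capacity=4, full_map_on_flash=full_map)
for lpn in [1, 2, 1, 3, 4, 5, 1]:
    ftl.translate(lpn)
print(ftl.flash_reads)  # -> 5: misses on 1, 2, 3, 4, 5; hits on repeats of 1
```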

SESSION: Potpourri

Commutativity analysis for software parallelization: letting program transformations see the big picture
Farhana Aleen,
Nathan Clark
Pages: 241-252
doi: 10.1145/1508244.1508273
Extracting performance from many-core architectures requires software engineers to create multi-threaded applications, which significantly complicates the already daunting task of software development. One solution to this problem is automatic compile-time parallelization, which can ease the burden on software developers in many situations. Clearly, automatic parallelization in its present form is not suitable for many application domains and new compiler analyses are needed to address its shortcomings. In this paper, we present one such analysis: a new approach for detecting commutative functions. Commutative functions are sections of code that can be executed in any order without affecting the outcome of the application, e.g., inserting elements into a set. Previous research on this topic had one significant limitation: it required the results of commutative functions to produce identical memory layouts. This prevented previous techniques from detecting functions like malloc, which may return different pointers depending on the order in which it is called, even though these differing results do not affect the overall output of the application. Our new commutativity analysis correctly identifies these situations to better facilitate automatic parallelization. We demonstrate that this analysis can automatically extract significant amounts of parallelism from many applications, and where it is ineffective it can provide software developers a useful list of functions that may be commutative given semantic program changes that are not automatable.
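
A minimal sketch of the relaxed criterion (assumptions mine): two operations commute if the observable output matches in either order, even when internal state, such as allocation order, differs.

```python
# Minimal sketch: output-based commutativity testing.
def commutes(make_state, f, g, observe):
    s1 = make_state(); f(s1); g(s1)       # run f then g
    s2 = make_state(); g(s2); f(s2)       # run g then f
    return observe(s1) == observe(s2)     # compare observable results only

# Inserting into a set: the internal list differs by order (like malloc
# returning different pointers), but the observable contents are equal.
make_state = lambda: []
f = lambda s: s.append(3)
g = lambda s: s.append(7)
print(commutes(make_state, f, g, observe=sorted))        # -> True
print(commutes(make_state, f, g, observe=lambda s: s))   # -> False: too strict
```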

Accelerating critical section execution with asymmetric multi-core architectures
M. Aater Suleman,
Onur Mutlu,
Moinuddin K. Qureshi,
Yale N. Patt
Pages: 253-264
doi: 10.1145/1508244.1508274
To improve the performance of a single application on Chip Multiprocessors (CMPs), the application must be split into threads which execute concurrently on multiple cores. In multi-threaded applications, critical sections are used to ensure that only one thread accesses shared data at any given time. Critical sections can serialize the execution of threads, which significantly reduces performance and scalability. This paper proposes Accelerated Critical Sections (ACS), a technique that leverages the high-performance core(s) of an Asymmetric Chip Multiprocessor (ACMP) to accelerate the execution of critical sections. In ACS, selected critical sections are executed by a high-performance core, which can execute the critical section faster than the other, smaller cores. As a result, ACS reduces serialization: it lowers the likelihood of threads waiting for a critical section to finish. Our evaluation on a set of 12 critical-section-intensive workloads shows that ACS reduces the average execution time by 34% compared to an equal-area 32-core symmetric CMP and by 23% compared to an equal-area ACMP. Moreover, for 7 out of the 12 workloads, ACS improves scalability by increasing the number of threads at which performance saturates.

Producing wrong data without doing anything obviously wrong!
Todd Mytkowicz,
Amer Diwan,
Matthias Hauswirth,
Peter F. Sweeney
Pages: 265-276
doi: 10.1145/1508244.1508275
This paper presents a surprising result: changing a seemingly innocuous aspect of an experimental setup can cause a systems researcher to draw wrong conclusions from an experiment. What appears to be an innocuous aspect in the experimental setup may in fact introduce a significant bias in an evaluation. This phenomenon is called measurement bias in the natural and social sciences. Our results demonstrate that measurement bias is significant and commonplace in computer system evaluation. By significant we mean that measurement bias can lead to a performance analysis that either over-states an effect or even yields an incorrect conclusion. By commonplace we mean that measurement bias occurs in all architectures that we tried (Pentium 4, Core 2, and m5 O3CPU), both compilers that we tried (gcc and Intel's C compiler), and most of the SPEC CPU2006 C programs. Thus, we cannot ignore measurement bias. Nevertheless, in a literature survey of 133 recent papers from ASPLOS, PACT, PLDI, and CGO, we determined that none of the papers with experimental results adequately consider measurement bias. Inspired by similar problems and their solutions in other sciences, we describe and demonstrate two methods, one for detecting (causal analysis) and one for avoiding (setup randomization) measurement bias.
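
A minimal sketch of setup randomization (entirely synthetic data; `timed_run` stands in for one measured execution under one experimental setup):

```python
# Minimal sketch: re-randomize the setup on every run so a bias (e.g., a
# lucky memory layout) shows up as variance instead of a phantom speedup.
import random
import statistics

def timed_run(true_cost, setup_bias):
    # One measurement: true cost plus setup-dependent bias plus noise.
    return true_cost + setup_bias + random.gauss(0, 0.2)

def evaluate(true_cost, trials=200):
    times = [timed_run(true_cost, setup_bias=random.uniform(-2, 2))
             for _ in range(trials)]      # fresh randomized setup per run
    return statistics.mean(times), statistics.stdev(times)

random.seed(1)
base_mean, base_sd = evaluate(true_cost=100.0)   # baseline version
opt_mean, opt_sd = evaluate(true_cost=99.0)      # genuinely 1% faster
print(f"baseline {base_mean:.1f}+/-{base_sd:.1f}, "
      f"optimized {opt_mean:.1f}+/-{opt_sd:.1f}")
```

Reporting the spread across randomized setups lets a reader judge whether an observed effect exceeds what setup bias alone could produce.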

SESSION: Managed systems

Leak pruning
Michael D. Bond,
Kathryn S. McKinley
Pages: 277-288
doi: 10.1145/1508244.1508277
Managed languages improve programmer productivity with type safety and garbage collection, which eliminate memory errors such as dangling pointers, double frees, and buffer overflows. However, because garbage collection uses reachability to over-approximate live objects, programs may still leak memory if programmers forget to eliminate the last reference to an object that will not be used again. Leaks slow programs by increasing collector workload and frequency. Growing leaks eventually crash programs. This paper introduces leak pruning, which keeps programs running by predicting and reclaiming leaked objects at run time. It predicts dead objects and reclaims them based on observing data structure usage patterns. Leak pruning preserves semantics because it waits for heap exhaustion before reclaiming objects and poisons references to objects it reclaims. If the program later tries to access a poisoned reference, the virtual machine (VM) throws an error. We show leak pruning has low overhead in a Java VM and evaluate it on 10 leaking programs. Leak pruning does not help two programs, executes five substantial programs 1.6-81X longer, and executes three programs, including a leak in Eclipse, for at least 24 hours. In the worst case, leak pruning defers fatal errors. In the best case, it keeps leaky programs running with preserved semantics and consistent throughput.
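
A minimal sketch of the poisoning contract (assumptions mine; the real mechanism lives inside the JVM's garbage collector, not at the application level):

```python
# Minimal sketch: reclaim a predicted-dead object only at heap exhaustion,
# leave a poisoned reference behind, and raise an error on any later access,
# so a wrong prediction becomes an error rather than silent corruption.
class Poisoned:
    def __getattr__(self, name):
        raise MemoryError("access to pruned (predicted-dead) object")

class Ref:
    def __init__(self, obj):
        self.obj = obj
    def prune(self):
        self.obj = Poisoned()   # reclaim the object, poison the reference
    def get(self):
        return self.obj

r = Ref({"big": "leaked structure"})
r.prune()                  # heap exhausted: prune instead of crashing now
try:
    r.get().keys()         # the liveness prediction was wrong after all
except MemoryError as e:
    print(e)               # semantics preserved: a VM error is thrown
```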

Dynamic prediction of collection yield for managed runtimes
Michal Wegiel,
Chandra Krintz
Pages: 289-300
doi: 10.1145/1508244.1508278
The growth in complexity of modern systems makes it increasingly difficult to extract high performance. The software stacks for such systems typically consist of multiple layers and include managed runtime environments (MREs). In this paper, we investigate techniques to improve cooperation between these layers and the hardware to increase the efficacy of automatic memory management in MREs. General-purpose MREs commonly implement parallel and/or concurrent garbage collection and employ compaction to eliminate heap fragmentation. Moreover, most systems trigger collection based on the amount of heap a program uses. Our analysis shows that in many cases this strategy leads to ineffective collections that are unable to reclaim sufficient space to justify the incurred cost. To avoid such collections, we exploit the observation that dead objects tend to cluster together and form large, never-referenced regions in the address space that correlate well with virtual pages that have not recently been referenced by the application. We leverage this correlation to design a new, simple and light-weight yield predictor that estimates the amount of reclaimable space in the heap using hardware page reference bits. Our predictor allows MREs to avoid low-yield collections and thereby improve resource management. We integrate this predictor into three state-of-the-art parallel compactors, implemented in the HotSpot JVM, that represent distinct canonical heap layouts. Our empirical evaluation, based on standard Java benchmarks and open-source applications, indicates that inexpensive and accurate yield prediction can improve performance significantly.

TwinDrivers: semi-automatic derivation of fast and safe hypervisor network drivers from guest OS drivers
Aravind Menon,
Simon Schubert,
Willy Zwaenepoel
Pages: 301-312
doi: 10.1145/1508244.1508279
In a virtualized environment, device drivers are often run inside a virtual machine (VM) rather than in the hypervisor, for reasons of safety and reduction in software engineering effort. Unfortunately, this approach results in poor performance for I/O-intensive devices such as network cards. The alternative approach of running device drivers directly in the hypervisor yields better performance, but results in the loss of safety guarantees for the hypervisor and incurs additional software engineering costs. In this paper we present TwinDrivers, a framework which allows us to semi-automatically create safe and efficient hypervisor drivers from guest OS drivers. The hypervisor driver runs directly in the hypervisor, but its data resides completely in the driver VM address space. A Software Virtual Memory mechanism allows the driver to access its VM data efficiently from the hypervisor running in any guest context, and also protects the hypervisor from invalid memory accesses from the driver. An upcall mechanism allows the hypervisor to largely reuse the driver support infrastructure present in the VM. The TwinDrivers system thus combines most of the performance benefits of hypervisor-based driver approaches with the safety and software engineering benefits of VM-based driver approaches. Using the TwinDrivers hypervisor driver, we are able to improve the guest domain networking throughput in Xen by a factor of 2.4 for transmit workloads and 2.1 for receive workloads, both in CPU-scaled units, and achieve close to 64-67% of native Linux throughput.

SESSION: Architectures

Phantom-BTB: a virtualized branch target buffer design
Ioana Burcea,
Andreas Moshovos
Pages: 313-324
doi: 10.1145/1508244.1508281
Modern processors use branch target buffers (BTBs) to predict the target address of branches such that they can fetch ahead in the instruction stream, increasing concurrency and performance. Ideally, BTBs would be sufficiently large to capture the entire working set of the application and sufficiently small for fast access and practical on-chip dedicated storage. Depending on the application, these requirements are at odds. This work introduces a BTB design that accommodates large instruction footprints without dedicating expensive on-chip resources. In the proposed Phantom-BTB (PBTB) design, a conventional BTB is augmented with a virtual table that collects branch target information as the application runs. The virtual table does not have fixed dedicated storage. Instead, it is transparently allocated, on demand, in the on-chip caches, at cache line granularity. The entries in the virtual table are proactively prefetched and installed in the dedicated conventional BTB, thus increasing its perceived capacity. Experimental results with commercial workloads under full-system simulation demonstrate that PBTB improves IPC performance over a 1K-entry BTB by 6.9% on average and up to 12.7%, with a storage overhead of only 8%. Overall, the virtualized design performs within 1% of a conventional 4K-entry, single-cycle access BTB, while the dedicated storage is 3.6 times smaller.

StreamRay: a stream filtering architecture for coherent ray tracing
Karthik Ramani,
Christiaan P. Gribble,
Al Davis
Pages: 325-336
doi: 10.1145/1508244.1508282
The wide availability of commodity graphics processors has made real-time graphics an intrinsic component of the human/computer interface. These graphics cores accelerate the z-buffer algorithm and provide a highly interactive experience at a relatively low cost. However, many applications in entertainment, science, and industry require high quality lighting effects such as accurate shadows, reflection, and refraction. These effects can be difficult to achieve with z-buffer algorithms but are straightforward to implement using ray tracing. Although ray tracing is computationally more complex, the algorithm exhibits excellent scaling and parallelism properties. Nevertheless, ray tracing memory access patterns are difficult to predict and the parallelism speedup promise is therefore hard to achieve. This paper highlights a novel approach to ray tracing based on stream filtering and presents StreamRay, a multicore wide SIMD microarchitecture that delivers interactive frame rates of 15-32 frames/second for scenes of high geometric complexity and exhibits high utilization for SIMD widths ranging from eight to 16 elements. StreamRay consists of two main components: the ray engine, which is responsible for stream assembly and employs address generation units that generate addresses to form large SIMD vectors, and the filter engine, which implements the ray tracing operations with programmable accelerators. Results demonstrate that separating address and data processing reduces data movement and resource contention. Performance improves by 56% while simultaneously providing 11.63% power savings per accelerator core compared to a design which does not use separate resources for address and data computations.

Architectural support for SWAR text processing with parallel bit streams: the inductive doubling principle
Robert D. Cameron,
Dan Lin
Pages: 337-348
doi: 10.1145/1508244.1508283
Parallel bit stream algorithms exploit the SWAR (SIMD within a register) capabilities of commodity processors in high-performance text processing applications such as UTF-8 to UTF-16 transcoding, XML parsing, string search and regular expression matching. Direct architectural support for these algorithms in future SWAR instruction sets could further increase performance as well as simplify the programming task. A set of simple SWAR instruction set extensions is proposed for this purpose based on the principle of systematic support for inductive doubling as an algorithmic technique. These extensions are shown to significantly reduce instruction count in core parallel bit stream algorithms, often providing a 3X or better improvement. The extensions are also shown to be useful for SWAR programming in other application areas, including providing a systematic treatment for horizontal operations. An implementation model for these extensions involves relatively simple circuitry added to the operand fetch components in a pipelined processor.
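
As a flavor of inductive doubling, here is a minimal sketch using a classic SWAR idiom (an existing technique the paper generalizes, not the proposed extensions): counting bits by combining fields whose width doubles at each step.

```python
# Minimal sketch: population count over a 32-bit word in log2(32) steps,
# operating on fields of width 1, then 2, then 4, ... within one register.
def popcount32(x):
    x = (x & 0x55555555) + ((x >> 1) & 0x55555555)   # 1-bit -> 2-bit fields
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333)   # 2-bit -> 4-bit fields
    x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F)   # 4-bit -> 8-bit fields
    x = (x & 0x00FF00FF) + ((x >> 8) & 0x00FF00FF)   # 8-bit -> 16-bit fields
    return (x & 0x0000FFFF) + (x >> 16)              # 16-bit -> 32-bit

print(popcount32(0b1011_0010_1111))  # -> 8
```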