SESSION: Special purpose to warehouse computers
Chair: N. Jouppi
|
|
|
|
|
Anton, a special-purpose machine for molecular dynamics simulation
David E. Shaw,
Martin M. Deneroff,
Ron O. Dror,
Jeffrey S. Kuskin,
Richard H. Larson,
John K. Salmon,
Cliff Young,
Brannon Batson,
Kevin J. Bowers,
Jack C. Chao,
Michael P. Eastwood,
Joseph Gagliardo,
J. P. Grossman,
C. Richard Ho,
Douglas J. Ierardi,
István Kolossváry,
John L. Klepeis,
Timothy Layman,
Christine McLeavey,
Mark A. Moraes,
Rolf Mueller,
Edward C. Priest,
Yibing Shan,
Jochen Spengler,
Michael Theobald,
Brian Towles,
Stanley C. Wang
Pages: 1-12
doi: 10.1145/1250662.1250664

The ability to perform long, accurate molecular dynamics (MD) simulations involving proteins and other biological macromolecules could in principle provide answers to some of the most important currently outstanding questions in the fields of biology, chemistry and medicine. A wide range of biologically interesting phenomena, however, occur over time scales on the order of a millisecond, about three orders of magnitude beyond the duration of the longest current MD simulations. In this paper, we describe a massively parallel machine called Anton, which should be capable of executing millisecond-scale classical MD simulations of such biomolecular systems. The machine, which is scheduled for completion by the end of 2008, is based on 512 identical MD-specific ASICs that interact in a tightly coupled manner using a specialized high-speed communication network. Anton has been designed to use both novel parallel algorithms and special-purpose logic to dramatically accelerate those calculations that dominate the time required for a typical MD simulation. The remainder of the simulation algorithm is executed by a programmable portion of each chip that achieves a substantial degree of parallelism while preserving the flexibility necessary to accommodate anticipated advances in physical models and simulation methods.
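The three-orders-of-magnitude gap the abstract describes can be made concrete with a back-of-the-envelope step count; the 2.5 fs timestep below is a typical classical-MD value chosen for illustration, not a figure taken from the paper.

```python
# Rough estimate of how many integration steps a millisecond-scale
# MD simulation requires. The timestep is an assumed typical value
# (femtosecond-scale), not a number from the Anton paper.
TIMESTEP_FS = 2.5            # femtoseconds of simulated time per step (assumed)
TARGET_MS = 1.0              # simulation goal, in milliseconds

target_fs = TARGET_MS * 1e12          # 1 ms = 10^12 fs
steps = target_fs / TIMESTEP_FS       # integration steps required

print(f"{steps:.1e} timesteps")       # on the order of 10^11 steps
```

At even a million timesteps per second of wall-clock time, that is several days of continuous execution, which is why the calculation dominating each step is pushed into special-purpose logic.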
|
Power provisioning for a warehouse-sized computer
Xiaobo Fan,
Wolf-Dietrich Weber,
Luiz Andre Barroso
Pages: 13-23
doi: 10.1145/1250662.1250665

Large-scale Internet services require a computing infrastructure that can be appropriately described as a warehouse-sized computing system. The cost of building datacenter facilities capable of delivering a given power capacity to such a computer can rival the recurring energy consumption costs themselves. Therefore, there are strong economic incentives to operate facilities as close as possible to maximum capacity, so that the non-recurring facility costs can be best amortized. That is difficult to achieve in practice because of uncertainties in equipment power ratings and because power consumption tends to vary significantly with the actual computing activity. Effective power provisioning strategies are needed to determine how much computing equipment can be safely and efficiently hosted within a given power budget. In this paper we present the aggregate power usage characteristics of large collections of servers (up to 15 thousand) for different classes of applications over a period of approximately six months. Those observations allow us to evaluate opportunities for maximizing the use of the deployed power capacity of datacenters, and assess the risks of over-subscribing it. We find that even in well-tuned applications there is a noticeable gap (7-16%) between achieved and theoretical aggregate peak power usage at the cluster level (thousands of servers). The gap grows to almost 40% in whole datacenters. This headroom can be used to deploy additional compute equipment within the same power budget with minimal risk of exceeding it. We use our modeling framework to estimate the potential of power management schemes to reduce peak power and energy usage. We find that the opportunities for power and energy savings are significant, but greater at the cluster-level (thousands of servers) than at the rack-level (tens). Finally we argue that systems need to be power efficient across the activity range, and not only at peak performance levels.
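The over-subscription opportunity can be illustrated numerically: if aggregate consumption at the datacenter level never reaches more than about 60% of the theoretical peak (the ~40% gap reported above), provisioning by measured rather than nameplate power admits substantially more machines. The budget and per-server wattage here are assumed illustrative values.

```python
# How many servers fit under one power budget, provisioned by
# theoretical (nameplate) peak vs. by observed aggregate peak.
# Budget and per-server figures are assumptions for illustration.
budget_w = 1_000_000          # facility power budget: 1 MW (assumed)
nameplate_w = 250             # theoretical peak per server (assumed)
observed_peak_frac = 0.60     # datacenter-level gap of ~40% (from the abstract)

naive_servers = budget_w // nameplate_w                           # provision by nameplate
actual_servers = int(budget_w / (nameplate_w * observed_peak_frac))  # provision by observed peak

print(naive_servers, actual_servers)   # 4000 vs 6666 servers
```

The difference (over 60% more machines in this toy setup) is why the paper treats the measured gap as deployable headroom rather than slack.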
|
SESSION: Transactions and synchronization
Chair: K. Asanovic
|
|
|
|
|
Making the fast case common and the uncommon case simple in unbounded transactional memory
Colin Blundell,
Joe Devietti,
E. Christopher Lewis,
Milo M. K. Martin
Pages: 24-34
doi: 10.1145/1250662.1250667

Hardware transactional memory has great potential to simplify the creation of correct and efficient multithreaded programs, allowing programmers to exploit more effectively the soon-to-be-ubiquitous multi-core designs. Several recent proposals have extended the original bounded transactional memory to unbounded transactional memory, a crucial step toward transactions becoming a general-purpose primitive. Unfortunately, supporting the concurrent execution of an unbounded number of unbounded transactions is challenging, and as a result, many proposed implementations are complex. This paper explores a different approach. First, we introduce the permissions-only cache to extend the bound at which transactions overflow, allowing the fast, bounded case to be used as frequently as possible. Second, we propose OneTM to simplify the implementation of unbounded transactional memory by bounding the concurrency of transactions that overflow the cache. These mechanisms work synergistically to provide a simple and fast unbounded transactional memory system. The permissions-only cache efficiently maintains the coherence permissions, but not the data, for blocks read or written transactionally that have been evicted from the processor's caches. By holding coherence permissions for these blocks, the regular cache coherence protocol can be used to detect transactional conflicts using only a few bits of on-chip storage per overflowed cache block. OneTM allows only one overflowed transaction at a time, relying on the permissions-only cache to ensure that overflow is infrequent. We present two implementations. In OneTM-Serialized, an overflowed transaction simply stalls all other threads in the application. In OneTM-Concurrent, non-overflowed transactions and non-transactional code can execute concurrently with the overflowed transaction, providing more concurrency while retaining OneTM's core simplifying assumption.
|
Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures
Weirong Zhu,
Vugranam C Sreedhar,
Ziang Hu,
Guang R. Gao
Pages: 35-45
doi: 10.1145/1250662.1250668

Efficient fine-grain synchronization is extremely important to effectively harness the computational power of many-core architectures. However, designing and implementing fine-grain synchronization in such architectures presents several challenges, including issues of synchronization-induced overhead, storage cost, scalability, and the level of granularity to which synchronization is applicable. This paper proposes the Synchronization State Buffer (SSB), a scalable architectural design for fine-grain synchronization that efficiently performs synchronizations between concurrent threads. The design of SSB is motivated by the following observation: at any instant during parallel execution, only a small fraction of memory locations are actively participating in synchronization. Based on this observation we present a fine-grain synchronization design that records and manages the states of frequently synchronized data using modest hardware support. We have implemented the SSB design in the context of the 160-core IBM Cyclops-64 architecture. Using detailed simulation, we present our experience for a set of benchmarks with different workload characteristics.
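The key observation above (only a few locations synchronize at any instant) suggests keeping state only for active addresses rather than tagging every memory word. A minimal software analogue is sketched below; the class name, capacity, and fallback behavior are illustrative assumptions, not details of the hardware design.

```python
# Software analogue of the SSB observation: track synchronization
# state only for the small set of currently active addresses, in a
# fixed-size buffer. Capacity and interface are illustrative.
class SyncStateBuffer:
    def __init__(self, capacity=16):
        self.capacity = capacity
        self.states = {}            # address -> owning thread id

    def try_lock(self, addr, tid):
        """Acquire a fine-grain lock on addr; False on conflict or full buffer."""
        if addr in self.states:
            return self.states[addr] == tid    # conflict unless tid already owns it
        if len(self.states) >= self.capacity:
            return False                       # buffer full: caller falls back (e.g. to software sync)
        self.states[addr] = tid
        return True

    def unlock(self, addr, tid):
        """Release addr if owned by tid, freeing its buffer entry."""
        if self.states.get(addr) == tid:
            del self.states[addr]

ssb = SyncStateBuffer()
assert ssb.try_lock(0x1000, tid=1)        # first acquirer wins
assert not ssb.try_lock(0x1000, tid=2)    # conflict detected
ssb.unlock(0x1000, tid=1)
assert ssb.try_lock(0x1000, tid=2)        # entry reusable after release
```

The point of the sketch is the storage trade-off: state proportional to the number of *concurrent* synchronizations, not to the size of memory.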
|
SESSION: Virtual caches and hierarchies
Chair: M. Martonosi
|
|
|
|
|
Virtual hierarchies to support server consolidation
Michael R. Marty,
Mark D. Hill
Pages: 46-56
doi: 10.1145/1250662.1250670

Server consolidation is becoming an increasingly popular technique to manage and utilize systems. This paper develops CMP memory systems for server consolidation where most sharing occurs within Virtual Machines (VMs). Our memory systems maximize shared memory accesses serviced within a VM, minimize interference among separate VMs, facilitate dynamic reassignment of VMs to processors and memory, and support content-based page sharing among VMs. We begin with a tiled architecture where each of 64 tiles contains a processor, private L1 caches, and an L2 bank. First, we reveal why single-level directory designs fail to meet workload consolidation goals. Second, we develop the paper's central idea of imposing a two-level virtual (or logical) coherence hierarchy on a physically flat CMP that harmonizes with VM assignment. Third, we show that the best of our two virtual hierarchy (VH) variants performs 12-58% better than the best alternative flat directory protocol when consolidating Apache, OLTP, and Zeus commercial workloads on our simulated 64-core CMP.
|
Virtual private caches
Kyle J. Nesbit,
James Laudon,
James E. Smith
Pages: 57-68
doi: 10.1145/1250662.1250671

Virtual Private Machines (VPMs) provide a framework for Quality of Service (QoS) in CMP-based computer systems. VPMs incorporate microarchitecture mechanisms that allow shares of hardware resources to be allocated to executing threads, thus providing applications with an upper bound on execution time regardless of other thread activity. Virtual Private Caches (VPCs) are an important element of VPMs. VPC hardware consists of two major components: the VPC Arbiter, which manages shared cache bandwidth, and the VPC Capacity Manager, which manages the cache storage. Both the VPC Arbiter and VPC Capacity Manager provide minimum service guarantees that, when combined, achieve QoS for the cache subsystem. Simulation-based evaluation shows that conventional cache bandwidth management policies allow concurrently executing threads to affect each other significantly in an uncontrollable manner. The evaluation targets cache bandwidth because the effects of cache capacity sharing have been studied elsewhere. In contrast with the conventional policies, the VPC Arbiter meets its QoS performance objectives on all workloads studied and over a range of allocated bandwidth levels. The VPC Arbiter's fairness policy, which distributes leftover bandwidth, mitigates the effects of cache preemption latencies, thus ensuring threads a high degree of performance isolation. Furthermore, the VPC Arbiter eliminates negative bandwidth interference, which can improve aggregate throughput and resource utilization.
|
SESSION: Transactions
Chair: M. Tremblay
|
|
|
|
|
An effective hybrid transactional memory system with strong isolation guarantees
Chi Cao Minh,
Martin Trautmann,
JaeWoong Chung,
Austen McDonald,
Nathan Bronson,
Jared Casper,
Christos Kozyrakis,
Kunle Olukotun
Pages: 69-80
doi: 10.1145/1250662.1250673

We propose signature-accelerated transactional memory (SigTM), a hybrid TM system that reduces the overhead of software transactions. SigTM uses hardware signatures to track the read-set and write-set for pending transactions and perform conflict detection between concurrent threads. All other transactional functionality, including data versioning, is implemented in software. Unlike previously proposed hybrid TM systems, SigTM requires no modifications to the hardware caches, which reduces hardware cost and simplifies support for nested transactions and multithreaded processor cores. SigTM is also the first hybrid TM system to provide strong isolation guarantees between transactional blocks and non-transactional accesses without additional read and write barriers in non-transactional code. Using a set of parallel programs that make frequent use of coarse-grain transactions, we show that SigTM accelerates software transactions by 30% to 280%. For certain workloads, SigTM can match the performance of a full-featured hardware TM system, while for workloads with large read-sets it can be up to two times slower. Overall, we show that SigTM combines the performance characteristics and strong isolation guarantees of hardware TM implementations with the low cost and flexibility of software TM systems.
|
Performance pathologies in hardware transactional memory
Jayaram Bobba,
Kevin E. Moore,
Haris Volos,
Luke Yen,
Mark D. Hill,
Michael M. Swift,
David A. Wood
Pages: 81-91
doi: 10.1145/1250662.1250674

Hardware Transactional Memory (HTM) systems reflect choices from three key design dimensions: conflict detection, version management, and conflict resolution. Previously proposed HTMs represent three points in this design space: lazy conflict detection, lazy version management, committer wins (LL); eager conflict detection, lazy version management, requester wins (EL); and eager conflict detection, eager version management, and requester stalls with conservative deadlock avoidance (EE). To isolate the effects of these high-level design decisions, we develop a common framework that abstracts away differences in cache write policies, interconnects, and ISA to compare these three design points. Not surprisingly, the relative performance of these systems depends on the workload. Under light transactional loads they perform similarly, but under heavy loads they differ by up to 80%. None of the systems performs best on all of our benchmarks. We identify seven performance pathologies (interactions between workload and system that degrade performance) as the root cause of many performance differences: FriendlyFire, StarvingWriter, SerializedCommit, FutileStall, StarvingElder, RestartConvoy, and DuelingUpgrades. We discuss when and on which systems these pathologies can occur and show that they actually manifest within TM workloads. The insight provided by these pathologies motivated four enhanced systems that often significantly reduce transactional memory overhead. Importantly, by avoiding transaction pathologies, each enhanced system performs well across our suite of benchmarks.
|
MetaTM/TxLinux: transactional memory for an operating system
Hany E. Ramadan,
Christopher J. Rossbach,
Donald E. Porter,
Owen S. Hofmann,
Aditya Bhandari,
Emmett Witchel
Pages: 92-103
doi: 10.1145/1250662.1250675

This paper quantifies the effect of architectural design decisions on the performance of TxLinux. TxLinux is a Linux kernel modified to use transactions in place of locking primitives in several key subsystems. We run TxLinux on MetaTM, which is a new hardware transactional memory (HTM) model. MetaTM contains features that enable efficient and correct interrupt handling for an x86-like architecture. Live stack overwrites can corrupt non-transactional stack memory and require a small change to the transaction register checkpoint hardware to ensure correct operation of the operating system. We also propose stack-based early release to reduce spurious conflicts on stack memory between kernel code and interrupt handlers. We use MetaTM to examine the performance sensitivity of individual architectural features. For TxLinux we find that Polka and SizeMatters are effective contention management policies, some form of backoff on transaction contention is vital for performance, and stalling on a transaction conflict reduces transaction restart rates but does not improve performance. Transaction write sets are small, and performance is insensitive to transaction abort costs but sensitive to commit costs.
|
An integrated hardware-software approach to flexible transactional memory
Arrvindh Shriraman,
Michael F. Spear,
Hemayet Hossain,
Virendra J. Marathe,
Sandhya Dwarkadas,
Michael L. Scott
Pages: 104-115
doi: 10.1145/1250662.1250676

There has been considerable recent interest in both hardware and software transactional memory (TM). We present an intermediate approach, in which hardware serves to accelerate a TM implementation controlled fundamentally by software. Specifically, we describe an alert-on-update mechanism (AOU) that allows a thread to receive fast, asynchronous notification when previously identified lines are written by other threads, and a programmable data isolation mechanism (PDI) that allows a thread to hide its speculative writes from other threads, ignoring conflicts, until software decides to make them visible. These mechanisms reduce bookkeeping, validation, and copying overheads without constraining software policy on a host of design decisions. We have used AOU and PDI to implement a hardware-accelerated software transactional memory system we call RTM. We have also used AOU alone to create a simpler "RTM-Lite". Across a range of microbenchmarks, RTM outperforms RSTM, a publicly available software transactional memory system, by as much as 8.7x (geometric mean of 3.5x) in single-thread mode. At 16 threads, it outperforms RSTM by as much as 5x, with an average speedup of 2x. Performance degrades gracefully when transactions overflow hardware structures. RTM-Lite is slightly faster than RTM for transactions that modify only small objects; full RTM is significantly faster when objects are large. In a strong argument for policy flexibility, we find that the choice between eager (first-access) and lazy (commit-time) conflict detection can lead to significant performance differences in both directions, depending on application characteristics.
|
SESSION: Networks and routers
Chair: M. Taylor
|
|
|
|
|
Rotary router: an efficient architecture for CMP interconnection networks
Pablo Abad,
Valentin Puente,
José Angel Gregorio,
Pablo Prieto
Pages: 116-125
doi: 10.1145/1250662.1250678

The trend towards increasing the number of processor cores and cache capacity in future Chip-Multiprocessors (CMPs) will require scalable packet-switched interconnection networks adapted to the restrictions imposed by the CMP environment. This paper presents an innovative router design, which successfully addresses CMP cost/performance constraints. The router structure is based on two independent rings, which force packets to circulate either clockwise or anti-clockwise, traveling through every port of the router. It uses a completely decentralized scheduling scheme, which allows the design to: (1) take advantage of wide links, (2) reduce head-of-line blocking, (3) use adaptive routing, (4) be topology agnostic, (5) scale with network degree, and (6) have reasonable power consumption and implementation cost. A thorough comparative performance analysis against competitive conventional routers shows an advantage for our proposal of up to 50% in terms of raw performance and nearly 60% in terms of energy-delay product.
|
Flattened butterfly: a cost-efficient topology for high-radix networks
John Kim,
William J. Dally,
Dennis Abts
Pages: 126-137
doi: 10.1145/1250662.1250679

Increasing integrated-circuit pin bandwidth has motivated a corresponding increase in the degree or radix of interconnection networks and their routers. This paper introduces the flattened butterfly, a cost-efficient topology for high-radix networks. On benign (load-balanced) traffic, the flattened butterfly approaches the cost/performance of a butterfly network and has roughly half the cost of a comparable-performance Clos network. The advantage over the Clos is achieved by eliminating redundant hops when they are not needed for load balance. On adversarial traffic, the flattened butterfly matches the cost/performance of a folded-Clos network and provides an order of magnitude better performance than a conventional butterfly. In this case, global adaptive routing is used to switch the flattened butterfly from minimal to non-minimal routing, using redundant hops only when they are needed. Minimal and non-minimal, oblivious and adaptive routing algorithms are evaluated on the flattened butterfly. We show that load-balancing adversarial traffic requires non-minimal globally-adaptive routing and show that sequential allocators are required to avoid transient load imbalance when using adaptive routing algorithms. We also compare the cost of the flattened butterfly to folded-Clos, hypercube, and butterfly networks with identical capacity and show that the flattened butterfly is more cost-efficient than folded-Clos and hypercube topologies.
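Because a flattened butterfly fully connects the routers along each dimension, a minimal route takes at most one hop per dimension, which is what makes the redundant hops of a conventional butterfly eliminable. A small illustrative hop-count model (the coordinates and network size are assumptions for the example, not parameters from the paper):

```python
# Minimal-routing hop count in a flattened butterfly: routers along
# each dimension are fully connected, so a packet needs exactly one
# hop per dimension in which source and destination differ.
def minimal_hops(src, dst):
    """Hops between router coordinates: one per differing dimension."""
    return sum(1 for s, d in zip(src, dst) if s != d)

# Example: a 2-D flattened butterfly of 8x8 routers (assumed size).
print(minimal_hops((0, 0), (5, 7)))   # 2 hops: differs in both dimensions
print(minimal_hops((0, 0), (0, 3)))   # 1 hop: differs in one dimension
```

Non-minimal (Valiant-style) routing through a random intermediate router at most doubles these counts, which is the "redundant hops only when needed" trade-off the abstract describes.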
|
A novel dimensionally-decomposed router for on-chip communication in 3D architectures
Jongman Kim,
Chrysostomos Nicopoulos,
Dongkook Park,
Reetuparna Das,
Yuan Xie,
Vijaykrishnan Narayanan,
Mazin S. Yousif,
Chita R. Das
Pages: 138-149
doi: 10.1145/1250662.1250680

Much like multi-storey buildings in densely packed metropolises, three-dimensional (3D) chip structures are envisioned as a viable solution to skyrocketing transistor densities and burgeoning die sizes in multi-core architectures. Partitioning a larger die into smaller segments and then stacking them in a 3D fashion can significantly reduce latency and energy consumption. Such benefits emanate from the notion that inter-wafer distances are negligible compared to intra-wafer distances. This attribute substantially reduces global wiring length in 3D chips. The work in this paper integrates the increasingly popular idea of packet-based Networks-on-Chip (NoC) into a 3D setting. While NoCs have been studied extensively in the 2D realm, the microarchitectural ramifications of moving into the third dimension have yet to be fully explored. This paper presents a detailed exploration of inter-strata communication architectures in 3D NoCs. Three design options are investigated: a simple bus-based inter-wafer connection, a hop-by-hop standard 3D design, and a full 3D crossbar implementation. In this context, we propose a novel partially-connected 3D crossbar structure, called the 3D Dimensionally-Decomposed (DimDe) Router, which provides a good tradeoff between circuit complexity and performance benefits. Simulation results using (a) a stand-alone cycle-accurate 3D NoC simulator running synthetic workloads, and (b) a hybrid 3D NoC/cache simulation environment running real commercial and scientific benchmarks, indicate that the proposed DimDe design provides latency and throughput improvements of over 20% on average over the other 3D architectures, while remaining within 5% of the full 3D crossbar performance. Furthermore, based on synthesized hardware implementations in 90 nm technology, the DimDe architecture outperforms all other designs, including the full 3D crossbar, by an average of 26% in terms of the Energy-Delay Product (EDP).
|
Express virtual channels: towards the ideal interconnection fabric
Amit Kumar,
Li-Shiuan Peh,
Partha Kundu,
Niraj K. Jha
Pages: 150-161
doi: 10.1145/1250662.1250681

Due to wire delay scalability and bandwidth limitations inherent in shared buses and dedicated links, packet-switched on-chip interconnection networks are fast emerging as the pervasive communication fabric to connect different processing elements in many-core chips. However, current state-of-the-art packet-switched networks rely on complex routers, which increases the communication overhead and energy consumption as compared to the ideal interconnection fabric. In this paper, we try to close the gap between the state-of-the-art packet-switched network and the ideal interconnect by proposing express virtual channels (EVCs), a novel flow control mechanism which allows packets to virtually bypass intermediate routers along their path in a completely non-speculative fashion, thereby lowering the energy/delay towards that of a dedicated wire while simultaneously approaching ideal throughput with a practical design suitable for on-chip networks. Our evaluation results, using a detailed cycle-accurate simulator on a range of synthetic traffic and SPLASH benchmark traces, show up to 84% reduction in packet latency and up to 23% improvement in throughput, while reducing the average router energy consumption by up to 38% over an existing state-of-the-art packet-switched design. When compared to the ideal interconnect, EVCs add just two cycles to the no-load latency, and are within 14% of the ideal throughput. Moreover, we show that the proposed design incurs a minimal hardware overhead while exhibiting excellent scalability with increasing network sizes.
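The benefit of bypassing intermediate routers can be seen with a simple first-order latency model; the pipeline depths below are assumed illustrative values, not figures from the paper.

```python
# First-order latency model for express-virtual-channel bypassing:
# routers on an express path are crossed in fewer cycles than routers
# traversed through the full pipeline. All cycle counts are assumed.
ROUTER_CYCLES = 3     # cycles through a full router pipeline (assumed)
BYPASS_CYCLES = 1     # cycles to bypass an intermediate router (assumed)
LINK_CYCLES = 1       # cycles per link traversal (assumed)

def latency(hops, bypassed=0):
    """No-load latency across `hops` routers, with `bypassed` of them
    crossed via an express virtual channel instead of the full pipeline."""
    full = hops - bypassed
    return full * ROUTER_CYCLES + bypassed * BYPASS_CYCLES + hops * LINK_CYCLES

print(latency(8))               # baseline: every router fully traversed
print(latency(8, bypassed=6))   # EVC path bypassing 6 intermediate routers
```

Even in this toy model the 8-hop path drops from 32 to 20 cycles, illustrating how bypassing pushes latency toward that of a dedicated wire.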
|
SESSION: Atomic regions and fine-grained parallelism
Chair: M. Martin
|
|
|
|
|
Carbon: architectural support for fine-grained parallelism on chip multiprocessors
Sanjeev Kumar,
Christopher J. Hughes,
Anthony Nguyen
Pages: 162-173
doi: 10.1145/1250662.1250683

Chip multiprocessors (CMPs) are now commonplace, and the number of cores on a CMP is likely to grow steadily. However, in order to harness the additional compute resources of a CMP, applications must expose their thread-level parallelism to the hardware. One common approach to doing this is to decompose a program into parallel "tasks" and allow an underlying software layer to schedule these tasks to different threads. Software task scheduling can provide good parallel performance as long as tasks are large compared to the software overheads. We examine a set of applications from an important emerging domain: Recognition, Mining, and Synthesis (RMS). Many RMS applications are compute-intensive and have abundant thread-level parallelism, and are therefore good targets for running on a CMP. However, a significant number have small tasks for which software task schedulers achieve only limited parallel speedups. We propose Carbon, a hardware technique to accelerate dynamic task scheduling on scalable CMPs. Carbon has relatively simple hardware, most of which can be placed far from the cores. We compare Carbon to some highly tuned software task schedulers for a set of RMS benchmarks with small tasks. Carbon delivers significant performance improvements over the best software scheduler: on average for 64 cores, 68% faster on a set of loop-parallel benchmarks, and 109% faster on a set of task-parallel benchmarks.
|
Hardware atomicity for reliable software speculation
Naveen Neelakantam,
Ravi Rajwar,
Suresh Srinivas,
Uma Srinivasan,
Craig Zilles
Pages: 174-185
doi: 10.1145/1250662.1250684

Speculative compiler optimizations are effective both in improving single-thread performance and in reducing power consumption, but their implementation introduces significant complexity, which can limit their adoption, limit their optimization scope, and negatively impact the reliability of the compilers that implement them. To eliminate much of this complexity, as well as increase the effectiveness of these optimizations, we propose that microprocessors provide architecturally-visible hardware primitives for atomic execution. These primitives provide to the compiler the ability to optimize the program's hot path in isolation, allowing the use of non-speculative formulations of optimization passes to perform speculative optimizations. Atomic execution guarantees that if a speculation invariant does not hold, the speculative updates are discarded, the register state is restored, and control is transferred to a non-speculative version of the code, thereby relieving the compiler of the responsibility of generating compensation code. We demonstrate the benefit of hardware atomicity in the context of a Java virtual machine. We find incorporating the notion of atomic regions into an existing compiler intermediate representation to be natural, requiring roughly 3,000 lines of code (~3% of a JVM's optimizing compiler), most of which were for region formation. Its incorporation creates new opportunities for existing optimization passes, as well as greatly simplifying the implementation of additional optimizations (e.g., partial inlining, partial loop unrolling, and speculative lock elision). These optimizations reduce dynamic instruction count by 11% on average and result in a 10-15% average speedup, relative to a baseline compiler with a similar degree of inlining.
|
|
|
SESSION: Core fusion and quantum |
| |
D. Burger
|
|
|
|
|
Core fusion: accommodating software diversity in chip multiprocessors |
| |
Engin Ipek,
Meyrem Kirman,
Nevin Kirman,
Jose F. Martinez
|
|
Pages: 186-197 |
|
doi>10.1145/1250662.1250686 |
|
Full text: PDF
|
|
This paper presents core fusion, a reconfigurable chip multiprocessor (CMP) architecture where groups of fundamentally independent cores can dynamically morph into a larger CPU, or be used as distinct processing elements, as needed at run time by applications. Core fusion gracefully accommodates software diversity and incremental parallelization in CMPs. It provides a single execution model across all configurations, requires no additional programming effort or specialized compiler support, maintains ISA compatibility, and leverages mature micro-architecture technology.
|
|
|
Tailoring quantum architectures to implementation style: a quantum computer for mobile and persistent qubits |
| |
Eric Chi,
Stephen A. Lyon,
Margaret Martonosi
|
|
Pages: 198-209 |
|
doi>10.1145/1250662.1250687 |
|
Full text: PDF
|
|
In recent years, quantum computing (QC) research has moved from the realm of theoretical physics and mathematics into real implementations. With many different potential hardware implementations, quantum computer architecture is a rich field with an opportunity to solve interesting new problems and to revisit old ones. This paper presents a QC architecture tailored to physical implementations with highly mobile and persistent quantum bits (qubits). Implementations with qubit coherency times that are much longer than operation times and qubit transportation times that are orders of magnitude faster than operation times lend greater flexibility to the architecture. This is particularly true in the placement and locality of individual qubits. For concreteness, we assume a physical device model based on electron-spin qubits on liquid helium (eSHe). Like many conventional computer architectures, QCs focus on the efficient exposure of parallelism. We present here a QC microarchitecture that enjoys increasing computational parallelism with size and latency scaling only linearly with the number of operations. Although an efficient and high level of parallelism is admirable, quantum hardware is still expensive and difficult to build, so we demonstrate how the software may be optimized to reduce an application's hardware requirements by 25% with no performance loss. Because the majority of a QC's time and resources are devoted to quantum error correction, we also present noise modeling results that evaluate error correction procedures. These results demonstrate that idle qubits in memory need only be refreshed approximately once every one hundred operation cycles.
|
|
|
SESSION: Streams to physics processors |
| |
B. Dally
|
|
|
|
|
A 64-bit stream processor architecture for scientific applications |
| |
Xuejun Yang,
Xiaobo Yan,
Zuocheng Xing,
Yu Deng,
Jiang Jiang,
Ying Zhang
|
|
Pages: 210-219 |
|
doi>10.1145/1250662.1250689 |
|
Full text: PDF
|
|
Stream architecture is a novel microprocessor architecture with wide application potential, but whether it can be used efficiently for scientific computing remains an open question. This paper first presents the design and implementation of a 64-bit stream processor, FT64 (Fei Teng 64), for scientific computing. The 64-bit extensions and the scientific-computing-oriented optimizations are described in terms of the instruction set architecture, stream controller, micro controller, ALU cluster, memory hierarchy, and interconnection interface. Second, two kinds of communication, message passing and stream communication, are put forward, and an interconnect based on them is designed for FT64-based high-performance computers. Third, a novel stream programming language, SF95 (Stream FORTRAN95), and its compiler, SF95Compiler, are developed to facilitate the development of scientific applications. Finally, nine typical scientific application kernels are tested, and the results show the efficiency of stream architecture for scientific computing.
|
|
|
Physical simulation for animation and visual effects: parallelization and characterization for chip multiprocessors |
| |
Christopher J. Hughes,
Radek Grzeszczuk,
Eftychios Sifakis,
Daehyun Kim,
Sanjeev Kumar,
Andrew P. Selle,
Jatin Chhugani,
Matthew Holliman,
Yen-Kuang Chen
|
|
Pages: 220-231 |
|
doi>10.1145/1250662.1250690 |
|
Full text: PDF
|
|
We explore the emerging application area of physics-based simulation for computer animation and visual special effects. In particular, we examine its parallelization potential and characterize its behavior on a chip multiprocessor (CMP). Applications in this domain model and simulate natural phenomena, and often direct visual components of motion pictures. We study a set of three workloads that exemplify the span and complexity of physical simulation applications used in a production environment: fluid dynamics, facial animation, and cloth simulation. They are computationally demanding, requiring from a few seconds to several minutes to simulate a single frame; therefore, they can benefit greatly from the acceleration possible with large-scale CMPs. Starting with serial versions of these applications, we parallelize code accounting for at least 96% of the serial execution time, targeting a large number of threads. We then study the most expensive modules using a simulated 64-core CMP. For the code representing key modules, we achieve parallel scaling of 45x, 50x, and 30x for fluid, face, and cloth simulations, respectively. The modules have a spectrum of parallel task granularity and locking behavior, and all but one are dominated by loop-level parallelism. Many modules operate on streams of data. In some cases, modules iterate over their data, leading to significant temporal locality. This streaming behavior leads to very high on-die and main-memory bandwidth requirements. Finally, most modules have little inter-thread communication since they are data-parallel, but a few require heavy communication between data-parallel operations.
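The 96% figure matters because Amdahl's law bounds whole-application speedup by the serial remainder, regardless of how well the parallelized modules scale. A quick check (a generic calculation, not a number from the paper):

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only `parallel_fraction` of the
    serial execution time is spread perfectly across `cores`."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

# Parallelizing 96% of serial time caps the whole-application speedup
# on a 64-core CMP at roughly 18x, far below the per-module scaling
# of 30-50x reported for the key kernels.
bound = amdahl_speedup(0.96, 64)
```

This is why the abstract emphasizes parallelizing "at least 96%" of the serial time: each additional percent of coverage moves this bound substantially.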
|
|
|
ParallAX: an architecture for real-time physics |
| |
Thomas Y. Yeh,
Petros Faloutsos,
Sanjay J. Patel,
Glenn Reinman
|
|
Pages: 232-243 |
|
doi>10.1145/1250662.1250691 |
|
Full text: PDF
|
|
Future interactive entertainment applications will feature the physical simulation of thousands of interacting objects using explosions, breakable objects, and cloth effects. While these applications require a tremendous amount of performance to satisfy the minimum frame rate of 30 FPS, there is a dramatic amount of parallelism in future physics workloads. How will future physics architectures leverage parallelism to achieve the real-time constraint? We propose and characterize a set of forward-looking benchmarks to represent future physics load and explore the design space of future physics processors. In response to the demand of this workload, we demonstrate an architecture with a set of powerful cores and caches to provide performance for the serial and coarse-grain parallel components of physics simulation, along with a flexible set of simple cores to exploit fine-grain parallelism. Our architecture combines intelligent, application-aware L2 management with dynamic coupling/allocation of simple cores to complex cores. Furthermore, we perform sensitivity analysis on interconnect alternatives to determine how tightly to couple these cores.
|
|
|
SESSION: Bricks, mortars, and microfluidics |
| |
T. Sherwood
|
|
|
|
|
Architectural implications of brick and mortar silicon manufacturing |
| |
Martha Mercaldi Kim,
Mojtaba Mehrara,
Mark Oskin,
Todd Austin
|
|
Pages: 244-253 |
|
doi>10.1145/1250662.1250693 |
|
Full text: PDF
|
|
We introduce a novel chip fabrication technique called "brick and mortar", in which chips are made from small, pre-fabricated ASIC bricks and bonded in a designer-specified arrangement to an inter-brick communication backbone chip. The goal of brick and mortar assembly is to provide a low-overhead method to produce custom chips, yet with performance that tracks an ASIC more closely than an FPGA. This paper examines the architectural design choices in this chip-design system. These choices include the definition of reasonable bricks, both in functionality and size, as well as the communication interconnect that the I/O cap provides. To do this we synthesize candidate bricks, analyze their area and bandwidth demands, and present an architectural design for the inter-brick communication network. We discuss a sample chip design, a 16-way CMP, and analyze the costs and benefits of designing chips with brick and mortar. We find that this method of producing chips incurs only a small performance loss (8%) compared to a fully custom ASIC, which is significantly less than the degradation seen from other low-overhead chip options, such as FPGAs. Finally, we measure the effect that architectural design decisions have on the behavior of the proposed physical brick assembly technique, fluidic self-assembly.
|
|
|
Aquacore: a programmable architecture for microfluidics |
| |
Ahmed M. Amin,
Mithuna Thottethodi,
T. N. Vijaykumar,
Steven Wereley,
Stephen C. Jacobson
|
|
Pages: 254-265 |
|
doi>10.1145/1250662.1250694 |
|
Full text: PDF
|
|
Advances in microfluidic research have enabled lab-on-a-chip (LoC) technology to achieve miniaturization and integration of biological and chemical analyses onto a single chip comprising channels, valves, mixers, heaters, separators, and sensors. These miniature instruments appear to offer the rare combination of faster, cheaper, and higher-precision analyses in comparison to conventional bench-scale methods. LoCs have been applied to diverse domains such as proteomics, genomics, biochemistry, virology, cell biology, and chemical synthesis. However, to date LoCs have been designed as application-specific chips, which incurs significant design effort, turn-around time, and cost, and degrades designer and user productivity. To address these limitations, we envision a programmable LoC (PLoC) and propose a comprehensive fluidic instruction set, called the AquaCore Instruction Set (AIS), and a fluidic microarchitecture, called AquaCore, to implement AIS. We present four key design aspects in which the AIS and AquaCore differ from their computer counterparts, and the design decisions we made on the basis of the implications of these differences. We demonstrate the use of the PLoC in a range of domains by hand-compiling real-world microfluidic assays in AIS, and show a detailed breakdown of the execution times for the assays and an estimate of the chip area.
|
|
|
SESSION: Memory consistency |
| |
M. Hill
|
|
|
|
|
Mechanisms for store-wait-free multiprocessors |
| |
Thomas F. Wenisch,
Anastasia Ailamaki,
Babak Falsafi,
Andreas Moshovos
|
|
Pages: 266-277 |
|
doi>10.1145/1250662.1250696 |
|
Full text: PDF
|
|
Store misses cause significant delays in shared-memory multiprocessors because of limited store buffering and ordering constraints required for proper synchronization. Today, programmers must choose from a spectrum of memory consistency models that reduce store stalls at the cost of increased programming complexity. Prior research suggests that the performance gap among consistency models can be closed through speculation--enforcing order only when dynamically necessary. Unfortunately, past designs either provide insufficient buffering, replace all stores with read-modify-write operations, and/or recover from ordering violations via impractical fine-grained rollback mechanisms. We propose two mechanisms that, together, enable store-wait-free implementations of any memory consistency model. To eliminate buffer-capacity-related stalls, we propose the scalable store buffer, which places private/speculative values directly into the L1 cache, thereby eliminating the non-scalable associative search of conventional store buffers. To eliminate ordering-related stalls, we propose atomic sequence ordering, which enforces ordering constraints over coarse-grain access sequences while relaxing order among individual accesses. Using cycle-accurate full-system simulation of scientific and commercial applications, we demonstrate that these mechanisms allow the simplified programming of strict ordering while outperforming conventional implementations on average by 32% (sequential consistency), 22% (SPARC total store order) and 9% (SPARC relaxed memory order).
|
|
|
BulkSC: bulk enforcement of sequential consistency |
| |
Luis Ceze,
James Tuck,
Pablo Montesinos,
Josep Torrellas
|
|
Pages: 278-289 |
|
doi>10.1145/1250662.1250697 |
|
Full text: PDF
|
|
While Sequential Consistency (SC) is the most intuitive memory consistency model and the one most programmers likely assume, current multiprocessors do not support it. Instead, they support more relaxed models that deliver high performance. SC implementations are considered either too slow or -- when they can match the performance of relaxed models -- too difficult to implement. In this paper, we propose Bulk Enforcement of SC (BulkSC), a novel way of providing SC that is simple to implement and offers performance comparable to Release Consistency (RC). The idea is to dynamically group sets of consecutive instructions into chunks that appear to execute atomically and in isolation. The hardware enforces SC at the coarse grain of chunks which, to the program, appears as providing SC at the individual memory access level. BulkSC keeps the implementation simple by largely decoupling memory consistency enforcement from processor structures. Moreover, it delivers high performance by enabling full memory access reordering and overlapping within chunks and across chunks. We describe a complete system architecture that supports BulkSC and show that it delivers performance comparable to RC.
|
|
|
SESSION: Power and thermal |
| |
P. Ranganathan
|
|
|
|
|
Limiting the power consumption of main memory |
| |
Bruno Diniz,
Dorgival Guedes,
Wagner Meira, Jr.,
Ricardo Bianchini
|
|
Pages: 290-301 |
|
doi>10.1145/1250662.1250699 |
|
Full text: PDF
|
|
The peak power consumption of hardware components affects their power supply, packaging, and cooling requirements. When the peak power consumption is high, the hardware components or the systems that use them can become expensive and bulky. Given that components and systems rarely (if ever) actually require peak power, it is highly desirable to limit power consumption to a less-than-peak power budget, based on which power supply, packaging, and cooling infrastructure can be more intelligently provisioned. In this paper, we study dynamic approaches for limiting the power consumption of main memories. Specifically, we propose four techniques that limit consumption by adjusting the power states of the memory devices, as a function of the load on the memory subsystem. Our simulations of applications from three benchmarks demonstrate that our techniques can consistently limit power to a pre-established budget. Two of the techniques can limit power with very low performance degradation. Our results also show that, when using these superior techniques, limiting power is at least as effective an energy-conservation approach as state-of-the-art techniques explicitly designed for performance-aware energy conservation. These latter results represent a departure from current energy management research and practice.
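The core idea, adjusting device power states as a function of load to stay under a budget, can be sketched as a simple greedy policy. This is an illustrative analogue, not one of the paper's four techniques; the state table, device names, and milliwatt figures are invented for the example.

```python
# Illustrative DRAM device power states: (name, mW per device, relative perf).
STATES = [("active", 300, 1.0), ("standby", 180, 0.95), ("nap", 30, 0.5)]

def limit_power(device_loads, budget_mw):
    """Greedy sketch of budget enforcement: demote the least-loaded
    memory devices to lower-power states until total power fits the
    pre-established budget, minimizing impact on hot devices."""
    level = {d: 0 for d in device_loads}            # everyone starts active
    def total(): return sum(STATES[s][1] for s in level.values())
    for dev in sorted(device_loads, key=device_loads.get):  # coldest first
        while total() > budget_mw and level[dev] < len(STATES) - 1:
            level[dev] += 1
    return {d: STATES[s][0] for d, s in level.items()}

# Four devices at 300 mW each (1200 mW) must fit a 900 mW budget.
plan = limit_power({"d0": 0.9, "d1": 0.1, "d2": 0.5, "d3": 0.05}, budget_mw=900)
```

Demoting by load keeps the busiest devices responsive, which is the intuition behind limiting power with low performance degradation.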
|
|
|
Power model validation through thermal measurements |
| |
Francisco Javier Mesa-Martinez,
Joseph Nayfach-Battilana,
Jose Renau
|
|
Pages: 302-311 |
|
doi>10.1145/1250662.1250700 |
|
Full text: PDF
|
|
Simulation environments are an indispensable tool in the design, prototyping, performance evaluation, and analysis of computer systems. A simulator must be able to faithfully reflect the behavior of the system being analyzed; to ensure its accuracy, it must be verified against empirical data. Modern processors provide enough performance counters to validate the majority of performance models; nevertheless, the information provided is not enough to validate power and thermal models. To address some of the difficulties associated with the validation of power and thermal models, this paper proposes an infrared measurement setup to capture the run-time power consumption and thermal characteristics of modern chips. We use infrared cameras with high spatial resolution (10x10μm) and high frame rate (125fps) to capture thermal maps. To generate a detailed power breakdown (leakage and dynamic) for each processor floorplan unit, we employ genetic algorithms. The genetic algorithm finds a power equation for each floorplan block that produces the measured temperature for a given thermal package. For the AMD Athlon analyzed in this paper, the difference between the predicted power and the externally measured power consumption is less than 1%. As an example of applicability, we compare the obtained measurements with CACTI power models, and propose extensions to existing thermal models to increase accuracy.
|
|
|
Thermal modeling and management of DRAM memory systems |
| |
Jiang Lin,
Hongzhong Zheng,
Zhichun Zhu,
Howard David,
Zhao Zhang
|
|
Pages: 312-322 |
|
doi>10.1145/1250662.1250701 |
|
Full text: PDF
|
|
With increasing speed and power density, high-performance memories, including FB-DIMM (Fully Buffered DIMM) and DDR2 DRAM, now begin to require dynamic thermal management (DTM), as processors and hard drives did. The DTM of memories, nevertheless, is different in that it should take processor performance and power consumption into consideration; existing schemes have ignored that. In this study, we investigate a new approach that controls memory thermal issues from the source generating memory activities - the processor. It smooths program execution when compared with shutting down memory abruptly, and therefore improves overall system performance and power efficiency. For multicore systems, we propose two schemes called adaptive core gating and coordinated DVFS. The first scheme activates clock gating on selected processor cores, and the second scales down the frequency and voltage levels of processor cores when the memory is about to overheat. They can successfully control memory activities and handle thermal emergencies. More importantly, they improve performance significantly under the given thermal envelope. Our simulation results show that adaptive core gating improves performance by up to 23.3% (16.3% on average) on a four-core system with FB-DIMM when compared with DRAM thermal shutdown; and coordinated DVFS with control-theoretic methods improves performance by up to 18.5% (8.3% on average).
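The coordinated-DVFS idea of throttling the traffic source rather than the memory itself can be sketched as a tiny control step. All constants, level tables, and thresholds here are illustrative placeholders, not the paper's control-theoretic design.

```python
# Illustrative per-core frequency levels (GHz), fastest first.
FREQ_LEVELS = [3.0, 2.4, 1.8, 1.2]

def dvfs_step(level, dram_temp_c, limit_c=85.0, margin_c=3.0):
    """One control step of coordinated DVFS: when DRAM nears its thermal
    limit, slow the cores that generate the memory traffic; once the
    temperature falls back under a margin, restore frequency."""
    if dram_temp_c >= limit_c and level < len(FREQ_LEVELS) - 1:
        return level + 1          # throttle the source of the traffic
    if dram_temp_c < limit_c - margin_c and level > 0:
        return level - 1          # thermal headroom: speed back up
    return level                  # hysteresis band: hold steady
```

Compared with abruptly shutting the memory down, stepping core frequency keeps the program running, which is the smoother execution the abstract describes.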
|
|
|
SESSION: Clocks, scheduling, and stores |
| |
T. Austin
|
|
|
|
|
ReCycle: pipeline adaptation to tolerate process variation |
| |
Abhishek Tiwari,
Smruti R. Sarangi,
Josep Torrellas
|
|
Pages: 323-334 |
|
doi>10.1145/1250662.1250703 |
|
Full text: PDF
|
|
Process variation affects processor pipelines by making some stages slower and others faster, thereby exacerbating pipeline unbalance. This reduces the frequency attainable by the pipeline. To improve performance, this paper proposes ReCycle, an architectural framework that comprehensively applies cycle time stealing to the pipeline - transferring the time slack of the faster stages to the slow ones by skewing clock arrival times to latching elements after fabrication. As a result, the pipeline can be clocked with a period equal to the average stage delay rather than the longest one. In addition, ReCycle's frequency gains are enhanced with Donor stages, which are empty stages added to "donate" slack to the slow stages. Finally, ReCycle can also convert slack into power reductions. For a 17FO4 pipeline, ReCycle increases the frequency by 12% and the application performance by 9% on average. Combining ReCycle and donor stages delivers improvements of 36% in frequency and 15% in performance on average, completely reclaiming the performance losses due to variation.
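The headline claim, clocking at the average stage delay instead of the slowest stage's delay, reduces to simple arithmetic. A minimal sketch (the stage delays below are made-up example values, not measurements from the paper):

```python
def recycle_frequency_gain(stage_delays_ns):
    """With post-fabrication clock skewing (cycle time stealing), the
    clock period can approach the mean stage delay instead of being set
    by the slowest stage. Returns the fractional frequency gain."""
    baseline_period = max(stage_delays_ns)                       # slowest stage
    recycled_period = sum(stage_delays_ns) / len(stage_delays_ns)  # mean stage
    return baseline_period / recycled_period - 1.0

# A variation-skewed pipeline: one slow 1.2 ns stage limits the baseline
# clock, while the mean stage delay is only 1.0 ns.
gain = recycle_frequency_gain([0.8, 1.0, 0.9, 1.2, 1.1])
```

Donor stages push this further: adding an empty stage raises the total latency slightly but lets the slow stage borrow even more slack, which is where the combined 36% frequency improvement comes from.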
|
|
|
Matrix scheduler reloaded |
| |
Peter G. Sassone,
Jeff Rupley, II,
Edward Brekelbaum,
Gabriel H. Loh,
Bryan Black
|
|
Pages: 335-346 |
|
doi>10.1145/1250662.1250704 |
|
Full text: PDF
|
|
From multiprocessor scale-up to cache sizes to the number of reorder-buffer entries, microarchitects wish to reap the benefits of more computing resources while staying within power and latency bounds. This tension is quite evident in schedulers, which need to be large and single-cycle for maximum performance on out-of-order cores. In this work we present two straightforward modifications to a matrix scheduler implementation which greatly strengthen its scalability. Both are based on the simple observation that the wakeup and picker matrices are sparse, even at small sizes; thus small indirection tables can be used to greatly reduce their width and latency. This technique can be used to create quicker iso-performance schedulers (17-58% reduced critical path) or larger iso-timing schedulers (7-26% IPC increase). Importantly, the power and area requirements of the additional hardware are likely offset by the greatly reduced matrix sizes and subsuming the functionality of the power-hungry allocation CAMs.
|
|
|
Late-binding: enabling unordered load-store queues |
| |
Simha Sethumadhavan,
Franziska Roesner,
Joel S. Emer,
Doug Burger,
Stephen W. Keckler
|
|
Pages: 347-357 |
|
doi>10.1145/1250662.1250705 |
|
Full text: PDF
|
|
Conventional load/store queues (LSQs) are an impediment to both power-efficient execution in superscalar processors and scaling to large-window designs. In this paper, we propose techniques to improve the area and power efficiency of LSQs by allocating entries when instructions issue ("late binding"), rather than when they are dispatched. This approach enables lower occupancy and thus smaller LSQs. Efficient implementations of late-binding LSQs, however, require the entries in the LSQ to be unordered with respect to age. In this paper, we show how to provide full LSQ functionality in an unordered design with only small additional complexity and negligible performance losses. We show that late-binding, unordered LSQs work well for small-window superscalar processors, but can also be scaled effectively to large, kilo-window processors by breaking the LSQs into address-interleaved banks. To handle the increased overflows, we apply classic network flow control techniques to the processor micronetworks, enabling low-overhead recovery mechanisms from bank overflows. We evaluate three such mechanisms: instruction replay, skid buffers, and virtual-channel buffering in the on-chip memory network. We show that for an 80-instruction window, the LSQ can be reduced to 32 entries. For a 1024-instruction window, the unordered, late-binding LSQ works well with four banks of 48 entries each. By applying a Bloom filter as well, this design achieves full hardware memory disambiguation for a 1,024 instruction window while requiring low average power per load and store access of 8 and 12 CAM entries, respectively.
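The Bloom filter's role here is that a "definitely not present" answer lets the processor skip the power-hungry associative (CAM) search entirely; only "maybe present" answers pay for it. A minimal Python sketch of such a filter (sizes and hash choice are illustrative, not the hardware design, which would use simple address hashes rather than SHA-256):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter of the kind used to filter LSQ searches:
    membership test with possible false positives but no false
    negatives, so a negative answer safely skips the CAM search."""
    def __init__(self, bits=256, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.bitmap = 0

    def _positions(self, addr):
        # Derive `hashes` independent bit positions from the address.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.bits

    def add(self, addr):
        for p in self._positions(addr):
            self.bitmap |= 1 << p

    def maybe_contains(self, addr):
        return all(self.bitmap >> p & 1 for p in self._positions(addr))

lsq_filter = BloomFilter()
for addr in (0x1000, 0x2040, 0x3F80):   # addresses of in-flight stores
    lsq_filter.add(addr)
```

Since conflicting in-flight addresses are rare in practice, most loads get a negative answer and never touch the CAM, which is how the average searched-entry counts stay low.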
|
|
|
SESSION: Memory and caches |
| |
L. Barroso
|
|
|
|
|
Comparing memory systems for chip multiprocessors |
| |
Jacob Leverich,
Hideho Arakida,
Alex Solomatnikov,
Amin Firoozshahian,
Mark Horowitz,
Christos Kozyrakis
|
|
Pages: 358-368 |
|
doi>10.1145/1250662.1250707 |
|
Full text: PDF
|
|
There are two basic models for the on-chip memory in CMP systems: hardware-managed coherent caches and software-managed streaming memory. This paper performs a direct comparison of the two models under the same set of assumptions about technology, area, and computational capabilities. The goal is to quantify how and when they differ in terms of performance, energy consumption, bandwidth requirements, and latency tolerance for general-purpose CMPs. We demonstrate that for data-parallel applications, the cache-based and streaming models perform and scale equally well. For certain applications with little data reuse, streaming scales better due to better bandwidth use and macroscopic software prefetching. However, the introduction of techniques such as hardware prefetching and non-allocating stores to the cache-based model eliminates the streaming advantage. Overall, our results indicate that there is not sufficient advantage in building streaming memory systems where all on-chip memory structures are explicitly managed. On the other hand, we show that streaming at the programming model level is particularly beneficial, even with the cache-based model, as it enhances locality and creates opportunities for bandwidth optimizations. Moreover, we observe that stream programming is actually easier with the cache-based model because the hardware guarantees correct, best-effort execution even when the programmer cannot fully regularize an application's code.
|
|
|
Interconnect design considerations for large NUCA caches |
| |
Naveen Muralimanohar,
Rajeev Balasubramonian
|
|
Pages: 369-380 |
|
doi>10.1145/1250662.1250708 |
|
Full text: PDF
|
|
The ever-increasing sizes of on-chip caches and the growing domination of wire delay necessitate significant changes to cache hierarchy design methodologies. Many recent proposals advocate splitting the cache into a large number of banks and employing a network-on-chip (NoC) to allow fast access to nearby banks (referred to as Non-Uniform Cache Architectures--NUCA). Most studies on NUCA organizations have assumed a generic NoC and focused on logical policies for cache block placement, movement, and search. Since wire/router delay and power are major limiting factors in modern processors, this work focuses on interconnect design and its influence on NUCA performance and power. We extend the widely-used CACTI cache modeling tool to take network design parameters into account. With these overheads appropriately accounted for, the optimal cache organization is typically very different from that assumed in prior NUCA studies. To alleviate the interconnect delay bottleneck, we propose novel cache access optimizations that introduce heterogeneity within the inter-bank network. The careful consideration of interconnect choices for a large cache results in a 51% performance improvement over a baseline generic NoC, and the introduction of heterogeneity within the network yields an additional 11-15% performance improvement.
|
|
|
Adaptive insertion policies for high performance caching |
| |
Moinuddin K. Qureshi,
Aamer Jaleel,
Yale N. Patt,
Simon C. Steely,
Joel Emer
|
|
Pages: 381-391 |
|
doi>10.1145/1250662.1250709 |
|
Full text: PDF
|
|
The commonly used LRU replacement policy is susceptible to thrashing for memory-intensive workloads that have a working set greater than the available cache size. For such applications, the majority of lines traverse from the MRU position to the LRU position without receiving any cache hits, resulting in inefficient use of cache space. Cache performance can be improved if some fraction of the working set is retained in the cache so that at least that fraction of the working set can contribute to cache hits. We show that simple changes to the insertion policy can significantly reduce cache misses for memory-intensive workloads. We propose the LRU Insertion Policy (LIP) which places the incoming line in the LRU position instead of the MRU position. LIP protects the cache from thrashing and results in close to optimal hit rate for applications that have a cyclic reference pattern. We also propose the Bimodal Insertion Policy (BIP) as an enhancement of LIP that adapts to changes in the working set while maintaining the thrashing protection of LIP. We finally propose a Dynamic Insertion Policy (DIP) to choose between BIP and the traditional LRU policy depending on which policy incurs fewer misses. The proposed insertion policies do not require any change to the existing cache structure, are trivial to implement, and have a storage requirement of less than two bytes. We show that DIP reduces the average MPKI of the baseline 1MB 16-way L2 cache by 21%, bridging two-thirds of the gap between LRU and OPT.
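LIP's thrash protection is easy to see on a single cache set. The sketch below (an illustrative simulation, not the paper's evaluation setup) replays a cyclic reference pattern of 5 lines against a 4-way set: LRU insertion misses on every reference, while LIP retains 3 of the 5 lines and keeps hitting on them.

```python
def simulate(refs, ways, insert_at_mru):
    """Count hits in one cache set. insert_at_mru=True is conventional
    LRU insertion; False is LIP (insert new lines at the LRU position,
    promoting a line to MRU only when it receives a hit)."""
    stack, hits = [], 0                    # index 0 = MRU, last = LRU
    for line in refs:
        if line in stack:
            hits += 1
            stack.remove(line)
            stack.insert(0, line)          # promote to MRU on hit
        else:
            if len(stack) == ways:
                stack.pop()                # evict the LRU line
            if insert_at_mru:
                stack.insert(0, line)      # LRU: newcomer goes to MRU
            else:
                stack.append(line)         # LIP: newcomer lands at LRU
    return hits

# Working set of 5 lines cycling through a 4-way set, 20 times over.
refs = list("abcde") * 20
lru_hits = simulate(refs, 4, insert_at_mru=True)
lip_hits = simulate(refs, 4, insert_at_mru=False)
```

Under LRU every reference evicts the line that will be needed soonest; under LIP the newcomer must earn a hit before displacing the retained fraction of the working set, which is exactly the behavior the abstract describes for cyclic patterns.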

SESSION: Experience and methodology
Session chair: J. Emer

Performance and security lessons learned from virtualizing the Alpha processor
Paul A. Karger
Pages: 392-401
doi>10.1145/1250662.1250711

Virtualization has become much more important throughout the computer industry, both to improve security and to support multiple workloads on the same hardware with effective isolation between those workloads. The most widely used chip architectures, the Intel and AMD x86 processors, have begun to support virtualization, but the initial implementations show some limitations. This paper examines the virtualization properties of the Alpha architecture with particular emphasis on features that improve performance and security. It shows how the Alpha's features of PALcode, address space numbers, software handling of translation buffer misses, lack of used and modified bits, and secure handling of unpredictable results all contribute to making virtualization of the Alpha particularly easy. The paper then compares the virtual architecture of the Alpha with Intel's and AMD's virtualization approaches for x86. It also comments briefly on Intel's virtualization technology for Itanium, IBM's zSeries and pSeries hypervisors, and Sun's UltraSPARC virtualization. It particularly identifies some differences between translation buffers on x86 and translation buffers on VAX and Alpha that can have adverse performance consequences.

Automated design of application specific superscalar processors: an analytical approach
Tejas S. Karkhanis, James E. Smith
Pages: 402-411
doi>10.1145/1250662.1250712

Analytical modeling is applied to the automated design of application-specific superscalar processors. Using an analytical method bridges the gap between the size of the design space and the time required for detailed cycle-accurate simulations. The proposed design framework takes as inputs the design targets (upper bounds on execution time, area, and energy), design alternatives, and one or more application programs. The output is the set of out-of-order superscalar processors that are Pareto-optimal with respect to performance-energy-area. The core of the new design framework is made up of analytical performance and energy activity models, and an analytical model-based design optimization process. For a set of benchmark programs and a design space of 2000 designs, the design framework arrives at all performance-energy-area Pareto-optimal design points within 16 minutes on a 2 GHz Pentium 4. In contrast, it is estimated that a naïve cycle-accurate simulation-based exhaustive search would require at least two months to arrive at the Pareto-optimal design points for the same design space.
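The final selection step the abstract describes — keeping only the Pareto-optimal designs over performance, energy, and area — can be sketched directly. The design tuples below are invented placeholders; in the framework they would come from the analytical models:

```python
# Each design is a (time, energy, area) tuple; lower is better on every axis.

def dominates(a, b):
    """a dominates b if a is no worse on all axes and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(designs):
    """Keep exactly the designs not dominated by any other design."""
    return [d for d in designs
            if not any(dominates(e, d) for e in designs)]

# Hypothetical candidate designs (units arbitrary, for illustration only).
candidates = [(10, 5, 3), (8, 6, 3), (9, 9, 9), (12, 4, 2)]
front = pareto_front(candidates)
assert (9, 9, 9) not in front   # dominated by (8, 6, 3) on all three axes
```

The quadratic scan above is fine at the paper's scale of a few thousand designs; larger spaces would call for a sort-based front construction.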

Analysis of redundancy and application balance in the SPEC CPU2006 benchmark suite
Aashish Phansalkar, Ajay Joshi, Lizy K. John
Pages: 412-423
doi>10.1145/1250662.1250713

The recently released SPEC CPU2006 benchmark suite is expected to be used by computer designers and computer architecture researchers for pre-silicon early design analysis. Researchers are likely to use only part of the suite, due to simulation time constraints, compiler difficulties, or library and system call issues, but a random subset can lead to misleading results. This paper analyzes the SPEC CPU2006 benchmarks using performance counter based experimentation on several state-of-the-art systems, and uses statistical techniques such as principal component analysis and clustering to draw inferences on the similarity of the benchmarks and the redundancy in the suite, and to arrive at meaningful subsets. The SPEC CPU2006 suite contains several programs from areas such as artificial intelligence but none from the electronic design automation (EDA) application area, raising a concern about the application balance of the suite. An analysis from the perspective of fundamental program characteristics shows that the included programs offer characteristics broader than the EDA programs' space. A subset of 6 integer programs and 8 floating point programs can yield most of the information from the entire suite.

SESSION: Control independence and prediction
Session chair: C. Zilles

VPC prediction: reducing the cost of indirect branches via hardware-based dynamic devirtualization
Hyesoon Kim, José A. Joao, Onur Mutlu, Chang Joo Lee, Yale N. Patt, Robert Cohn
Pages: 424-435
doi>10.1145/1250662.1250715

Indirect branches have become increasingly common in modular programs written in modern object-oriented languages and virtual machine based runtime systems. Unfortunately, the prediction accuracy of indirect branches has not improved as much as that of conditional branches. Furthermore, previously proposed indirect branch predictors usually require a significant amount of extra hardware storage and complexity, which makes them less attractive to implement. This paper proposes a new technique for handling indirect branches, called Virtual Program Counter (VPC) prediction. The key idea of VPC prediction is to treat a single indirect branch as multiple virtual conditional branches in hardware for prediction purposes. Our technique predicts each of the virtual conditional branches using the existing conditional branch prediction hardware. Thus, no separate storage structure is required for predicting indirect branch targets. Our evaluation shows that VPC prediction improves average performance by 26.7% compared to a commonly used branch target buffer based predictor on 12 indirect branch intensive applications. VPC prediction achieves the performance improvement provided by at least a 12KB (and usually a 192KB) tagged target cache predictor on half of the examined applications. We show that VPC prediction can be used with any existing conditional branch prediction mechanism and that the accuracy of VPC prediction improves when a more accurate conditional branch predictor is used.
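The core trick — treating one indirect branch as a sequence of virtual conditional branches, each with its own BTB entry and conditional prediction — can be sketched in a few lines. Everything below (the hash, the iteration limit, the 2-bit counters, the training rule) is a simplified illustration, not the microarchitecture the paper specifies:

```python
# Toy sketch of VPC prediction. All parameters are assumptions.

MAX_ITER = 4  # max virtual conditional branches tried per indirect branch

class VPCPredictor:
    def __init__(self):
        self.btb = {}    # virtual PC -> stored target address
        self.cond = {}   # virtual PC -> 2-bit saturating counter (0..3)

    def _vpc(self, pc, i):
        # Hypothetical hash combining the real PC with the iteration number.
        return pc ^ ((0x9E3779B9 * (i + 1)) & 0xFFFF)

    def predict(self, pc):
        # Query the conditional predictor once per virtual branch; the
        # first one predicted "taken" (counter >= 2) supplies the target.
        for i in range(MAX_ITER):
            v = self._vpc(pc, i)
            if v in self.btb and self.cond.get(v, 0) >= 2:
                return self.btb[v]
        return None  # no prediction: treat as a BTB miss

    def train(self, pc, target):
        # Strengthen the virtual branch that held the correct target,
        # or install the target in the first free virtual slot.
        for i in range(MAX_ITER):
            v = self._vpc(pc, i)
            if self.btb.get(v) == target:
                self.cond[v] = min(3, self.cond.get(v, 0) + 1)
                return
        for i in range(MAX_ITER):
            v = self._vpc(pc, i)
            if v not in self.btb or self.cond.get(v, 0) == 0:
                self.btb[v], self.cond[v] = target, 2
                return

p = VPCPredictor()
p.train(0x400, 0x1000)            # observe the indirect branch jump to 0x1000
assert p.predict(0x400) == 0x1000
```

The point of the real design is that `self.cond` is not a separate table at all — the virtual branches reuse the existing conditional branch predictor, which this sketch only gestures at.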

Ginger: control independence using tag rewriting
Andrew D. Hilton, Amir Roth
Pages: 436-447
doi>10.1145/1250662.1250716

The negative performance impact of branch mispredictions can be reduced by exploiting control independence (CI). When a branch mispredicts, the wrong-path instructions up to the point where control converges with the correct path are selectively squashed and replaced with correct-path instructions. Instructions beyond the convergence point (the branch's control-independent (CI) instructions) are spared from squashing. Exploiting CI requires updating the input data dependences of CI instructions to reflect the selective removal and insertion of logically older instructions, and transitively re-dispatching those CI instructions whose inputs have changed. This capability is generally called out-of-order renaming. Previously proposed CI designs use out-of-order renaming schemes that either consume excessive rename/dispatch bandwidth, can only be applied in limited cases, or incur a cost even when the branch would have been correctly predicted. Ginger is a CI design that is both general and bandwidth-efficient. Ginger implements out-of-order renaming using tag rewriting, re-linking the input dependences of CI instructions as they sit in the window. To do this, Ginger halts the pipeline and uses the idle map table read and write ports and the issue queue match lines and write lines to perform a register-tag "search-and-replace" operation. After a few cycles, the pipeline restarts and execution resumes with correct data dependences. Cycle-level simulation shows that Ginger outperforms previous CI designs, yielding geometric mean speedups over an aggressive non-CI processor of 5%, 12%, and 11% on SPECint2000, MediaBench, and CommBench, respectively, with speedups of 15% or greater on 11 of 46 programs.

Transparent control independence (TCI)
Ahmed S. Al-Zawawi, Vimal K. Reddy, Eric Rotenberg, Haitham H. Akkary
Pages: 448-459
doi>10.1145/1250662.1250717

Superscalar architectures have been proposed that exploit control independence, reducing the performance penalty of branch mispredictions by preserving the work of future misprediction-independent instructions. The essential goal of exploiting control independence is to completely decouple future misprediction-independent instructions from deferred misprediction-dependent instructions. Current implementations fall short of this goal because they explicitly maintain program order among misprediction-independent and misprediction-dependent instructions. Explicit approaches sacrifice design efficiency and ultimately performance. We observe that it is sufficient to emulate program order. Potential misprediction-dependent instructions are singled out a priori and their unchanging source values are checkpointed. These instructions and values are set aside as a "recovery program". Checkpointed source values break the data dependencies with co-mingled misprediction-independent instructions (now long since gone from the pipeline), achieving the essential decoupling objective. When the mispredicted branch resolves, recovery is achieved by fetching the self-sufficient, condensed recovery program. Recovery is effectively transparent to the pipeline, in that speculative state is not rolled back and recovery appears as a jump to code. A coarse-grain retirement substrate permits the relaxed order between the decoupled programs. Transparent control independence (TCI) yields a highly streamlined pipeline that quickly recycles resources based on conventional speculation, enabling a large window with small cycle-critical resources, and prevents many mispredictions from disrupting this large window. TCI achieves speedups as high as 64% (16% on average) and 88% (22% on average) for 4-issue and 8-issue pipelines, respectively, on 15 SPEC integer benchmarks. Factors that limit the performance of explicitly ordered approaches are quantified.

SESSION: Faults
Session chair: J. Torrellas

Examining ACE analysis reliability estimates using fault-injection
Nicholas J. Wang, Aqeel Mahesri, Sanjay J. Patel
Pages: 460-469
doi>10.1145/1250662.1250719

ACE analysis is a technique to provide an early reliability estimate for microprocessors. ACE analysis couples data from abstract performance models with low level design details to identify and rule out transient faults that will not cause incorrect execution. While many transient faults are analyzable in ACE analysis frameworks, some are not. As a result, ACE analysis is conservative and provides a lower bound for the reliability of a processor design. Bounding the reliability of a design is useful since it can guarantee that the given design will meet reliability goals. In this work, we quantify and identify the sources of ACE analysis conservatism by comparing an ACE analysis methodology against a rigorous fault-injection study. We evaluate two flavors of ACE analysis: a "simple" analysis and a refined analysis, finding that even the refined analysis overestimates the soft error vulnerability of an instruction scheduler by 2-3x. The conservatism stems from two key sources: from lack of detail in abstract performance models and from what we term Y-Bits, a result of the single-pass simulation methodology that is typical of ACE analysis. We also examine the efficacy of applying ACE analysis to a class of "partial coverage" error mitigation techniques. In particular, we perform a case study on one such technique and extrapolate our findings to others.

Configurable isolation: building high availability systems with commodity multi-core processors
Nidhi Aggarwal, Parthasarathy Ranganathan, Norman P. Jouppi, James E. Smith
Pages: 470-481
doi>10.1145/1250662.1250720

High availability is an increasingly important requirement for enterprise systems, often valued more than performance. Systems designed for high availability typically use redundant hardware for error detection and continued uptime in the event of a failure. Chip multiprocessors with an abundance of identical resources such as cores, caches, and interconnection networks would appear to be ideal building blocks for implementing high availability solutions on chip. However, doing so poses significant challenges with respect to error containment and faulty component replacement. Increasing rates of silicon defects and transient faults with future technology scaling exacerbate the problem. This paper proposes a novel, cost-effective architecture for high availability systems built from future multi-core processors. We propose a new chip multiprocessor architecture that provides configurable isolation for fault containment and component retirement, based upon cost-effective modifications to commodity designs. The design is evaluated for a state-of-the-art industrial fault model, and the proposed architecture is shown to provide effective fault isolation and graceful degradation even when the failure rate is high.

SESSION: Security
Session chair: G. Reinman

Raksha: a flexible information flow architecture for software security
Michael Dalton, Hari Kannan, Christos Kozyrakis
Pages: 482-493
doi>10.1145/1250662.1250722

High-level semantic vulnerabilities such as SQL injection and cross-site scripting have surpassed buffer overflows as the most prevalent security exploits. The breadth and diversity of software vulnerabilities demand new security solutions that combine the speed and practicality of hardware approaches with the flexibility and robustness of software systems. This paper proposes Raksha, an architecture for software security based on dynamic information flow tracking (DIFT). Raksha provides three novel features that allow for a flexible hardware/software approach to security. First, it supports flexible and programmable security policies that enable software to direct hardware analysis towards a wide range of high-level and low-level attacks. Second, it supports multiple active security policies that can protect the system against concurrent attacks. Third, it supports low-overhead security handlers that allow software to correct, complement, or extend the hardware-based analysis without the overhead associated with operating system traps. We present an FPGA prototype for Raksha that provides a full-featured Linux workstation for security analysis. Using unmodified binaries for real-world applications, we demonstrate that Raksha can detect high-level attacks such as directory traversal, command injection, SQL injection, and cross-site scripting as well as low-level attacks such as buffer overflows. We also show that low-overhead exception handling is critical for analyses such as memory corruption protection in order to address false positives that occur due to the diverse code patterns in frequently used software.

New cache designs for thwarting software cache-based side channel attacks
Zhenghong Wang, Ruby B. Lee
Pages: 494-505
doi>10.1145/1250662.1250723

Software cache-based side channel attacks are a serious new class of threats for computers. Unlike physical side channel attacks that mostly target embedded cryptographic devices, cache-based side channel attacks can also undermine general purpose systems. The attacks are easy to perform, effective on most platforms, and do not require special instruments or excessive computation power. In recently demonstrated attacks on software implementations of ciphers like AES and RSA, the full key can be recovered by an unprivileged user program performing simple timing measurements based on cache misses. We first analyze these attacks, identifying cache interference as their root cause. We identify two basic mitigation approaches: the partition-based approach eliminates cache interference, whereas the randomization-based approach randomizes cache interference so that zero information can be inferred. We present new security-aware cache designs, the Partition-Locked cache (PLcache) and Random Permutation cache (RPcache), analyze and prove their security, and evaluate their performance. Our results show that our new cache designs with built-in security can defend against cache-based side channel attacks in general (rather than only specific attacks on a given cryptographic algorithm) with very little performance degradation and hardware cost.
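The randomization idea behind the RPcache can be illustrated with a per-context permutation of set indices. The table size, seeds, and mapping below are invented for illustration; the real design permutes indices in hardware and re-randomizes mappings dynamically:

```python
import random

NSETS = 8  # illustrative number of cache sets

def make_permutation(seed):
    """Per-context permutation table over set indices (a toy stand-in
    for the RPcache's hardware permutation state)."""
    rng = random.Random(seed)
    perm = list(range(NSETS))
    rng.shuffle(perm)
    return perm

def cache_set(addr, perm):
    # The permuted index replaces the usual addr % NSETS mapping, so an
    # attacker's observed set conflicts no longer reveal which physical
    # sets the victim's addresses map to.
    return perm[addr % NSETS]

victim_map = make_permutation(seed=1)
attacker_map = make_permutation(seed=2)

# The same address generally lands in different physical sets in the two
# contexts, decorrelating the attacker's measurements from the victim's
# access pattern.
collisions = sum(cache_set(a, victim_map) == cache_set(a, attacker_map)
                 for a in range(NSETS))
```

The PLcache takes the complementary, partition-based route: locking security-critical lines so interference cannot occur at all rather than randomizing it.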

SESSION: Vulnerabilities
Session chair: S. Adve

Mechanisms for bounding vulnerabilities of processor structures
Niranjan Kumar Soundararajan, Angshuman Parashar, Anand Sivasubramaniam
Pages: 506-515
doi>10.1145/1250662.1250725

Concern for the increasing susceptibility of processor structures to transient errors has led to several recent research efforts that propose architectural techniques to enhance reliability. However, real systems are typically required to satisfy hard reliability budgets, and barring expensive full-redundancy approaches, none of the proposed solutions treat any reliability budgets or bounds as hard constraints. Meeting vulnerability bounds requires monitoring vulnerabilities of processor structures and taking appropriate actions whenever these bounds are violated. This mandates treating reliability as a first-order microarchitecture design constraint, while optimizing performance as long as reliability requirements are satisfied. This paper makes three key contributions towards this goal: (i) we present a simple infrastructure to monitor and provide upper bounds on the vulnerabilities of key processor structures at cycle-level fidelity; (ii) we propose two distinct control mechanisms, throttling and selective redundancy, to proactively and/or reactively bound the vulnerabilities to any limit specified by the system designer; (iii) within this framework, we propose a novel adaptation of Out-of-Order Commit for vulnerability reduction, which automatically provides additional leverage for the control mechanisms to boost performance while remaining within the reliability budget.

Dynamic prediction of architectural vulnerability from microarchitectural state
Kristen R. Walcott, Greg Humphreys, Sudhanva Gurumurthi
Pages: 516-527
doi>10.1145/1250662.1250726

Transient faults due to particle strikes are a key challenge in microprocessor design. Driven by exponentially increasing transistor counts, per-chip faults are a growing burden. To protect against soft errors, redundancy techniques such as redundant multithreading (RMT) are often used. However, these techniques assume that the probability that a structural fault will result in a soft error (i.e., the Architectural Vulnerability Factor (AVF)) is 100 percent, unnecessarily draining processor resources. Due to the high cost of redundancy, there have been efforts to throttle RMT at runtime. To date, these methods have not incorporated an AVF model and therefore tend to be ad hoc. Unfortunately, computing the AVF of complex microprocessor structures (e.g., the issue queue) can be quite involved. To provide probabilistic guarantees about fault tolerance, we have created a rigorous characterization of AVF behavior that can be easily implemented in hardware. We experimentally demonstrate AVF variability within and across the SPEC2000 benchmarks and identify strong correlations between structural AVF values and a small set of processor metrics. Using these simple indicators as predictors, we create a proof-of-concept RMT implementation that demonstrates that AVF prediction can be used to maintain a low fault tolerance level without significant performance impact.
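The predictors the abstract describes boil down to fitting AVF against a few cheap-to-measure metrics. A least-squares sketch over fabricated (occupancy, AVF) samples shows the shape of such a predictor; the numbers are invented for illustration and carry no experimental meaning:

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

# Invented (structure occupancy, AVF) training samples -- illustration only.
occupancy = [0.1, 0.3, 0.5, 0.7, 0.9]
avf       = [0.05, 0.16, 0.24, 0.35, 0.44]
a, b = linear_fit(occupancy, avf)

def predict_avf(occ):
    """Runtime AVF estimate from a single occupancy reading; an RMT
    throttle could compare this against a fault-tolerance target."""
    return a + b * occ
```

In hardware, a fit like this reduces to one multiply-accumulate per metric, which is what makes AVF-guided throttling of RMT plausible at runtime.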