Token tenure and PATCH: A predictive/adaptive token-counting hybrid
Arun Raghavan, Colin Blundell, Milo M. K. Martin
Article No.: 6
DOI: 10.1145/1839667.1839668

Traditional coherence protocols present a set of difficult trade-offs: the reliance of snoopy protocols on broadcast and ordered interconnects limits their scalability, while directory protocols incur a performance penalty on sharing misses due to indirection. This work introduces Patch (Predictive/Adaptive Token-Counting Hybrid), a coherence protocol that provides the scalability of directory protocols while opportunistically sending direct requests to reduce sharing latency. Patch extends a standard directory protocol to track tokens and use token-counting rules for enforcing coherence permissions. Token counting allows Patch to support direct requests on an unordered interconnect, while a mechanism called token tenure provides broadcast-free forward progress using the directory protocol's per-block point of ordering at the home along with either timeouts at requesters or explicit race notification messages. Patch makes three main contributions. First, Patch introduces token tenure, which provides broadcast-free forward progress for token-counting protocols. Second, Patch deprioritizes best-effort direct requests to match or exceed the performance of directory protocols without restricting scalability. Finally, Patch provides greater scalability than directory protocols when using inexact encodings of sharers because only processors holding tokens need to acknowledge requests. Overall, Patch is a “one-size-fits-all” coherence protocol that dynamically adapts to work well for small systems, large systems, and anywhere in between.
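
To make the token-counting rules concrete, here is a minimal Python sketch of the invariant the abstract describes, together with a timeout-style rendering of token tenure. All names (TOKENS_PER_BLOCK, BlockState) and the exact timeout policy are illustrative assumptions, not the paper's implementation.

    # Hypothetical sketch of token-counting permission rules with a
    # timeout-based token tenure mechanism (assumed structure, not the
    # authors' code).

    TOKENS_PER_BLOCK = 64      # assumed: one token per processor

    class BlockState:
        def __init__(self):
            self.tokens = 0        # tokens currently held for this block
            self.tenured = False   # set once the home activates our request
            self.timeout = 0       # countdown while holding untenured tokens

        def can_read(self):
            return self.tokens >= 1                   # reads need one token

        def can_write(self):
            return self.tokens == TOKENS_PER_BLOCK    # writes need all tokens

        def tick(self, send_to_home):
            # Token tenure: tokens held without tenure must eventually be
            # yielded to the home node; the directory's per-block ordering
            # point then ensures some request collects them all, giving
            # forward progress without any broadcast.
            if self.tokens and not self.tenured:
                self.timeout -= 1
                if self.timeout <= 0:
                    send_to_home(self.tokens)
                    self.tokens = 0

The safety argument falls out of counting: a write needs all TOKENS_PER_BLOCK tokens and a read needs at least one, so no read can ever overlap a write.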

Automatic feedback-directed object fusing
Christian Wimmer, Hanspeter Mössenböck
Article No.: 7
DOI: 10.1145/1839667.1839669

Object fusing is an optimization that embeds certain referenced objects into their referencing object. The order of objects on the heap is changed in such a way that objects that are accessed together are placed next to each other in memory. Their offset is then fixed, that is, the objects are colocated, allowing field loads to be replaced by address arithmetic. Array fusing specifically optimizes arrays, which are frequently used for the implementation of dynamic data structures. Therefore, the length of arrays often varies, and fields referencing such arrays have to be changed. An efficient code pattern detects these changes and allows the optimized access of such fields. We integrated these optimizations into Sun Microsystems' Java HotSpot™ VM. The analysis is performed automatically at runtime, requires no actions on the part of the programmer, and supports dynamic class loading. To safely eliminate a field load, the colocation of the object that holds the field and the object that is referenced by the field must be guaranteed. Two preconditions must be satisfied: The objects must be allocated at the same time, and the field must not be overwritten later. These preconditions are checked by the just-in-time compiler to avoid an interprocedural data flow analysis. The garbage collector ensures that groups of colocated objects are not split by copying groups as a whole. The evaluation shows that the dynamic approach successfully identifies and optimizes frequently accessed fields for several benchmarks with a low compilation and analysis overhead. It leads to a speedup of up to 76% for simple benchmarks and up to 6% for complex workloads.
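
As a rough illustration of the two preconditions, consider this hypothetical Python sketch; Candidate, can_fuse, and load_fused_field are invented names for exposition, and the real checks run inside HotSpot's just-in-time compiler.

    # Hypothetical sketch of the colocation preconditions and the
    # resulting optimized field access (assumed structure).

    class Candidate:
        """A parent field that might be fused (illustrative only)."""
        def __init__(self, allocated_together, field_overwritten_later):
            self.allocated_together = allocated_together
            self.field_overwritten_later = field_overwritten_later

    def can_fuse(c):
        # Precondition 1: parent and child are allocated at the same time,
        # so the garbage collector can place them adjacently and, by
        # copying the group as a whole, keep them adjacent.
        # Precondition 2: the field is never overwritten later, so a
        # fixed offset stays valid for the parent's whole lifetime.
        return c.allocated_together and not c.field_overwritten_later

    def load_fused_field(parent_addr, fused_offset):
        # Once colocation is guaranteed, loading p.f no longer needs to
        # dereference a pointer: the child lives at a fixed displacement.
        return parent_addr + fused_offset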

Applied inference: Case studies in microarchitectural design
Benjamin C. Lee, David Brooks
Article No.: 8
DOI: 10.1145/1839667.1839670

We propose and apply a new simulation paradigm for microarchitectural design evaluation and optimization. This paradigm enables more comprehensive design studies by combining spatial sampling and statistical inference. Specifically, it (i) defines a large, comprehensive design space, (ii) samples points from the space for simulation, and (iii) constructs regression models based on sparse simulations. This approach greatly improves the computational efficiency of microarchitectural simulation and enables new capabilities in design space exploration. We illustrate these capabilities in three case studies for a large design space of approximately 260,000 points: (i) Pareto frontier, (ii) pipeline depth, and (iii) multiprocessor heterogeneity analyses. In particular, regression models are exhaustively evaluated to identify Pareto-optimal designs that maximize performance for given power budgets. These models enable pipeline depth studies in which all parameters vary simultaneously with depth, thereby more effectively revealing interactions with nondepth parameters. Heterogeneity analysis combines regression-based optimization with clustering heuristics to identify efficient design compromises between similar optimal architectures; these compromises are potential core designs in a heterogeneous multicore architecture. Increasing heterogeneity can improve bips³/W efficiency by as much as 2.4×, a theoretical upper bound on heterogeneity benefits that neglects contention for shared resources as well as design complexity. Collectively, these studies demonstrate regression models' ability to expose trends and identify optima in diverse design regions, motivating the application of such models in statistical inference for more effective use of modern simulator infrastructure.
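
The sample-then-infer flow can be sketched as follows. This toy version substitutes random sampling and scikit-learn's ordinary least squares for the paper's sampling strategy and spline-based regression models, and fake_simulate stands in for a real simulator run.

    # Toy sketch of sample-then-infer design space exploration
    # (assumed parameter ranges; not the paper's design space).
    import itertools, random
    import numpy as np
    from sklearn.linear_model import LinearRegression

    # (i) define a design space as the cross product of parameter settings
    space = np.array(list(itertools.product(range(1, 9),     # issue width
                                            range(12, 31),   # pipeline depth
                                            range(1, 17))))  # cache size (MB)

    # (ii) simulate only a sparse random sample of the space
    sample_idx = random.sample(range(len(space)), 200)
    def fake_simulate(cfg):       # placeholder for a real simulator
        w, d, c = cfg
        return w * c / d, w * d   # toy (performance, power) values
    perf, power = zip(*(fake_simulate(space[i]) for i in sample_idx))

    # (iii) fit regression models on the sparse simulations ...
    perf_model  = LinearRegression().fit(space[sample_idx], perf)
    power_model = LinearRegression().fit(space[sample_idx], power)

    # ... then evaluate them exhaustively: keep a design unless some
    # other design is predicted strictly better in both metrics
    # (a simplified Pareto test).
    p, pw = perf_model.predict(space), power_model.predict(space)
    pareto = [i for i in range(len(space))
              if not any(p[j] > p[i] and pw[j] < pw[i]
                         for j in range(len(space)))]

The point of the paradigm is step (iii): once the models are fit from 200 simulations, querying all 2,432 toy designs (or 260,000 real ones) costs only model evaluations, not simulator runs.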

Thread-management techniques to maximize efficiency in multicore and simultaneous multithreaded microprocessors
R. Rakvic, Q. Cai, J. González, G. Magklis, P. Chaparro, A. González
Article No.: 9
DOI: 10.1145/1839667.1839671

We provide an analysis of thread-management techniques that increase performance or reduce energy in multicore and Simultaneous Multithreaded (SMT) cores. Thread delaying reduces energy consumption by running the core containing the critical thread at maximum frequency while scaling down the frequency and voltage of the cores containing noncritical threads. In this article, we provide an insightful breakdown of thread delaying on a simulated multicore microprocessor. Thread balancing improves overall performance by giving higher priority to the critical thread in the issue queue of an SMT core. We provide a detailed breakdown of performance results for thread balancing, identifying performance benefits and limitations. For those benchmarks where a performance benefit is not possible, we introduce a novel thread-balancing mechanism on an SMT core that can reduce energy consumption. We have performed a detailed study on an Intel microprocessor simulator running parallel applications. Thread delaying can reduce energy consumption by 4% to 44% with negligible performance loss. Thread balancing can increase performance by 20% or can reduce energy consumption by 23%.
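
A back-of-the-envelope rendering of thread delaying in Python (our formulation, not the paper's algorithm; F_MAX and the proportional-slowdown policy are assumptions):

    # Hypothetical sketch of thread delaying: slow each noncritical core
    # so it reaches the next synchronization point just as the critical
    # thread does, saving energy via frequency/voltage scaling.

    F_MAX = 3.0e9                      # assumed maximum core frequency (Hz)

    def thread_delaying(cycles_to_barrier):
        # The critical thread has the most work left and keeps F_MAX;
        # every other core can run proportionally slower without
        # delaying the barrier.
        critical = max(cycles_to_barrier)
        return [F_MAX * c / critical for c in cycles_to_barrier]

    # Example: four cores, the second one is critical.
    print(thread_delaying([5.0e8, 8.0e8, 2.0e8, 6.0e8]))
    # -> approx. [1.875 GHz, 3.0 GHz, 0.75 GHz, 2.25 GHz] (as Hz floats)

Because the noncritical cores were going to idle at the barrier anyway, the slowdown trades otherwise-wasted slack for lower voltage and frequency, which is where the reported 4% to 44% energy savings come from.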

A memory-efficient pipelined implementation of the Aho-Corasick string-matching algorithm
Derek Pao, Wei Lin, Bin Liu
Article No.: 10
DOI: 10.1145/1839667.1839672

With the rapid advancement of Internet technology and usage, emerging applications in data communications and network security require matching huge volumes of data against large signature sets with thousands of strings in real time. In this article, we present a memory-efficient hardware implementation of the well-known Aho-Corasick (AC) string-matching algorithm using a pipelining approach called P-AC. An attractive feature of the AC algorithm is that it can solve the string-matching problem in time linearly proportional to the length of the input stream, and the computation time is independent of the number of strings in the signature set. A major disadvantage of the AC algorithm is the high memory cost required to store the transition rules of the underlying deterministic finite automaton. By incorporating pipelined processing, the state graph is reduced to a character trie that contains only forward edges. Together with an intelligent implementation of look-up tables, the memory cost of P-AC is only about 18 bits per character for a signature set containing 6,166 strings extracted from Snort. The control structure of P-AC is simple and elegant, and the cost of the control logic is very low. With the availability of dual-port memories in FPGA devices, we can double the system throughput by duplicating the control logic so that the system processes two data streams concurrently. Since our method is memory-based, incremental changes to the signature set can be accommodated by updating the look-up tables without reconfiguring the FPGA circuitry.
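
In software, the forward-edge-only idea can be sketched like this (a toy Python model, not the hardware design): the trie keeps only forward edges, and starting one traversal per input offset plays the role of the hardware pipeline, which is what makes the failure transitions of the classic automaton unnecessary.

    # Toy model of matching with a forward-edge-only character trie.
    # Each input offset gets its own traversal, standing in for a
    # pipeline slot in P-AC.

    def build_trie(patterns):
        trie = [{}]                # node -> {char: child node index}
        out = [set()]              # node -> patterns that end here
        for pat in patterns:
            node = 0
            for ch in pat:
                if ch not in trie[node]:
                    trie[node][ch] = len(trie)
                    trie.append({})
                    out.append(set())
                node = trie[node][ch]
            out[node].add(pat)
        return trie, out

    def match(trie, out, text):
        hits = []
        for start in range(len(text)):     # each offset = one pipeline slot
            node = 0
            for pos in range(start, len(text)):
                node = trie[node].get(text[pos])
                if node is None:
                    break                  # no forward edge: slot retires
                hits.extend((start, pat) for pat in out[node])
        return hits

    trie, out = build_trie(["he", "she", "his", "hers"])
    print(match(trie, out, "ushers"))
    # -> [(1, 'she'), (2, 'he'), (2, 'hers')]

Sequentially this costs O(n·m) in the worst case; the hardware recovers linear throughput because the per-offset traversals advance in parallel pipeline stages, which is why the failure edges, and most of the automaton's memory cost, can be dropped.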

Exploiting the reuse supplied by loop-dependent stream references for stream processors
Xuejun Yang, Ying Zhang, Xicheng Lu, Jingling Xue, Ian Rogers, Gen Li, Guibin Wang, Xudong Fang
Article No.: 11
DOI: 10.1145/1839667.1839673

Memory accesses limit the performance of stream processors. By exploiting the reuse of data held in the Stream Register File (SRF), an on-chip, software-controlled storage, the number of memory accesses can be reduced. In current stream compilers, reuse exploitation is attempted only for simple stream references, those whose start and end are known; compiler analyses developed outside the stream-processing domain do not directly extend to more complex stream references. In this article, we propose a transformation that automatically optimizes stream programs to exploit the reuse supplied by loop-dependent stream references. The transformation is based on three results: lemmas identifying the reuse supplied by stream references, a new abstract representation called the Stream Reuse Graph (SRG) depicting the identified reuse, and the optimization of the SRG for our transformation. We exploit both the reuse between the whole sequences accessed by stream references and the reuse between partial sequences; partial reuse and its treatment are, to the best of our knowledge, new and have no counterpart in scalar and vector processing. At the same time, reusing streams increases pressure on the SRF, raising the question of which reuse to exploit within the limited SRF capacity; we extend our analysis to answer this question. Finally, we implement our techniques in the StreamC/KernelC compiler, which has been optimized with the best existing compilation techniques for stream processors. Experimental results show speedups of 1.14 to 2.54 times across a range of benchmarks.
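
To give a flavor of what whole versus partial reuse means for a loop-dependent stream reference, here is a hypothetical Python sketch; the affine base/stride/length form and the three-way classification are our simplification of what the paper formalizes with lemmas and the SRG.

    # Hypothetical sketch: classify the reuse a loop-dependent stream
    # reference supplies between consecutive loop iterations.

    def accessed(base, stride, length, i):
        # element range touched by the reference at loop iteration i
        start = base + stride * i
        return start, start + length

    def reuse_between_iterations(base, stride, length):
        s0, e0 = accessed(base, stride, length, 0)
        s1, e1 = accessed(base, stride, length, 1)
        overlap = max(0, min(e0, e1) - max(s0, s1))
        if overlap == length:
            return "whole reuse"    # same sequence each iteration
        if overlap > 0:
            return "partial reuse"  # sliding window; refetch only the tail
        return "no reuse"

    print(reuse_between_iterations(base=0, stride=4, length=16))   # partial
    print(reuse_between_iterations(base=0, stride=0, length=16))   # whole
    print(reuse_between_iterations(base=0, stride=32, length=16))  # none

In SRG terms, each reference is a node and each nonzero overlap an edge; keeping the overlapping portion resident in the SRF is what saves the memory accesses, subject to the capacity question the abstract raises.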

Eliminating voltage emergencies via software-guided code transformations
Vijay Janapa Reddi, Simone Campanoni, Meeta S. Gupta, Michael D. Smith, Gu-Yeon Wei, David Brooks, Kim Hazelwood
Article No.: 12
DOI: 10.1145/1839667.1839674

In recent years, circuit reliability in modern high-performance processors has become increasingly important. Shrinking feature sizes and diminishing supply voltages have made circuits more sensitive to supply voltage fluctuations. These fluctuations result from the natural variation of processor activity as workloads execute, but when left unattended they can lead to timing violations or even transistor lifetime issues. In this article, we present a hardware-software collaborative approach to mitigate voltage fluctuations. A checkpoint-recovery mechanism rectifies errors when the voltage violates maximum tolerance settings, while a runtime software layer reschedules the program's instruction stream to prevent recurring violations at the same program location. The runtime layer, combined with the proposed code-rescheduling algorithm, removes 60% of all violations with minimal overhead, thereby significantly improving overall performance. Our solution is a radical departure from the industry-standard approach of circumventing the issue altogether by optimizing for the worst-case voltage flux, which severely compromises power and performance efficiency, especially looking ahead to future technology generations; such conservative approaches will have severe implications for the ability to deliver efficient microprocessors. The proposed technique recasts a traditional reliability problem as a runtime performance-optimization problem, allowing us to design processors for typical-case operation by building intelligent algorithms that prevent recurring violations.
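
The division of labor between hardware and the runtime layer might be sketched as follows; the structure, the threshold, and all names here are our own illustration, not the paper's implementation.

    # Hypothetical sketch of the collaborative scheme: hardware
    # checkpoint-recovery repairs state after an emergency, and software
    # reschedules the offending region so the emergency stops recurring.

    from collections import Counter

    RESCHEDULE_THRESHOLD = 2   # assumed: tolerate a location twice, then act
    emergencies = Counter()

    def reschedule(pc):
        # Placeholder for the code-rescheduling algorithm: smooth the
        # voltage profile around pc, e.g. by spreading bursts of
        # high-power instructions across the surrounding schedule.
        print(f"rescheduling region around {pc:#x}")

    def on_voltage_emergency(pc, rollback):
        rollback()                     # hardware undoes the violating work
        emergencies[pc] += 1
        if emergencies[pc] >= RESCHEDULE_THRESHOLD:
            reschedule(pc)             # prevent recurrence, not just recover

This is what turns reliability into a performance problem: once rescheduling suppresses a recurring emergency site, the checkpoint hardware fires rarely, so the processor can be provisioned for typical-case rather than worst-case voltage margins.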