Experiences with MapReduce, an abstraction for large-scale computation

Jeffrey Dean

Pages: 1-1
doi: 10.1145/1152154.1152155
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a Map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a Reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The MapReduce run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: thousands of MapReduce programs have been implemented, and several thousand MapReduce jobs are executed on Google's clusters every day. In this talk I'll describe the basic programming model, discuss our experience using it in a variety of domains, and talk about the implications of programming models like MapReduce as one paradigm to simplify development of parallel software for multi-core microprocessors.
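The Map/Reduce division of labor described in this abstract can be illustrated with a tiny single-process sketch (hypothetical code, not Google's implementation): the user writes only the two functions, and the framework does the grouping that a real cluster would distribute across machines.

```python
from collections import defaultdict

def map_fn(key, value):
    # Emit (word, 1) for every word in the input line (the classic word count).
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Merge all intermediate values associated with one intermediate key.
    return sum(values)

def map_reduce(inputs, map_fn, reduce_fn):
    # Single-process stand-in for the framework: apply map, group by
    # intermediate key (the "shuffle"), then apply reduce per key.
    intermediate = defaultdict(list)
    for k, v in inputs:
        for ik, iv in map_fn(k, v):
            intermediate[ik].append(iv)
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}

counts = map_reduce([(0, "to be or not to be")], map_fn, reduce_fn)
print(counts["to"])  # 2
```

In the real system the grouping, partitioning, and fault handling run across thousands of machines; the user-visible contract is just these two functions.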
SESSION: Multi-core design I
Architectural support for operating system-driven CMP cache management

Nauman Rafique, Won-Taek Lim, Mithuna Thottethodi

Pages: 2-12
doi: 10.1145/1152154.1152160
The role of the operating system (OS) in managing shared resources such as CPU time, memory, peripherals, and even energy is well motivated and understood [23]. Unfortunately, one key resource—the lower-level shared cache in chip multiprocessors—is commonly managed purely in hardware by rudimentary replacement policies such as least-recently-used (LRU). The rigid nature of the hardware cache management policy poses a serious problem, since there is no single best cache management policy across all sharing scenarios. For example, the cache management policy for a scenario where applications from a single organization are running under a "best effort" performance expectation is likely to differ from the policy for a scenario where applications from competing business entities (say, at a third-party data center) are running under a minimum service-level expectation. When it comes to managing shared caches, there is an inherent tension between flexibility and performance. On one hand, managing the shared cache in the OS offers immense policy flexibility, since policies may be implemented in software. Unfortunately, it is prohibitively expensive in terms of performance for the OS to be involved in managing temporally fine-grain events such as cache allocation. On the other hand, sophisticated hardware-only cache management techniques have been proposed to achieve fair sharing or throughput maximization, but they offer no policy flexibility. This paper addresses this problem by designing architectural support for the OS to efficiently manage shared caches with a wide variety of policies. Our scheme consists of a hardware cache quota management mechanism, an OS interface, and a set of OS-level quota orchestration policies. The hardware mechanism guarantees that OS-specified quotas are enforced in shared caches, thus eliminating the need for (and the performance penalty of) temporally fine-grained OS intervention. The OS retains policy flexibility since it can tune the quotas during regularly scheduled OS interventions. We demonstrate that our scheme can support a wide range of policies, including policies that provide (a) passive performance differentiation, (b) reactive fairness by miss-rate equalization, and (c) reactive performance differentiation.
Communist, utilitarian, and capitalist cache policies on CMPs: caches as a shared resource

Lisa R. Hsu, Steven K. Reinhardt, Ravishankar Iyer, Srihari Makineni

Pages: 13-22
doi: 10.1145/1152154.1152161
As chip multiprocessors (CMPs) become increasingly mainstream, architects have likewise become more interested in how best to share a cache hierarchy among multiple simultaneous threads of execution. The complexity of this problem is exacerbated as the number of simultaneous threads grows from two or four to the tens or hundreds. However, there is no consensus in the architectural community on what "best" means in this context. Some papers in the literature seek to equalize each thread's performance loss due to sharing, while others emphasize maximizing overall system performance. Furthermore, the specific effect of these goals varies depending on the metric used to define "performance". In this paper we label equal-performance targets as Communist cache policies and overall-performance targets as Utilitarian cache policies. We compare both of these models to the most common current model of a free-for-all cache (a Capitalist policy). We consider various performance metrics, including miss rates, bandwidth usage, and IPC, using both absolute and relative values of each metric. Using analytical models and behavioral cache simulation, we find that the optimal partitioning of a shared cache can vary greatly as different but reasonable definitions of optimality are applied. We also find that, although Communist and Utilitarian targets are generally compatible, each policy has workloads for which it provides poor overall performance or poor fairness, respectively. Finally, we find that simple policies like LRU replacement and static uniform partitioning are not sufficient to provide near-optimal performance under any reasonable definition, indicating that some thread-aware cache resource allocation mechanism is required.
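The tension between Utilitarian and Communist objectives can be seen in a toy model (all numbers invented, not from the paper): given per-thread miss counts as a function of allocated cache ways, the split that minimizes total misses need not be the split that equalizes the threads' losses.

```python
# misses[t][w] = miss count of thread t when given w of the WAYS cache ways.
# Thread A is cache-friendly; thread B streams and barely benefits from ways.
misses = {
    "A": [90, 50, 30, 20, 15, 12, 10, 9, 8],
    "B": [60, 58, 56, 54, 52, 50, 48, 46, 44],
}
WAYS = 8

def best_split(score):
    # Enumerate every split (wa ways to A, WAYS - wa to B); keep lowest score.
    return min(range(WAYS + 1),
               key=lambda wa: score(misses["A"][wa], misses["B"][WAYS - wa]))

utilitarian = best_split(lambda a, b: a + b)       # minimize total misses
communist = best_split(lambda a, b: abs(a - b))    # equalize the threads

print(utilitarian, communist)  # 5 1
```

With these invented curves the Utilitarian optimum gives thread A five ways, while the Communist optimum gives it only one: the two reasonable definitions of "best" pick very different partitions, which is the paper's point.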
Core architecture optimization for heterogeneous chip multiprocessors

Rakesh Kumar, Dean M. Tullsen, Norman P. Jouppi

Pages: 23-32
doi: 10.1145/1152154.1152162
Previous studies have demonstrated the advantages of single-ISA heterogeneous multi-core architectures for power and performance. However, none of those studies examined how to design such a processor; instead, they started with an assumed combination of pre-existing cores. This work assumes the flexibility to design a multi-core architecture from the ground up and seeks to address the following question: what should be the characteristics of the cores of a heterogeneous multiprocessor for the highest area or power efficiency? The study is done for varying degrees of thread-level parallelism and for different area and power budgets. The most efficient chip multiprocessors are shown to be heterogeneous, with each core customized to a different subset of application characteristics; no single core is necessarily well suited to all applications. The performance ordering of cores on such processors differs across applications; there is only a partial ordering among cores in terms of resources and complexity. This methodology produces performance gains as high as 40%. The performance improvements come with the added cost of customization.
SESSION: Program analysis and optimization
Compiling for stream processing

Abhishek Das, William J. Dally, Peter Mattson

Pages: 33-42
doi: 10.1145/1152154.1152164
This paper describes a compiler for stream programs that efficiently schedules computational kernels and stream memory operations, and allocates on-chip storage. Our compiler uses information about the program structure and estimates of kernel and memory-operation execution times to overlap kernel execution with memory transfers, maximizing performance, and to optimize use of scarce on-chip memory, significantly reducing external memory bandwidth. Our compiler applies optimizations such as strip-mining, loop unrolling, and software pipelining at the level of kernels and stream memory operations. We evaluate the performance of our compiler on a suite of media and scientific benchmarks. Our results show that compiler management of on-chip storage reduces external memory bandwidth by 35% to 93% and reduces execution time by 23% to 72% compared to cache-like LRU management of the same storage. We show that strip-mining stream applications enables producer-consumer locality to be captured in on-chip storage, reducing external bandwidth by 50% to 80%. We also evaluate the sensitivity of performance to the scheduling methods used and to critical resources. Overall, our compiler is able to overlap memory operations and manage local storage so that 78% to 96% of program execution time is spent running computational kernels.
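Strip-mining, one of the optimizations named above, can be sketched in a few lines (a hypothetical illustration, not the paper's compiler): the stream is processed in strips small enough for on-chip storage, so a producer kernel's output is consumed before it would spill to external memory.

```python
STRIP = 4  # hypothetical on-chip capacity, in stream elements

def producer(strip):
    # A producer kernel: transforms each element of the strip.
    return [x * 2 for x in strip]

def consumer(strip):
    # A consumer kernel: reduces the producer's output.
    return sum(strip)

def stream_compute(data):
    # Strip-mined loop: each strip stays "on chip" between the two kernels,
    # so the intermediate stream never touches external memory.
    total = 0
    for i in range(0, len(data), STRIP):
        strip = data[i:i + STRIP]
        total += consumer(producer(strip))  # producer-consumer locality
    return total

print(stream_compute(list(range(8))))  # 56
```

Without strip-mining, the full intermediate stream `producer(data)` would be materialized at once; with it, only `STRIP` elements are live at a time, which is the locality the paper's compiler captures in on-chip storage.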
Region array SSA

Silvius Rus, Guobin He, Christophe Alias, Lawrence Rauchwerger

Pages: 43-52
doi: 10.1145/1152154.1152165
Static Single Assignment (SSA) has become the intermediate program representation of choice in most modern compilers because it enables efficient data flow analysis of scalars and thus leads to better scalar optimizations. Unfortunately, not much progress has been achieved in applying the same techniques to array data flow analysis, a very important and potentially powerful technology. In this paper we propose to improve the applicability of previous efforts in array SSA through the use of a symbolic memory access descriptor that can aggregate the accesses to the elements of an array over large, interprocedural program contexts. We then show the power of our new representation by using it to implement a basic data flow algorithm, reaching definitions. Finally, we apply this analysis to array constant propagation and array privatization and show performance improvements (speedups) for benchmark codes.
A two-phase escape analysis for parallel Java programs

Kyungwoo Lee, Samuel P. Midkiff

Pages: 53-62
doi: 10.1145/1152154.1152166
Thread escape analysis conservatively determines which objects may be accessed in more than one thread. It is useful for a variety of purposes—finding races in multi-threaded programs, removing useless synchronization, allocating data to thread-local heaps, and compiling to target stricter consistency models. Thread escape analyses are often interprocedural, and interprocedural analyses are generally either too slow to perform at runtime in dynamic systems, or trade off significant amounts of precision for speed. This paper describes a two-phase offline/online interprocedural and inter-thread escape analysis that is faster and more accurate, on average, than previously published analyses. By performing an offline pre-analysis followed by a dynamic online analysis that integrates the offline results with dynamic information, significant improvements in performance and accuracy are achieved. For compiling Java programs under a sequentially consistent memory model, our approach enables application executions that are, on average, 1.5 times faster than those using the previous fastest online algorithm, with only 80% of the online compilation time.
Challenges and opportunities in the post single-thread-processor era

Steve Scott

Pages: 63-63
doi: 10.1145/1152154.1152156
The age of the single-thread juggernaut has ended, due to a variety of factors. Multi-core processors are coming on strong, and scaling is being stressed more than ever. This presents a number of architectural, hardware, and software challenges. This talk will reflect on these challenges from Cray's perspective in the high-performance computing industry.
SESSION: Security and correctness
Self-checking instructions: reducing instruction redundancy for concurrent error detection

Sumeet Kumar, Aneesh Aggarwal

Pages: 64-73
doi: 10.1145/1152154.1152168
With reducing feature size, increasing chip capacity, and increasing clock speed, microprocessors are becoming increasingly susceptible to transient (soft) errors. Redundant multi-threading (RMT) is an attractive approach for concurrent error detection. However, redundant thread execution has a significant impact on performance and energy consumption in the chip. In this paper, we propose reducing instruction redundancy (the number of instructions that are redundantly executed) as a means to mitigate the performance and energy impact of redundancy. We experiment with a decoupled RMT approach in which the frontend pipeline stages are protected through error codes, while the backend pipeline stages are protected through redundant execution. In this approach, we define two categories of instructions—self-checking and semi self-checking instructions. Self-checking instructions are those whose results are checked for errors when their "main" copies are executed; these instructions are not redundantly executed. Semi self-checking instructions are those for which the major part of the result is checked when the "main" copies are executed, and the remaining part is checked using a small amount of additional hardware. Reducing instruction redundancy with this approach provides the same fault coverage as the base architecture in which all instructions are redundantly executed. The techniques are evaluated in terms of their performance, power, and vulnerability impact on the RMT processor. Our experiments show that the techniques reduce instruction redundancy by about 58% and recover about 51% of the performance lost due to redundant execution. Our techniques also recover about 40% of the energy consumption increase in the key data-path structures.
A low-cost memory remapping scheme for address bus protection

Lan Gao, Jun Yang, Marek Chrobak, Youtao Zhang, San Nguyen, Hsien-Hsin S. Lee

Pages: 74-83
doi: 10.1145/1152154.1152169
The address sequence on the processor-memory bus can reveal abundant information about the control flow of a program. This can lead to critical information leakage such as encryption keys or proprietary algorithms. Addresses can be observed by attaching a hardware device to the bus that passively monitors bus transactions. Such side-channel attacks deserve rising attention, especially in a distributed computing environment where remote servers running sensitive programs are not within the physical control of the client. Two previously proposed hardware techniques tackled this problem by randomizing address patterns on the bus. One proposal permutes a set of contiguous memory blocks under certain conditions, while the other randomly swaps two blocks when necessary. In this paper, we present an anatomy of these attempts and show that they impose great pressure on both the memory and the disk. This leaves them less scalable in high-performance systems, where the bandwidth of the bus and memory are critical resources. We propose a lightweight solution that alleviates this pressure without compromising security strength. The results show that our technique can reduce memory traffic by a factor of 10 compared with the prior scheme, while keeping almost the same page fault rate as a baseline system with no security protection.
Efficient data protection for distributed shared memory multiprocessors

Brian Rogers, Milos Prvulovic, Yan Solihin

Pages: 84-94
doi: 10.1145/1152154.1152170
Data security in computer systems has recently become an increasing concern, and hardware-based attacks have emerged. As a result, researchers have investigated hardware encryption and authentication mechanisms as a means of addressing this security concern. Unfortunately, no such techniques have been investigated for Distributed Shared Memory (DSM) multiprocessors, and previously proposed techniques for uniprocessor and Symmetric Multiprocessor (SMP) systems cannot be directly used for DSMs. This work is the first to examine the issues involved in protecting the secrecy and integrity of data in DSM systems. We first derive security requirements for processor-to-processor communication in DSMs, and find that different types of coherence messages need different protection. We then propose and evaluate techniques to provide efficient encryption and authentication of data in DSM systems. Our simulation results using SPLASH-2 benchmarks show that the execution time overhead of our three proposed approaches is small, ranging from 6% to 8% on a 16-processor DSM system, relative to a similar DSM without support for data secrecy and integrity.
SESSION: Characterizing program behavior
Wavelet-based phase classification

Ted Huffmire, Tim Sherwood

Pages: 95-104
doi: 10.1145/1152154.1152172
Phase analysis has proven to be a useful method of summarizing the time-varying behavior of programs, with uses ranging from reducing simulation time to guiding run-time optimizations. Although phase classification techniques based on basic block vectors have shown impressive accuracies on SPEC benchmarks, commercial programs remain a significant challenge due to their complex behaviors and multiple threads. Some behaviors, such as L2 cache misses, may have less correlation with the code and are therefore much harder to capture with basic block frequency vectors. Comparing the similarity of two or more intervals requires a good metric, one that is not only fast enough to analyze the full execution of the program, but also highly correlated with important performance-degrading events (such as L2 misses). We examine many different interval similarity metrics and their uses for program phase analysis across a range of commercial applications, and show that there is still significant room for improvement. To address this problem, we introduce a novel wavelet-based phase classification scheme that captures and compares images of memory behavior in two or more dimensions. Over a set of five commercial applications, we show that a wavelet-based scheme can strictly outperform a broad range of prior metrics in terms of both accuracy and overhead.
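As a rough illustration of the wavelet idea (a minimal one-level Haar sketch, not the paper's scheme): an interval's memory-behavior signal is reduced to average and detail coefficients, and intervals are compared by the distance between those coefficient vectors, which separates behaviors that an aggregate count alone would conflate.

```python
def haar_level(signal):
    # One level of the Haar wavelet transform (signal length must be even):
    # pairwise averages capture the coarse shape, pairwise differences the detail.
    avg = [(a + b) / 2 for a, b in zip(signal[0::2], signal[1::2])]
    detail = [(a - b) / 2 for a, b in zip(signal[0::2], signal[1::2])]
    return avg, detail

def signature(signal):
    avg, detail = haar_level(signal)
    return avg + detail

def similarity(sig_a, sig_b):
    # Euclidean distance between wavelet signatures; smaller = more similar.
    return sum((x - y) ** 2 for x, y in zip(sig_a, sig_b)) ** 0.5

phase1 = [4, 4, 8, 8]    # made-up per-region miss counts in one interval
phase2 = [4, 4, 8, 8]    # a repeat of the same phase
phase3 = [12, 0, 2, 10]  # same total misses, very different behavior

print(similarity(signature(phase1), signature(phase2)))  # 0.0
```

Note that `phase1` and `phase3` have the same total miss count, so a scalar summary cannot distinguish them, while their wavelet signatures are far apart.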
Complexity-based program phase analysis and classification

Chang-Burm Cho, Tao Li

Pages: 105-113
doi: 10.1145/1152154.1152173
Modeling and analysis of program behavior are at the foundation of computer system design and optimization. As computer systems become more adaptive, their efficiency increasingly depends on program dynamic characteristics. Previous studies have revealed that program runtime execution manifests phase behavior, and methods and tools to analyze and classify program phases have recently been developed. However, very few studies have sought to understand and evaluate program phases from the perspectives of their dynamics and complexity. In this work, we propose new methods, metrics, and frameworks that aim to analyze, quantify, and classify the dynamics and complexity of program phases. Our methods use wavelet techniques to represent program phases at multi-resolution scales. The cross-correlation coefficients between phase dynamics observed at different scales are then computed as metrics to quantify phase complexity. We apply wavelet-based multi-resolution analysis and data clustering to classify program execution into phases that exhibit a similar degree of complexity. Experimental results on SPEC CPU 2000 benchmarks show that the proposed schemes classify complexity-based program phases better than currently used approaches.
Performance prediction based on inherent program similarity

Kenneth Hoste, Aashish Phansalkar, Lieven Eeckhout, Andy Georges, Lizy K. John, Koen De Bosschere

Pages: 114-122
doi: 10.1145/1152154.1152174
A key challenge in benchmarking is to predict the performance of an application of interest on a number of platforms in order to determine which platform yields the best performance. This paper proposes an approach for doing so. We measure a number of microarchitecture-independent characteristics from the application of interest, and relate these characteristics to those of the programs in a previously profiled benchmark suite. Based on the similarity of the application of interest to programs in the benchmark suite, we make a performance prediction for the application of interest. We propose and evaluate three approaches (normalization, principal components analysis, and a genetic algorithm) for transforming the raw data set of microarchitecture-independent characteristics into a benchmark space in which relative distance is a measure of relative performance difference. We evaluate our approach using all of the SPEC CPU2000 benchmarks and real hardware performance numbers from the SPEC website. Our framework estimates per-benchmark machine ranks with a 0.89 average and a 0.80 worst-case rank correlation coefficient.
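The normalization-based variant can be sketched as follows (all feature values, benchmark names, and IPC numbers are invented): characteristics are z-scored so that no single dimension dominates the distance, and the application of interest is predicted to perform like its nearest profiled neighbor in that space.

```python
import math

# benchmark -> (feature vector, measured IPC); features might be ILP,
# cache miss rate, branch misprediction rate -- all numbers made up.
profiled = {
    "bench1": ([3.1, 0.02, 0.01], 1.9),
    "bench2": ([1.2, 0.10, 0.05], 0.6),
    "bench3": ([2.0, 0.05, 0.03], 1.1),
}

def make_normalizer(vecs):
    # Per-dimension z-score so no characteristic dominates the distance.
    dims = list(zip(*vecs))
    means = [sum(d) / len(d) for d in dims]
    stds = [math.sqrt(sum((x - m) ** 2 for x in d) / len(d)) or 1.0
            for d, m in zip(dims, means)]
    return lambda v: [(x - m) / s for x, m, s in zip(v, means, stds)]

norm = make_normalizer([v for v, _ in profiled.values()])

def predict(app_vec):
    # Predict the new application performs like its nearest profiled neighbor.
    nearest = min(profiled,
                  key=lambda b: math.dist(norm(app_vec), norm(profiled[b][0])))
    return nearest, profiled[nearest][1]

print(predict([2.1, 0.05, 0.03]))  # closest to bench3
```

A real deployment would predict from several nearby benchmarks rather than the single nearest one, but the core idea is the same: distance in the normalized characteristic space stands in for expected performance difference.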
Deep computing in biology: challenges and progress

Ajay Royyuru

Pages: 123-123
doi: 10.1145/1152154.1152157
The Computational Biology Center at IBM Research pursues basic and exploratory research at the interface of information technology and biology. Information technology plays a vital role in enabling new science and discovery in biology. Advances in high-throughput and platform technologies in biology present an unprecedented challenge in the scale, management, and analysis of biological data. Advances in computing architecture and scale are enabling simulations of complex biological processes at various organizational levels, from atomic to cellular and beyond. High-performance computing that takes full advantage of massive parallelism is a necessary means to obtain the performance needed to tackle this complexity. This talk will provide an overview of our current research in computational biology and highlight recent advances in large-scale simulations of biological systems.
SESSION: Multi-core design II
Hardware support for spin management in overcommitted virtual machines

Philip M. Wells, Koushik Chakraborty, Gurindar S. Sohi

Pages: 124-133
doi: 10.1145/1152154.1152176
Multiprocessor operating systems (OSs) pose several unique and conflicting challenges to System Virtual Machines (System VMs). For example, most existing system VMs resort to gang scheduling a guest OS's virtual processors (VCPUs) to avoid OS synchronization overhead. However, gang scheduling is infeasible for some application domains, and inflexible in others. In an overcommitted environment, an individual guest OS has more VCPUs than available physical processors (PCPUs), precluding the use of gang scheduling. In such an environment, we demonstrate a more than two-fold increase in runtime when transparently virtualizing a chip multiprocessor's cores. To combat this problem, we propose a hardware technique that detects several cases in which a VCPU is not performing useful work, and suggests preempting that VCPU to run a different, more productive VCPU. Our technique can dramatically reduce cycles wasted on OS synchronization, without requiring any semantic information from the software. We then present a case study, typical of server consolidation, to demonstrate the potential of the more flexible scheduling policies enabled by our technique. We propose one such policy that logically partitions the CMP cores between guest VMs. This policy increases throughput by 10-25% for consolidated server workloads due to improved cache locality and core utilization, and substantially improves performance isolation in private caches.
Testing implementations of transactional memory

Chaiyasit Manovit, Sudheendra Hangal, Hassan Chafi, Austen McDonald, Christos Kozyrakis, Kunle Olukotun

Pages: 134-143
doi: 10.1145/1152154.1152177
Transactional memory is an attractive design concept for scalable multiprocessors because it offers efficient lock-free synchronization and greatly simplifies parallel software. Given the subtle issues involved with concurrency and atomicity, however, it is important that transactional memory systems be carefully designed and aggressively tested to ensure their correctness. In this paper, we propose an axiomatic framework to model the formal specification of a realistic transactional memory system, which may contain a mix of transactional and non-transactional operations. Using this framework and extensions to analysis algorithms originally developed for checking traditional memory consistency, we show that the widely practiced pseudo-random testing methodology can be effectively applied to transactional memory systems. Our testing methodology was successful in finding previously unknown bugs in the implementation of TCC, a transactional memory system. We study two flavors of the underlying analysis algorithm, one incomplete and the other complete, and show that the complete algorithm, while theoretically intractable, is very efficient in practice.
Efficient emulation of hardware prefetchers via event-driven helper threading

Ilya Ganusov, Martin Burtscher

Pages: 144-153
doi: 10.1145/1152154.1152178
The advance of multi-core architectures provides significant benefits for parallel and throughput-oriented computing, but the performance of individual computation threads does not improve and may even suffer a penalty because of the increased contention for shared resources. This paper explores the idea of using available general-purpose cores in a CMP as helper engines for individual threads running on the active cores. We propose a lightweight architectural framework for efficient event-driven software emulation of complex hardware accelerators and describe how this framework can be applied to implement a variety of prefetching techniques. We demonstrate the viability and effectiveness of our framework on a wide range of applications from the SPEC CPU2000 and Olden benchmark suites. On average, our mechanism provides performance benefits within 5% of pure hardware implementations. Furthermore, we demonstrate that running event-driven prefetching threads on top of a baseline with a hardware stride prefetcher yields significant speedups for many programs. Finally, we show that our approach provides competitive performance improvements over other hardware approaches for multi-core execution while executing fewer instructions and requiring considerably less hardware support.
SESSION: Performance profiling and tuning
DEP: detailed execution profile

Qin Zhao, Joon Edward Sim, Weng-Fai Wong, Larry Rudolph

Pages: 154-163
doi: 10.1145/1152154.1152180
In many areas of computer architecture design and program development, the knowledge of dynamic program behavior can be very handy. Several challenges beset the accurate and complete collection of dynamic control flow and memory reference information. ...
In many areas of computer architecture design and program development, knowledge of dynamic program behavior can be very useful. Several challenges beset the accurate and complete collection of dynamic control flow and memory reference information, including scalability, runtime overhead, and code coverage. For example, while Tallam and Gupta's work on extending WPP (Whole Program Paths) showed good compressibility, their profile requires 500 MBytes of intermediate memory space and an average slowdown of 23 times to collect. To address these challenges, this paper presents DEP (Detailed Execution Profile). DEP captures the complete dynamic control flow, data dependency and memory reference of a whole program's execution. The profile size is significantly reduced due to the insight that most information can be recovered from a tightly coupled record of control flow and register value changes. DEP is collected in an infrastructure called Adept (A dynamic execution profiling tool), which uses the DynamoRIO binary instrumentation framework to insert profile-collecting instructions within the running application. DEP profiles user-level code execution in its entirety, including interprocedural paths and the execution of multiple threads. The framework for collecting DEP has been tested on real, large and commercial applications. Our experiments show that DEP of Linux SPECInt 2000 benchmarks and Windows SysMark benchmarks can be collected with an average slowdown of 5 times while maintaining competitive compressibility. DEP's profile sizes are about 60% that of traditional profiles.

Whole-program optimization of global variable layout
Nathaniel McIntosh, Sandya Mannarswamy, Robert Hundt
Pages: 164-172
doi>10.1145/1152154.1152181

On machines with high-performance processors, the memory system continues to be a performance bottleneck. Compilers insert prefetch operations and reorder data accesses to improve locality, but increasingly seek to modify an application's data layout to reduce cache miss and page fault penalties. In this paper we discuss Global Variable Layout (GVL), an optimization of the placement of entire static global data objects in the binary. We describe two practical methods for GVL in the HP-UX Integrity optimizing compiler for the Itanium® architecture. The first layout strategy relies on profile feedback, collaboratively employing the compiler, the linker and a pre-link tool to facilitate reordering. The second strategy uses whole-program analysis to drive data layout decisions, and does not require the use of a dynamic profile. We give a detailed description of our implementation and evaluate its performance for the SPEC integer benchmark programs, as well as for a large commercial database application.

Fast, automatic, procedure-level performance tuning
Zhelong Pan, Rudolf Eigenmann
Pages: 173-181
doi>10.1145/1152154.1152182

This paper presents an automated performance tuning solution, which partitions a program into a number of tuning sections and finds the best combination of compiler options for each section. Our solution builds on prior work on feedback-driven optimization, which tuned the whole program instead of each section. Our key novel algorithm partitions a program into appropriate tuning sections. We also present the architecture of a system that automates the tuning process; it includes several pre-tuning steps that partition and instrument the program, followed by the actual tuning and the post-tuning assembly of the individually-optimized parts. Our system, called PEAK, achieves fast tuning speed by measuring a small number of invocations of each code section, instead of the whole-program execution time, as in common solutions. Compared to these solutions, PEAK reduces tuning time from 2.19 hours to 5.85 minutes on average, while achieving similar program performance. PEAK improves the performance of SPEC CPU2000 FP benchmarks by 12% on average over GCC -O3, the highest optimization level, on a Pentium IV machine.
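One simple way to search for a good per-section flag combination is greedy hill climbing over on/off flags; the sketch below is illustrative only (it is not PEAK's actual search algorithm, and the flag names and `toy_runtime` cost model are assumptions):

```python
def tune_flags(flags, measure):
    """Greedy hill climbing over on/off compiler flags.

    Starting from the all-off configuration, repeatedly toggle the
    single flag whose change most reduces the measured runtime of a
    tuning section, and stop when no single toggle helps. `measure`
    maps a frozenset of enabled flags to a runtime in seconds."""
    current = frozenset()
    best = measure(current)
    improved = True
    while improved:
        improved = False
        for f in flags:
            cand = current ^ {f}          # toggle one flag
            t = measure(cand)
            if t < best:
                current, best, improved = cand, t, True
    return current, best

def toy_runtime(enabled):
    """Stand-in cost model for demonstration (a real tuner would run
    and time the instrumented code section instead)."""
    t = 10.0
    if "unroll" in enabled:
        t -= 1.0
    if "vectorize" in enabled:
        t -= 2.0
    if "O0" in enabled:
        t += 1.0
    return t

best_flags, best_time = tune_flags(["unroll", "vectorize", "O0"], toy_runtime)
```

Because each step measures only a few invocations of one section, the search cost stays far below whole-program re-execution, which is the effect the abstract's tuning-time numbers reflect.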

SESSION: Instruction fetch and control flow

Reducing control overhead in dataflow architectures
Andrew Petersen, Andrew Putnam, Martha Mercaldi, Andrew Schwerin, Susan Eggers, Steve Swanson, Mark Oskin
Pages: 182-191
doi>10.1145/1152154.1152184

In recent years, computer architects have proposed tiled architectures in response to several emerging problems in processor design, such as design complexity, wire delay, and fabrication reliability. One of these architectures, WaveScalar, uses a dynamic, tagged-token dataflow execution model to simplify the design of the processor tiles and their interconnection network and to achieve good parallel performance. However, using a dataflow execution model reawakens old problems, including the instruction overhead required for control flow. Previous work compiling the functional language Id to the Monsoon Dataflow System found this overhead to be 2–3× that of programs written in C and targeted to a MIPS R3000. In this paper, we present and analyze three compiler optimizations that significantly reduce control overhead with minimal additional hardware. We begin by describing how to translate imperative code into dataflow assembly and analyze the resulting control overhead. We report a similar 2–4× instruction overhead, which suggests that the execution model, rather than a specific source language or target architecture, is responsible. Then, we present the compiler optimizations, each of which is designed to eliminate a particular type of control overhead, and analyze the extent to which they were able to do so. Finally, we evaluate the effect of using all optimizations together on program performance. Together, the optimizations reduce control overhead by 80% on average, increasing application performance by 21–37%.

Power-efficient instruction delivery through trace reuse
Chengmo Yang, Alex Orailoglu
Pages: 192-201
doi>10.1145/1152154.1152185

As power dissipation inexorably becomes the major bottleneck in system integration and reliability, the front-end instruction delivery path in a traditional out-of-order superscalar processor needs to deliver high application performance in an energy-effective manner. This challenge can be addressed by efficiently reusing the work of fetch and decode performed during preceding loop iterations and resident mostly within the processor itself. As a large percentage of the instructions currently under fetch have previously dispatched copies resident in the Reorder Buffer (ROB), in this paper we develop a mechanism to utilize the ROB as a storage location for previously decoded instructions. Thus instructions can be fed directly from the ROB into the rename and issue stages, enabling the gating off of the fetch and decode logic for large periods of time so as to deliver significant power savings. The power and performance criticality of the ROB requires an efficient reuse identification mechanism; we outline such a cost-efficient Reuse Identification Unit (RIU) which enables effective identification of matches between ROB entries and the instructions currently under fetch. Simulation results on both multimedia and SPEC 2000 benchmarks confirm that incorporating the proposed technique on traditional out-of-order superscalar processors results in not only a slight improvement in performance, but also significant savings in overall system power dissipation, achieved within a limited hardware budget.

Branch predictor guided instruction decoding
Oliverio J. Santana, Ayose Falcón, Alex Ramirez, Mateo Valero
Pages: 202-211
doi>10.1145/1152154.1152186

Fast instruction decoding is a challenge for the design of CISC microprocessors. A well-known solution to overcome this problem is using a trace cache. It stores and fetches already decoded instructions, avoiding the need to decode them again. However, implementing a trace cache involves an important increase in the fetch architecture complexity. In this paper, we propose a novel decoding architecture that reduces the fetch engine implementation cost. Instead of using a special-purpose buffer like the trace cache, our proposal stores frequently decoded instructions in the memory hierarchy. The address where the decoded instructions are stored is kept in the branch prediction mechanism, enabling it to guide our decoding architecture. This makes it possible for the processor front-end to fetch already decoded instructions from memory instead of the original nondecoded instructions. Our results show that an 8-wide superscalar processor achieves an average 14% performance improvement by using our decoding architecture. This improvement is comparable to the one achieved by using the more complex trace cache, while requiring 16% less chip area and 21% less energy consumption in the fetch architecture.

SESSION: Application-specific optimizations

Two-level mapping based cache index selection for packet forwarding engines
Kaushik Rajan, R. Govindarajan
Pages: 212-221
doi>10.1145/1152154.1152188

Packet forwarding is a memory-intensive application requiring multiple accesses through a trie structure. The efficiency of a cache for this application critically depends on the placement function to reduce conflict misses. Traditional placement functions use a one-level mapping that naively partitions trie nodes into cache sets. However, as a significant percentage of trie nodes are not useful, these schemes suffer from a non-uniform distribution of useful nodes to sets. This in turn results in increased conflict misses. Newer organizations such as variable-associativity caches achieve flexibility in placement at the expense of increased hit latency. This makes them unsuitable for L1 caches. We propose a novel two-level mapping framework that retains the hit latency of one-level mapping yet incurs fewer conflict misses. This is achieved by introducing a second-level mapping which reorganizes the nodes in the naive initial partitions into refined partitions with a near-uniform distribution of nodes. Further, as this remapping is accomplished by simply adapting the index bits to a given routing table, the hit latency is not affected. We propose three new schemes which result in up to a 16% reduction in the number of misses and a 13% speedup in memory access time. In comparison, an XOR-based placement scheme, known to perform extremely well for general-purpose architectures, can obtain up to a 2% speedup in memory access time.
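The contrast between a naive one-level modulo placement and the XOR-based placement used as the comparison point can be sketched as follows (the cache geometry, function names, and stride example are illustrative assumptions, not the paper's schemes):

```python
def modulo_index(addr, set_bits=7, block_bits=5):
    """Conventional one-level placement: the set index is simply the
    low bits of the block address (block address mod number of sets)."""
    return (addr >> block_bits) & ((1 << set_bits) - 1)

def xor_index(addr, set_bits=7, block_bits=5):
    """XOR-based placement: fold the low tag bits into the index bits,
    which spreads power-of-two strides across many sets."""
    block = addr >> block_bits
    index = block & ((1 << set_bits) - 1)
    tag_low = (block >> set_bits) & ((1 << set_bits) - 1)
    return index ^ tag_low

# Accesses with a stride of sets * block_size (128 * 32 = 4096 bytes)
# all collide in one set under modulo placement but spread out under
# XOR placement:
addrs = [i * 4096 for i in range(16)]
mod_sets = {modulo_index(a) for a in addrs}
xor_sets = {xor_index(a) for a in addrs}
```

A two-level mapping pursues the same goal (even distribution of useful nodes over sets) but, as the abstract notes, does so by re-deriving the index bits for a given routing table rather than by a fixed hash.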

Program generation for the all-pairs shortest path problem
Sung-Chul Han, Franz Franchetti, Markus Püschel
Pages: 222-232
doi>10.1145/1152154.1152189

A recent trend in computing is the use of domain-specific program generators, designed to alleviate the effort of porting and reoptimizing libraries for fast-changing and increasingly complex computing platforms. Examples include ATLAS, SPIRAL, and the codelet generator in FFTW. Each of these generators produces highly optimized source code directly from a problem specification. In this paper, we extend this list by a program generator for the well-known Floyd-Warshall (FW) algorithm that solves the all-pairs shortest path problem, which is important in a wide range of engineering applications. As the first contribution, we derive variants of the FW algorithm that make it possible to apply many of the optimization techniques developed for matrix-matrix multiplication. The second contribution is the actual program generator, which uses tiling, loop unrolling, and SIMD vectorization combined with a hill climbing search to produce the best code (float or integer) for a given platform. Using the program generator, we demonstrate a speedup over a straightforward single-precision implementation of up to a factor of 1.3 on Pentium 4 and 1.8 on Athlon 64. Use of 4-way vectorization further improves the performance by another factor of up to 5.7 on Pentium 4 and 3.0 on Athlon 64. For short integer data, 8-way vectorization provides a speedup of up to 4.6 on Pentium 4 and 5.0 on Athlon 64 over the best scalar code.
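The scalar kernel such a generator specializes is the textbook Floyd-Warshall triple loop; a minimal reference version (without the tiling, unrolling, or vectorization the abstract describes) might look like:

```python
INF = float("inf")

def floyd_warshall(dist):
    """All-pairs shortest paths, in place, on an n x n distance matrix.

    dist[i][j] holds the edge weight from i to j (INF if absent, 0 on
    the diagonal). The k-i-j triple loop is the kernel the generator
    restructures, exploiting its similarity to matrix multiplication."""
    n = len(dist)
    for k in range(n):                 # intermediate vertex
        for i in range(n):
            dik = dist[i][k]
            for j in range(n):
                if dik + dist[k][j] < dist[i][j]:
                    dist[i][j] = dik + dist[k][j]
    return dist

# 3-node graph: 0 -> 1 costs 4, 1 -> 2 costs 1, 0 -> 2 costs 7 directly.
d = [[0, 4, 7],
     [INF, 0, 1],
     [INF, INF, 0]]
floyd_warshall(d)
```

After the call, the direct 0 -> 2 cost of 7 has been replaced by the cheaper path through vertex 1, exactly the relaxation the generated variants must preserve while reordering the loops.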

Combining analytical and empirical approaches in tuning matrix transposition
Qingda Lu, Sriram Krishnamoorthy, P. Sadayappan
Pages: 233-242
doi>10.1145/1152154.1152190

Matrix transposition is an important kernel used in many applications. Even though its optimization has been the subject of many studies, an optimization procedure that targets the characteristics of current processor architectures has not been developed. In this paper, we develop an integrated optimization framework that addresses a number of issues, including tiling for the memory hierarchy, effective handling of memory misalignment, utilizing memory subsystem characteristics, and the exploitation of the parallelism provided by the vector instruction sets in current processors. A judicious combination of analytical and empirical approaches is used to determine the most appropriate optimizations. The absence of problem information until execution time is handled by generating multiple versions of the code; the best version is chosen at runtime, with assistance from minimal-overhead inspectors. The approach highlights aspects of empirical optimization that are important for similar computations with little temporal reuse. Experimental results on PowerPC G5 and Intel Pentium 4 demonstrate the effectiveness of the developed framework.
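Tiling for the memory hierarchy, the first issue the framework addresses, can be illustrated on transposition itself (a minimal sketch under assumed conventions — flat row-major storage and an arbitrary tile size — not the paper's generated code):

```python
def transpose_tiled(a, n, tile=32):
    """Out-of-place transpose of an n x n matrix stored row-major in a
    flat list. Walking tile x tile blocks keeps both the source rows
    and the destination rows resident in cache, instead of striding
    through one of the two matrices with a cache-hostile access
    pattern."""
    b = [0] * (n * n)
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    b[j * n + i] = a[i * n + j]
    return b

a = list(range(16))
b = transpose_tiled(a, 4, tile=2)
```

Because transposition has no temporal reuse at all, the tile size and the runtime version selection the abstract mentions matter far more than for kernels like matrix multiplication that can amortize misses over repeated accesses.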

Processor architecture: too much parallelism?
David B. Kirk
Pages: 243-243
doi>10.1145/1152154.1152158

CPUs and GPUs have evolved considerably in the past few years, and the pace of change and evolution in processor architecture is likely to increase. Constraints of excess heat dissipation and power consumption have forced a radical rethinking of microprocessor architecture, from the headlong pursuit of GHz clock rates to multicore and multithreaded approaches. The demands of graphics vertex and pixel processing as well as more general non-graphics applications have driven GPUs to be powerful data-parallel floating point processing engines. Research and development in programming languages and environments have not kept pace with the changes in processors. Consequently, computer science and computer engineering research and education are not addressing important problems, or preparing students well for today's computer industry. This talk will provide some historical and architectural perspective on data-parallel GPU architectures, and will attempt to make some trend predictions for the future of GPUs. We will then provide some examples of successes and failures in mapping parallel algorithms to these architectures. Finally, we will conclude with some calls to action in research and education, to improve the utilization of these ubiquitous and powerful parallel machines.

SESSION: Out-of-order microarchitecture

Adaptive reorder buffers for SMT processors
Joseph Sharkey, Deniz Balkan, Dmitry Ponomarev
Pages: 244-253
doi>10.1145/1152154.1152192

In SMT processors, the complex interplay between private and shared datapath resources needs to be considered in order to realize the full performance potential. In this paper, we show that blindly increasing the size of the per-thread reorder buffers to provide a larger number of in-flight instructions does not result in the expected performance gains but, quite in contrast, degrades the instruction throughput for virtually all multithreaded workloads. The reason for this performance loss is the excessive pressure on the shared datapath resources, especially the instruction scheduling logic. We propose intelligent mechanisms for dynamically adapting the number of reorder buffer entries allocated to each thread in an effort to avoid such allocations if they detrimentally impact the scheduler. We achieve this goal through categorizing the program execution into issue-bound and commit-bound phases and only performing the buffer allocations to the threads operating in commit-bound phases. Our adaptive technique achieves improvements of 21% in instruction throughput and 10% in the fairness metric compared to the best performing baseline configuration with static ROBs.

SEED: scalable, efficient enforcement of dependences
Francisco J. Mesa-Martínez, Michael C. Huang, Jose Renau
Pages: 254-264
doi>10.1145/1152154.1152193

Instruction issue logic is a critical component in modern high-performance out-of-order processors. The ever increasing latencies found in modern processors, mostly associated with memory accesses and longer pipelines, can be attenuated using large issue queues. Conventional designs rely on atomic wakeup-select cycles to ensure compact scheduling. These designs must aggressively utilize broadcasting, compaction, and heavily-ported structures that scale poorly in terms of both power consumption and access time. To provide high scheduling flexibility and large instruction capacity without incurring prohibitive latency and energy overhead, we propose a novel scheme that uses an out-of-order, broadcast-free instruction wakeup block feeding an in-order scheduler. Multi-banked, index-based structures are used throughout this scheme to provide a high degree of scalability while achieving efficient dependence tracking, resulting in good overall performance and energy efficiency. We call this design "Scalable, Efficient Enforcement of Dependences (SEED)". We present a detailed design and analysis of SEED through an extensive evaluation. Compared to a conventional issue queue design, which is favorably assumed to scale in size without any impact on cycle time, the performance degradation of our design is 3% for both the INT and FP suites of SPEC CPU2000. For such a small performance cost, SEED enjoys a 19% reduction in total chip power consumption for a 32-entry configuration. We also synthesize SEED and a conventional issue logic with 90nm standard cell logic. Synthesis results show that SEED can cycle at twice the speed of a conventional issue logic of equivalent size. Cycling at the same frequency, SEED consumes ten times less dynamic power and five times less static power while achieving substantial area savings.

SPARTAN: speculative avoidance of register allocations to transient values for performance and energy efficiency
Deniz Balkan, Joseph Sharkey, Dmitry Ponomarev, Kanad Ghose
Pages: 265-274
doi>10.1145/1152154.1152194

High-performance microprocessors use large, heavily-ported physical register files (RFs) to increase instruction throughput. The high complexity and power dissipation of such RFs mainly stem from the need to maintain each and every result for a large number of cycles after the result generation. We observed that a significant fraction (about 45%) of the result values are never read from the register file and are not required to recover from branch mispredictions. In this paper, we propose SPARTAN, a set of micro-architectural extensions that predicts such transient values and in many cases completely avoids physical register allocations to them. We show that the transient values can be predicted as such with more than 97% accuracy on average across simulated SPEC 2000 benchmarks. We evaluate the performance of SPARTAN on a variety of configurations and show that significant improvements in performance and energy-efficiency can be realized. Furthermore, we directly compare SPARTAN against a number of previously proposed schemes for register optimizations and show that our technique significantly outperforms all those schemes.

SESSION: Dependences and register allocation

Overlapping dependent loads with addressless preload
Zhen Yang, Xudong Shi, Feiqi Su, Jih-Kwon Peir
Pages: 275-284
doi>10.1145/1152154.1152196

Modern out-of-order processors with non-blocking caches exploit Memory-Level Parallelism (MLP) by overlapping cache misses in a wide instruction window. The exploitation of MLP, however, can be limited due to long-latency operations in producing the base address of a cache miss load. When the parent instruction is also a cache miss load, a serialization of the two loads must be enforced to satisfy the load-load data dependence. In this paper, we propose a mechanism that dynamically captures load-load data dependences at runtime. A special Preload is issued in place of the dependent load without waiting for the parent load, thus effectively overlapping the two loads. The Preload provides the necessary information for the memory controller to calculate the correct memory address upon the availability of the parent's data, eliminating any interconnect delay between the two loads. Performance evaluations based on SPEC2000 and Olden applications show that significant speedups of up to 40%, with an average of 16%, are achievable using the Preload. In conjunction with other aggressive MLP exploitation methods, such as runahead execution, the Preload yields an even larger improvement, with an average of 22%.

Prematerialization: reducing register pressure for free
Ivan D. Baev, Richard E. Hank, David H. Gross
Pages: 285-294
doi>10.1145/1152154.1152197

Modern compiler transformations that eliminate redundant computations or reorder instructions, such as partial redundancy elimination and instruction scheduling, are very effective in improving application performance but tend to create longer and potentially more complex live ranges. Typically the task of dealing with the increased register pressure is left to the register allocator. To avoid the introduction of spill code, which can reduce or completely eliminate the benefit of earlier optimizations, researchers have developed techniques such as live range splitting and rematerialization. This paper describes prematerialization (PM), a novel method for reducing register pressure for VLIW architectures with nop instructions. PM and rematerialization both select "never killed" live ranges and break them up by introducing one or more definitions close to the uses. However, while rematerialization is applied to live ranges selected for spilling during register allocation, PM relies on the availability of nop instructions and occurs prior to register allocation. PM simplifies register allocation by creating live ranges that are easier to color and less likely to spill. We have implemented prematerialization in the HP-UX production compilers for the Intel® Itanium® architecture. Performance evaluation indicates that the proposed technique is effective in reducing register pressure inherent in highly optimized code.

An empirical evaluation of chains of recurrences for array dependence testing
J. Birch, R.A. van Engelen, K.A. Gallivan, Y. Shou
Pages: 295-304
doi>10.1145/1152154.1152198

Code restructuring compilers rely heavily on program analysis techniques to automatically detect data dependences between program statements. Dependences between statement instances in the iteration space of a loop nest impose ordering constraints that must be preserved in order to produce valid optimized, vectorized, and parallelized loop nests. This paper evaluates a new approach for fast and accurate nonlinear array dependence testing using Chains of Recurrences (CRs). A flow-sensitive loop analysis algorithm is presented for constructing the CR forms of array index expressions. Unlike other approaches, the CR forms are directly integrated into a standard dependence test to solve nonlinear CR-based dependence equations. To study the coverage and performance of the proposed CR-based enhancements of a standard test, we chose the inexact Banerjee test. We implemented a new CR-based Banerjee test in the Polaris compiler and compared the results to the Omega test and Range test on a set of SPEC and LAPACK benchmark programs. The experimental results suggest that a CR enhancement can dramatically increase the effectiveness of a dependence test without a significant cost increase. More surprisingly, the findings indicate that the enhanced test exceeds the capabilities of the Omega and Range tests for many nonlinear dependence relations detected in the PERFECT Club and LAPACK benchmark programs.
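The idea behind a chain of recurrences is to represent an index function over the iteration space as a tuple {f(0), +, Δf(0), +, Δ²f(0), ...} of finite differences that the loop can update with additions only; the sketch below shows this for a polynomial index (a simplification for illustration, not the paper's CR construction algorithm):

```python
def cr_from_poly(f, degree):
    """Build the CR coefficients {f(0), +, d f(0), +, d^2 f(0), ...}
    for a polynomial index function f of the given degree by tabling
    finite differences at i = 0, 1, ..., degree."""
    vals = [f(i) for i in range(degree + 1)]
    coeffs = []
    while vals:
        coeffs.append(vals[0])
        vals = [vals[k + 1] - vals[k] for k in range(len(vals) - 1)]
    return coeffs

def cr_evaluate(coeffs, iterations):
    """Replay the CR over the iteration space using only additions,
    yielding f(0), f(1), ... -- the strength-reduced form a compiler
    can reason about when solving dependence equations."""
    c = list(coeffs)
    out = []
    for _ in range(iterations):
        out.append(c[0])
        for k in range(len(c) - 1):
            c[k] += c[k + 1]
    return out

# Nonlinear subscript a[i*i + 3*i + 1]: its CR form is {1, +, 4, +, 2}.
coeffs = cr_from_poly(lambda i: i * i + 3 * i + 1, 2)
seq = cr_evaluate(coeffs, 5)
```

Because the CR form exposes the start value and the per-iteration increments directly, a dependence test such as Banerjee's can bound a nonlinear subscript over the iteration space without ever symbolically manipulating the original polynomial.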