ABSTRACT
Commercial applications such as databases and Web servers constitute the most important market segment for high-performance servers. Among these applications, on-line transaction processing (OLTP) workloads provide a challenging set of requirements for system designs since they often exhibit inefficient executions dominated by a large memory stall component. This behavior arises from large instruction and data footprints and high communication miss rates. A number of recent studies have characterized the behavior of commercial workloads and proposed architectural features to improve their performance. However, there has been little research on the impact of software and compiler-level optimizations for improving the behavior of such workloads.
This paper provides a detailed study of profile-driven compiler optimizations to improve the code layout in commercial workloads with large instruction footprints. Our compiler algorithms are implemented in the context of Spike, an executable optimizer for the Alpha architecture. Our experiments use the Oracle commercial database engine running an OLTP workload, with results generated using both full system simulations and actual runs on Alpha multiprocessors. Our results show that code layout optimizations can provide a major improvement in the instruction cache behavior, providing a 55% to 65% reduction in the application misses for 64-128K caches. Our analysis shows that this improvement primarily arises from longer sequences of consecutively executed instructions and more reuse of cache lines before they are replaced. We also show that the majority of application instruction misses are caused by self-interference. However, code layout optimizations significantly reduce the amount of self-interference, thus elevating the relative importance of interference with operating system code. Finally, we show that better code layout can also provide substantial improvements in the behavior of other memory system components such as the instruction TLB and the unified second-level cache. The overall performance impact of our code layout optimizations is an improvement of 1.33 times in the execution time of our workload.
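The abstract does not spell out the layout algorithms themselves. As a rough illustration of what a profile-driven code layout pass can look like, the sketch below implements greedy basic-block chaining, a well-known heuristic in this family; it is not claimed to be the algorithm used in Spike. It assumes the profile supplies an execution count per control-flow edge, and the function name, block labels, and counts are all hypothetical.

```python
from typing import Dict, List, Tuple


def chain_basic_blocks(blocks: List[str],
                       edge_counts: Dict[Tuple[str, str], int],
                       entry: str) -> List[str]:
    """Greedy chaining of basic blocks by profiled edge weight (illustrative)."""
    # Every block starts as its own single-element chain.
    chain_of = {b: [b] for b in blocks}

    # Visit the hottest control-flow edges first; break ties deterministically.
    hot_edges = sorted(edge_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for (src, dst), count in hot_edges:
        if count == 0 or dst == entry:
            continue  # never place a block in front of the entry block
        src_chain, dst_chain = chain_of[src], chain_of[dst]
        # Merge only if src ends one chain and dst starts a different one,
        # so that dst becomes the fall-through successor of src.
        if src_chain is dst_chain or src_chain[-1] != src or dst_chain[0] != dst:
            continue
        src_chain.extend(dst_chain)
        for b in dst_chain:
            chain_of[b] = src_chain

    # Emit the entry block's chain first, then the remaining chains in
    # their original block order.
    layout: List[str] = []
    seen = set()
    for b in [entry] + blocks:
        chain = chain_of[b]
        if id(chain) not in seen:
            seen.add(id(chain))
            layout.extend(chain)
    return layout


# Hypothetical profile: the path A -> B -> D -> E is hot, C is rarely executed.
example_layout = chain_basic_blocks(
    ["A", "B", "C", "D", "E"],
    {("A", "B"): 90, ("A", "C"): 10, ("B", "D"): 90,
     ("C", "D"): 10, ("D", "E"): 100},
    entry="A",
)
```

In this toy profile the hot path ends up contiguous and the cold block is moved to the end, which is the kind of change that lengthens runs of consecutively executed instructions and lets each cache line be reused more before it is replaced.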
AUTHORS
Alex Ramirez
Luiz André Barroso
Kourosh Gharachorloo
Robert Cohn
Josep Larriba-Pey
P. Geoffrey Lowney
Mateo Valero
PUBLICATION
Proceeding
Title: ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Chairman: Per Stenström, Chalmers Univ. of Technology
Pages: 155-164
Publication Date: 2001-06-01
Sponsors: SIGARCH, ACM Special Interest Group on Computer Architecture; IEEE-CS TCCA, TC on Computer Architecture
Publisher: ACM, New York, NY, USA ©2001
ISBN: 0-7695-1162-7
doi:10.1145/379240.379260
Conference: ISCA, International Symposium on Computer Architecture
Paper Acceptance Rate: 24 of 163 submissions, 15%
Overall Acceptance Rate: 533 of 2,983 submissions, 18%

Newsletter
Title: ACM SIGARCH Computer Architecture News - Special Issue: Proceedings of the 28th annual international symposium on Computer architecture (ISCA '01)
Volume 29 Issue 2, May 2001
Editor: Per Stenström, Chalmers Univ. of Technology
Pages: 155-164
Publication Date: 2001-05-01
Sponsor: SIGARCH, ACM Special Interest Group on Computer Architecture
Publisher: ACM, New York, NY, USA
ISSN: 0163-5964
doi:10.1145/384285.379260
Table of Contents

General Chair's Message
Page: .07

Program Chair's Message
Page: .08

Conference Organization
Page: .09

Reviewers
Page: .10

Execution-based prediction using speculative slices
Craig Zilles, Gurindar Sohi
Pages: 2-13
doi:10.1145/379240.379246
A relatively small set of static instructions has significant leverage on program execution performance. These problem instructions contribute a disproportionate number of cache misses and branch mispredictions because their behavior cannot be accurately ...

Opening Remarks
Page: 4

Speculative precomputation: long-range prefetching of delinquent loads
Jamison D. Collins, Hong Wang, Dean M. Tullsen, Christopher Hughes, Yong-Fong Lee, Dan Lavery, John P. Shen
Pages: 14-25
doi:10.1145/379240.379248
This paper explores Speculative Precomputation, a technique that uses idle thread context in a multithreaded architecture to improve performance of single-threaded applications. It attacks program stalls from data cache misses by pre-computing future ...

Dynamically allocating processor resources between nearby and distant ILP
Rajeev Balasubramonian, Sandhya Dwarkadas, David H. Albonesi
Pages: 26-37
doi:10.1145/379240.379249
Modern superscalar processors use wide instruction issue widths and out-of-order execution in order to increase instruction-level parallelism (ILP). Because instructions must be committed in order so as to guarantee precise exceptions, increasing ...

Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors
Chi-Keung Luk
Pages: 40-51
doi:10.1145/379240.379250
Hardly predictable data addresses in many irregular applications have rendered prefetching ineffective. In many cases, the only accurate way to predict these addresses is to directly execute the code that generates them. As multithreaded architectures ...

Data prefetching by dependence graph precomputation
Murali Annavaram, Jignesh M. Patel, Edward S. Davidson
Pages: 52-61
doi:10.1145/379240.379251
Data cache misses reduce the performance of wide-issue processors by stalling the data supply to the processor. Prefetching data by predicting the miss address is one way to tolerate the cache miss latencies. But current applications with irregular ...

Concurrency, latency, or system overhead: which has the largest impact on uniprocessor DRAM-system performance?
Vinodh Cuppu, Bruce Jacob
Pages: 62-71
doi:10.1145/379240.379252
Given a fixed CPU architecture and a fixed DRAM timing specification, there is still a large design space for a DRAM system organization. Parameters include the number of memory channels, the bandwidth of each channel, burst sizes, queue sizes and ...

Focusing processor policies via critical-path prediction
Brian Fields, Shai Rubin, Rastislav Bodík
Pages: 74-85
doi:10.1145/379240.379253
Although some instructions hurt performance more than others, current processors typically apply scheduling and speculation as if each instruction was equally costly. Instruction cost can be naturally expressed through the critical path: if we ...

Automated design of finite state machine predictors for customized processors
Timothy Sherwood, Brad Calder
Pages: 86-97
doi:10.1145/379240.379254
Customized processors use compiler analysis and design automation techniques to take a generalized architectural model and create a specific instance of it which is optimized to a given application or set of applications. These processors offer the ...

Better exploration of region-level value locality with integrated computation reuse and value prediction
Youfeng Wu, Dong-Yuan Chen, Jesse Fang
Pages: 98-108
doi:10.1145/379240.379255
Computation-reuse and value-prediction are two recent techniques for improving microprocessor performance by exploiting value localities. They both aim at breaking the data dependence limit in traditional processors. In this paper, we propose a speculative ...

CryptoManiac: a fast flexible architecture for secure communication
Lisa Wu, Chris Weaver, Todd Austin
Pages: 110-119
doi:10.1145/379240.379256
The growth of the Internet as a vehicle for secure communication and electronic commerce has brought cryptographic processing performance to the forefront of high throughput system design. This trend will be further underscored with the widespread ...

QoS provisioning in clusters: an investigation of Router and NIC design
Ki Hwan Yum, Eun Jung Kim, Chita R. Das
Pages: 120-129
doi:10.1145/379240.379257
Design of high performance cluster networks (routers) with Quality-of-Service (QoS) guarantees is becoming increasingly important to support a variety of multimedia applications, many of which have real-time constraints. Most commercial routers, which ...

Locality vs. criticality
Roy Dz-ching Ju, Alvin R. Lebeck, Chris Wilkerson / Srikanth T. Srinivasan
Pages: 132-143
doi:10.1145/379240.379258
Current memory hierarchies exploit locality of references to reduce load latency and thereby improve processor performance. Locality based schemes aim at reducing the number of cache misses and tend to ignore the nature of misses. This leads to a ...

Dead-block prediction & dead-block correlating prefetchers
An-Chow Lai, Cem Fide, Babak Falsafi
Pages: 144-154
doi:10.1145/379240.379259
Effective data prefetching requires accurate mechanisms to predict both “which” cache blocks to prefetch and “when” to prefetch them. This paper proposes the Dead-Block Predictors (DBPs), trace-based predictors that ...

Code layout optimizations for transaction processing workloads
Alex Ramirez, Luiz André Barroso, Kourosh Gharachorloo, Robert Cohn, Josep Larriba-Pey, P. Geoffrey Lowney, Mateo Valero
Pages: 155-164
doi:10.1145/379240.379260
Commercial applications such as databases and Web servers constitute the most important market segment for high-performance servers. Among these applications, on-line transaction processing (OLTP) workloads provide a challenging set of requirements for ...

Exploring and exploiting wire-level pipelining in emerging technologies
Michael Thaddeus Niemier, Peter M. Kogge
Pages: 166-177
doi:10.1145/379240.379261
Pipelining is a technique that has long since been considered fundamental by computer architects. However, the world of nanoelectronics is pushing the idea of pipelining to new and lower levels, particularly the device level. How this affects ...

NanoFabrics: spatial computing using molecular electronics
Seth Copen Goldstein, Mihai Budiu
Pages: 178-191
doi:10.1145/379240.379262
The continuation of the remarkable exponential increases in processing power over the recent past faces imminent challenges due in part to the physics of deep-submicron CMOS devices and the costs of both chip masks and future fabrication plants. A promising ...

A simple method for extracting models for protocol code
David Lie, Andy Chou, Dawson Engler, David L. Dill
Pages: 192-203
doi:10.1145/379240.379263
The use of model checking for validation requires that models of the underlying system be created. Creating such models is both difficult and error prone and as a result, verification is rarely used despite its advantages. In this paper, we present ...

Removing architectural bottlenecks to the scalability of speculative parallelization
Milos Prvulovic, María Jesús Garzarán, Lawrence Rauchwerger, Josep Torrellas
Pages: 204-215
doi:10.1145/379240.379264
Speculative thread-level parallelization is a promising way to speed up codes that compilers fail to parallelize. While several speculative parallelization schemes have been proposed for different machine sizes and types of codes, the results so far ...

Power and energy reduction via pipeline balancing
R. Iris Bahar, Srilatha Manne
Pages: 218-229
doi:10.1145/379240.379265
Minimizing power dissipation is an important design requirement for both portable and non-portable systems. In this work, we propose an architectural solution to the power problem that retains performance while reducing power. The technique, known ...

Energy-effective issue logic
Daniele Folegnani, Antonio González
Pages: 230-239
doi:10.1145/379240.379266
The issue logic of a dynamically-scheduled superscalar processor is a complex mechanism devoted to start the execution of multiple instructions every cycle. Due to its complexity, it is responsible for a significant percentage of the energy consumed ...

Cache decay: exploiting generational behavior to reduce cache leakage power
Stefanos Kaxiras, Zhigang Hu, Margaret Martonosi
Pages: 240-251
doi:10.1145/379240.379268
Power dissipation is increasingly important in CPUs ranging from those intended for mobile use, all the way up to high-performance processors for high-end servers. While the bulk of the power dissipated is dynamic switching power, leakage power is ...

Variability in the execution of multimedia applications and implications for architecture
Christopher J. Hughes, Praful Kaul, Sarita V. Adve, Rohit Jain, Chanik Park, Jayanth Srinivasan
Pages: 254-265
doi:10.1145/379240.379270
Multimedia applications are an increasingly important workload for general-purpose processors. This paper analyzes frame-level execution time variability for several multimedia applications on general-purpose architectures. There are two reasons for ...

Measuring Experimental Error in Microprocessor Simulation
Rajagopalan Desikan, Doug Burger, Stephen W. Keckler
Pages: 266-277
doi:10.1145/379240.565338
We measure the experimental error that arises from the use of non-validated simulators in computer architecture research, with the goal of increasing the rigor of simulation-based studies. We describe the methodology that we used to validate ...

Rapid profiling via stratified sampling
S. Subramanya Sastry, Rastislav Bodík, James E. Smith
Pages: 278-289
doi:10.1145/379240.379273
Sophisticated binary translators and dynamic optimizers demand a program profiler with low overhead, high accuracy, and the ability to collect a variety of profile types. A profiling scheme that achieves these goals is proposed. Conceptually, the ...

Author Index
Page: 291