
ABSTRACT

Commercial applications such as databases and Web servers constitute the most important market segment for high-performance servers. Among these applications, on-line transaction processing (OLTP) workloads provide a challenging set of requirements for system designs since they often exhibit inefficient executions dominated by a large memory stall component. This behavior arises from large instruction and data footprints and high communication miss rates. A number of recent studies have characterized the behavior of commercial workloads and proposed architectural features to improve their performance. However, there has been little research on the impact of software and compiler-level optimizations for improving the behavior of such workloads.

This paper provides a detailed study of profile-driven compiler optimizations to improve the code layout in commercial workloads with large instruction footprints. Our compiler algorithms are implemented in the context of Spike, an executable optimizer for the Alpha architecture. Our experiments use the Oracle commercial database engine running an OLTP workload, with results generated using both full system simulations and actual runs on Alpha multiprocessors. Our results show that code layout optimizations can provide a major improvement in the instruction cache behavior, providing a 55% to 65% reduction in the application misses for 64-128K caches. Our analysis shows that this improvement primarily arises from longer sequences of consecutively executed instructions and more reuse of cache lines before they are replaced. We also show that the majority of application instruction misses are caused by self-interference. However, code layout optimizations significantly reduce the amount of self-interference, thus elevating the relative importance of interference with operating system code. Finally, we show that better code layout can also provide substantial improvements in the behavior of other memory system components such as the instruction TLB and the unified second-level cache. The overall performance impact of our code layout optimizations is an improvement of 1.33 times in the execution time of our workload.
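The abstract does not spell out the layout algorithms themselves. As an illustration only, the following is a minimal sketch of one widely used profile-driven technique, greedy basic-block chaining, in which hot control-flow edges are turned into fall-throughs so that consecutively executed instructions sit next to each other in memory. It is not taken from the paper or from Spike; the function names, the toy control-flow graph, and the edge counts are all hypothetical.

```python
from collections import defaultdict


def chain_blocks(edge_counts, entry):
    """Greedy chaining sketch (hypothetical helper, not Spike's algorithm).

    edge_counts maps (src_block, dst_block) -> profiled execution count.
    Returns a linear block order that favours hot fall-through paths.
    """
    blocks = {entry}
    for src, dst in edge_counts:
        blocks.update((src, dst))
    # Every block starts out as its own single-element chain.
    chain_of = {b: [b] for b in blocks}

    # Visit edges hottest first; ties are broken by name for determinism.
    hottest_first = sorted(edge_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for (src, dst), _count in hottest_first:
        head, tail = chain_of[src], chain_of[dst]
        # Merge only if src ends one chain and dst begins a different one.
        # The entry block is never appended, so it stays first in its chain.
        if head is not tail and head[-1] == src and tail[0] == dst and dst != entry:
            head.extend(tail)
            for blk in tail:
                chain_of[blk] = head

    # Approximate each block's weight by the sum of its incoming edge counts.
    weight = defaultdict(int)
    for (_src, dst), count in edge_counts.items():
        weight[dst] += count

    chains = {id(c): c for c in chain_of.values()}.values()

    def chain_key(chain):
        # Entry chain first, then hotter chains before colder ones.
        return (chain[0] != entry, -sum(weight[b] for b in chain), chain[0])

    layout = []
    for chain in sorted(chains, key=chain_key):
        layout.extend(chain)
    return layout


def fallthrough_fraction(layout, edge_counts):
    """Fraction of profiled transfers whose target immediately follows the source."""
    pos = {b: i for i, b in enumerate(layout)}
    total = sum(edge_counts.values())
    hits = sum(c for (s, d), c in edge_counts.items() if pos[d] == pos[s] + 1)
    return hits / total


if __name__ == "__main__":
    # Toy profile: the error path is rarely taken.
    edges = {
        ("entry", "check"): 100,
        ("check", "error"): 2,
        ("check", "work"): 98,
        ("error", "exit"): 2,
        ("work", "exit"): 98,
    }
    source_order = ["entry", "check", "error", "work", "exit"]
    optimized = chain_blocks(edges, "entry")
    print(optimized)  # ['entry', 'check', 'work', 'exit', 'error']
    print(f"{fallthrough_fraction(source_order, edges):.2f}")  # 0.67
    print(f"{fallthrough_fraction(optimized, edges):.2f}")     # 0.99
```

In this toy case the rarely executed error block is moved out of the hot path, which lengthens the run of sequentially executed blocks. That is the kind of effect the abstract attributes its instruction cache gains to, although the real optimizer and workload are far more involved.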



AUTHORS



Alex Ramirez

Bibliometrics: publication history
Publication years: 1999-2014
Publication count: 86
Citation Count: 496
Available for download: 25
Downloads (6 Weeks): 38
Downloads (12 Months): 639
Downloads (cumulative): 9,765
Average downloads per article: 390.60
Average citations per article: 5.77


Luiz André Barroso

Personal web page: labatbarroso.org
Bibliometrics: publication history
Publication years: 1991-2014
Publication count: 29
Citation Count: 2,195
Available for download: 17
Downloads (6 Weeks): 467
Downloads (12 Months): 7,947
Downloads (cumulative): 120,930
Average downloads per article: 7,113.53
Average citations per article: 75.69


Kourosh Gharachorloo

Bibliometrics: publication history
Publication years: 1988-2001
Publication count: 36
Citation Count: 2,449
Available for download: 26
Downloads (6 Weeks): 65
Downloads (12 Months): 805
Downloads (cumulative): 21,947
Average downloads per article: 844.12
Average citations per article: 68.03


Robert Cohn

Bibliometrics: publication history
Publication years: 1988-2011
Publication count: 28
Citation Count: 1,401
Available for download: 20
Downloads (6 Weeks): 78
Downloads (12 Months): 837
Downloads (cumulative): 12,397
Average downloads per article: 619.85
Average citations per article: 50.04


Josep Larriba-Pey

Bibliometrics: publication history
Publication years: 1993-2016
Publication count: 68
Citation Count: 258
Available for download: 23
Downloads (6 Weeks): 77
Downloads (12 Months): 1,116
Downloads (cumulative): 7,600
Average downloads per article: 330.43
Average citations per article: 3.79


P. Geoffrey Lowney

Bibliometrics: publication history
Publication years: 1981-2002
Publication count: 9
Citation Count: 250
Available for download: 5
Downloads (6 Weeks): 7
Downloads (12 Months): 55
Downloads (cumulative): 1,967
Average downloads per article: 393.40
Average citations per article: 27.78


Mateo Valero

Bibliometrics: publication history
Publication years: 1983-2016
Publication count: 308
Citation Count: 2,062
Available for download: 128
Downloads (6 Weeks): 245
Downloads (12 Months): 2,841
Downloads (cumulative): 43,367
Average downloads per article: 338.80
Average citations per article: 6.69

CITED BY

25 Citations

PUBLICATION

· Proceeding
Title: ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Chairman: Per Stenström, Chalmers Univ. of Technology
Pages: 155-164
Publication Date: 2001-06-01 (yyyy-mm-dd)
Sponsors: SIGARCH (ACM Special Interest Group on Computer Architecture), IEEE-CS\TCCA (TC on Computer Architecture)
Publisher: ACM, New York, NY, USA ©2001
ISBN: 0-7695-1162-7
DOI: 10.1145/379240.379260
Conference: ISCA, International Symposium on Computer Architecture
Paper Acceptance Rate: 24 of 163 submissions, 15%
Overall Acceptance Rate: 533 of 2,983 submissions, 18%
Year      Submitted  Accepted  Rate
ISCA '99        135        26   19%
ISCA '01        163        24   15%
ISCA '02        180        27   15%
ISCA '03        184        36   20%
ISCA '04        217        31   14%
ISCA '05        194        45   23%
ISCA '06        234        31   13%
ISCA '07        204        46   23%
ISCA '08        259        37   14%
ISCA '09        210        43   20%
ISCA '10        245        44   18%
ISCA '11        208        40   19%
ISCA '12        262        47   18%
ISCA '13        288        56   19%
Overall       2,983       533   18%
· Newsletter
Title: ACM SIGARCH Computer Architecture News - Special Issue: Proceedings of the 28th annual international symposium on Computer architecture (ISCA '01)
Volume 29 Issue 2, May 2001
Editor: Per Stenström, Chalmers Univ. of Technology
Pages: 155-164
Publication Date: 2001-05-01 (yyyy-mm-dd)
Sponsor: SIGARCH (ACM Special Interest Group on Computer Architecture)
Publisher: ACM, New York, NY, USA
ISSN: 0163-5964
DOI: 10.1145/384285.379260

APPEARS IN
Hardware Design
Performance

Table of Contents

Proceedings of the 28th annual international symposium on Computer architecture
General Chair's Message
Page: .07
Program Chair's Message
Page: .08
Conference Organization
Page: .09
Reviewers
Page: .10
Execution-based prediction using speculative slices
Craig Zilles, Gurindar Sohi
Pages: 2-13
doi>10.1145/379240.379246

A relatively small set of static instructions has significant leverage on program execution performance. These problem instructions contribute a disproportionate number of cache misses and branch mispredictions because their behavior cannot be accurately ...
Opening Remarks
Page: 4
Speculative precomputation: long-range prefetching of delinquent loads
Jamison D. Collins, Hong Wang, Dean M. Tullsen, Christopher Hughes, Yong-Fong Lee, Dan Lavery, John P. Shen
Pages: 14-25
doi>10.1145/379240.379248

This paper explores Speculative Precomputation, a technique that uses idle thread context in a multithreaded architecture to improve performance of single-threaded applications. It attacks program stalls from data cache misses by pre-computing future ...
Dynamically allocating processor resources between nearby and distant ILP
Rajeev Balasubramonian, Sandhya Dwarkadas, David H. Albonesi
Pages: 26-37
doi>10.1145/379240.379249

Modern superscalar processors use wide instruction issue widths and out-of-order execution in order to increase instruction-level parallelism (ILP). Because instructions must be committed in order so as to guarantee precise exceptions, increasing ...
Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors
Chi-Keung Luk
Pages: 40-51
doi>10.1145/379240.379250

Hardly predictable data addresses in many irregular applications have rendered prefetching ineffective. In many cases, the only accurate way to predict these addresses is to directly execute the code that generates them. As multithreaded architectures ...
Data prefetching by dependence graph precomputation
Murali Annavaram, Jignesh M. Patel, Edward S. Davidson
Pages: 52-61
doi>10.1145/379240.379251

Data cache misses reduce the performance of wide-issue processors by stalling the data supply to the processor. Prefetching data by predicting the miss address is one way to tolerate the cache miss latencies. But current applications with irregular ...
Concurrency, latency, or system overhead: which has the largest impact on uniprocessor DRAM-system performance?
Vinodh Cuppu, Bruce Jacob
Pages: 62-71
doi>10.1145/379240.379252

Given a fixed CPU architecture and a fixed DRAM timing specification, there is still a large design space for a DRAM system organization. Parameters include the number of memory channels, the bandwidth of each channel, burst sizes, queue sizes and ...
Focusing processor policies via critical-path prediction
Brian Fields, Shai Rubin, Rastislav Bodík
Pages: 74-85
doi>10.1145/379240.379253

Although some instructions hurt performance more than others, current processors typically apply scheduling and speculation as if each instruction was equally costly. Instruction cost can be naturally expressed through the critical path: if we ...
Automated design of finite state machine predictors for customized processors
Timothy Sherwood, Brad Calder
Pages: 86-97
doi>10.1145/379240.379254

Customized processors use compiler analysis and design automation techniques to take a generalized architectural model and create a specific instance of it which is optimized to a given application or set of applications. These processors offer the ...
Better exploration of region-level value locality with integrated computation reuse and value prediction
Youfeng Wu, Dong-Yuan Chen, Jesse Fang
Pages: 98-108
doi>10.1145/379240.379255

Computation-reuse and value-prediction are two recent techniques for improving microprocessor performance by exploiting value localities. They both aim at breaking the data dependence limit in traditional processors. In this paper, we propose a speculative ...
CryptoManiac: a fast flexible architecture for secure communication
Lisa Wu, Chris Weaver, Todd Austin
Pages: 110-119
doi>10.1145/379240.379256

The growth of the Internet as a vehicle for secure communication and electronic commerce has brought cryptographic processing performance to the forefront of high throughput system design. This trend will be further underscored with the widespread ...
QoS provisioning in clusters: an investigation of Router and NIC design
Ki Hwan Yum, Eun Jung Kim, Chita R. Das
Pages: 120-129
doi>10.1145/379240.379257

Design of high performance cluster networks (routers) with Quality-of-Service (QoS) guarantees is becoming increasingly important to support a variety of multimedia applications, many of which have real-time constraints. Most commercial routers, which ...
Locality vs. criticality
Roy Dz-ching Ju, Alvin R. Lebeck, Chris Wilkerson, Srikanth T. Srinivasan
Pages: 132-143
doi>10.1145/379240.379258

Current memory hierarchies exploit locality of references to reduce load latency and thereby improve processor performance. Locality based schemes aim at reducing the number of cache misses and tend to ignore the nature of misses. This leads to a ...
Dead-block prediction & dead-block correlating prefetchers
An-Chow Lai, Cem Fide, Babak Falsafi
Pages: 144-154
doi>10.1145/379240.379259

Effective data prefetching requires accurate mechanisms to predict both “which” cache blocks to prefetch and “when” to prefetch them. This paper proposes the Dead-Block Predictors (DBPs), trace-based predictors that ...
Code layout optimizations for transaction processing workloads
Alex Ramirez, Luiz André Barroso, Kourosh Gharachorloo, Robert Cohn, Josep Larriba-Pey, P. Geoffrey Lowney, Mateo Valero
Pages: 155-164
doi>10.1145/379240.379260

Commercial applications such as databases and Web servers constitute the most important market segment for high-performance servers. Among these applications, on-line transaction processing (OLTP) workloads provide a challenging set of requirements for ...
Exploring and exploiting wire-level pipelining in emerging technologies
Michael Thaddeus Niemier, Peter M. Kogge
Pages: 166-177
doi>10.1145/379240.379261

Pipelining is a technique that has long since been considered fundamental by computer architects. However, the world of nanoelectronics is pushing the idea of pipelining to new and lower levels, particularly the device level. How this affects ...
NanoFabrics: spatial computing using molecular electronics
Seth Copen Goldstein, Mihai Budiu
Pages: 178-191
doi>10.1145/379240.379262

The continuation of the remarkable exponential increases in processing power over the recent past faces imminent challenges due in part to the physics of deep-submicron CMOS devices and the costs of both chip masks and future fabrication plants. A promising ...
A simple method for extracting models for protocol code
David Lie, Andy Chou, Dawson Engler, David L. Dill
Pages: 192-203
doi>10.1145/379240.379263

The use of model checking for validation requires that models of the underlying system be created. Creating such models is both difficult and error prone and as a result, verification is rarely used despite its advantages. In this paper, we present ...
Removing architectural bottlenecks to the scalability of speculative parallelization
Milos Prvulovic, María Jesús Garzarán, Lawrence Rauchwerger, Josep Torrellas
Pages: 204-215
doi>10.1145/379240.379264

Speculative thread-level parallelization is a promising way to speed up codes that compilers fail to parallelize. While several speculative parallelization schemes have been proposed for different machine sizes and types of codes, the results so far ...
Power and energy reduction via pipeline balancing
R. Iris Bahar, Srilatha Manne
Pages: 218-229
doi>10.1145/379240.379265

Minimizing power dissipation is an important design requirement for both portable and non-portable systems. In this work, we propose an architectural solution to the power problem that retains performance while reducing power. The technique, known ...
Energy-effective issue logic
Daniele Folegnani, Antonio González
Pages: 230-239
doi>10.1145/379240.379266

The issue logic of a dynamically-scheduled superscalar processor is a complex mechanism devoted to start the execution of multiple instructions every cycle. Due to its complexity, it is responsible for a significant percentage of the energy consumed ...
Cache decay: exploiting generational behavior to reduce cache leakage power
Stefanos Kaxiras, Zhigang Hu, Margaret Martonosi
Pages: 240-251
doi>10.1145/379240.379268

Power dissipation is increasingly important in CPUs ranging from those intended for mobile use, all the way up to high-performance processors for high-end servers. While the bulk of the power dissipated is dynamic switching power, leakage power is ...
Variability in the execution of multimedia applications and implications for architecture
Christopher J. Hughes, Praful Kaul, Sarita V. Adve, Rohit Jain, Chanik Park, Jayanth Srinivasan
Pages: 254-265
doi>10.1145/379240.379270

Multimedia applications are an increasingly important workload for general-purpose processors. This paper analyzes frame-level execution time variability for several multimedia applications on general-purpose architectures. There are two reasons for ...
Measuring Experimental Error in Microprocessor Simulation
Rajagopalan Desikan, Doug Burger, Stephen W. Keckler
Pages: 266-277
doi>10.1145/379240.565338

We measure the experimental error that arises from the use of non-validated simulators in computer architecture research, with the goal of increasing the rigor of simulation-based studies. We describe the methodology that we used to validate ...
Rapid profiling via stratified sampling
S. Subramanya Sastry, Rastislav Bodík, James E. Smith
Pages: 278-289
doi>10.1145/379240.379273

Sophisticated binary translators and dynamic optimizers demand a program profiler with low overhead, high accuracy, and the ability to collect a variety of profile types. A profiling scheme that achieves these goals is proposed. Conceptually, the ...
Author Index
Page: 291

The ACM Digital Library is published by the Association for Computing Machinery. Copyright © 2016 ACM, Inc.