
ABSTRACT

The microprocessor industry is currently struggling with higher development costs and longer design times that arise from exceedingly complex processors that are pushing the limits of instruction-level parallelism. Meanwhile, such designs are especially ill-suited for important commercial applications, such as on-line transaction processing (OLTP), which suffer from large memory stall times and exhibit little instruction-level parallelism. Given that commercial applications constitute by far the most important market for high-performance servers, the above trends emphasize the need to consider alternative processor designs that specifically target such workloads. The abundance of explicit thread-level parallelism in commercial workloads, along with advances in semiconductor integration density, identifies chip multiprocessing (CMP) as potentially the most promising approach for designing processors targeted at commercial servers. This paper describes the Piranha system, a research prototype being developed at Compaq that aggressively exploits chip multiprocessing by integrating eight simple Alpha processor cores along with a two-level cache hierarchy onto a single chip. Piranha also integrates further on-chip functionality so that scalable multiprocessor configurations can be built in a glueless and modular fashion. The use of simple processor cores combined with an industry-standard ASIC design methodology allows us to complete our prototype within a short time-frame, with a team size and investment that are an order of magnitude smaller than those of a commercial microprocessor. Our detailed simulation results show that while each Piranha processor core is substantially slower than an aggressive next-generation processor, the integration of eight cores onto a single chip allows Piranha to outperform next-generation processors by up to 2.9 times (on a per-chip basis) on important workloads such as OLTP. This performance advantage can approach a factor of five by using full-custom instead of ASIC logic. In addition to exploiting chip multiprocessing, the Piranha prototype incorporates several other unique design choices including a shared second-level cache with no inclusion, a highly optimized cache coherence protocol, and a novel I/O architecture.
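
The per-chip comparison in the abstract can be made concrete with a little arithmetic. The sketch below is a back-of-the-envelope illustration only: it assumes, hypothetically, that a chip's throughput is simply the sum of its cores' throughput (perfect linear scaling), which the abstract does not claim. Only the core count (8) and the speedup figures (2.9 and roughly 5) come from the text; the function name and constants are invented for this example.

```python
# Back-of-the-envelope sketch. Assumes perfect linear scaling across cores,
# which is an illustrative assumption and not a claim made in the abstract.

def implied_per_core_ratio(chip_speedup: float, cores: int) -> float:
    """Per-core throughput relative to the baseline processor, if the
    chip-level speedup were spread evenly over `cores` identical cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return chip_speedup / cores

CORES = 8             # from the abstract
ASIC_SPEEDUP = 2.9    # "up to 2.9 times (on a per chip basis)"
CUSTOM_SPEEDUP = 5.0  # "can approach a factor of five" (used as an upper bound)

asic_ratio = implied_per_core_ratio(ASIC_SPEEDUP, CORES)
custom_ratio = implied_per_core_ratio(CUSTOM_SPEEDUP, CORES)

print(f"ASIC core vs. baseline:        {asic_ratio:.4f}x")
print(f"Full-custom core vs. baseline: {custom_ratio:.4f}x")
```

Under that simplifying assumption, each ASIC core would need only a little over a third of the baseline processor's throughput for the chip as a whole to reach the quoted 2.9 times advantage, which is consistent with the abstract's statement that each core is substantially slower than the next-generation processor.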

AUTHORS



Luiz André Barroso

Personal web page
labatbarroso.org
Bibliometrics: publication history
Publication years: 1991-2014
Publication count: 29
Citation count: 2,195
Available for download: 17
Downloads (6 weeks): 467
Downloads (12 months): 7,947
Downloads (cumulative): 120,930
Average downloads per article: 7,113.53
Average citations per article: 75.69


Kourosh Gharachorloo

Bibliometrics: publication history
Publication years: 1988-2001
Publication count: 36
Citation count: 2,449
Available for download: 26
Downloads (6 weeks): 65
Downloads (12 months): 805
Downloads (cumulative): 21,947
Average downloads per article: 844.12
Average citations per article: 68.03


Robert McNamara

Bibliometrics: publication history
Publication years: 1997-2000
Publication count: 6
Citation count: 227
Available for download: 4
Downloads (6 weeks): 4
Downloads (12 months): 112
Downloads (cumulative): 4,879
Average downloads per article: 1,219.75
Average citations per article: 37.83


Andreas Nowatzyk

Bibliometrics: publication history
Publication years: 1989-2005
Publication count: 13
Citation count: 388
Available for download: 6
Downloads (6 weeks): 5
Downloads (12 months): 157
Downloads (cumulative): 5,696
Average downloads per article: 949.33
Average citations per article: 29.85


Shaz Qadeer

Bibliometrics: publication history
Publication years: 1996-2016
Publication count: 95
Citation count: 2,636
Available for download: 39
Downloads (6 weeks): 84
Downloads (12 months): 1,205
Downloads (cumulative): 22,100
Average downloads per article: 566.67
Average citations per article: 27.75


Barton Sano

Bibliometrics: publication history
Publication years: 1990-2000
Publication count: 4
Citation count: 213
Available for download: 3
Downloads (6 weeks): 3
Downloads (12 months): 66
Downloads (cumulative): 3,088
Average downloads per article: 1,029.33
Average citations per article: 53.25


Scott Smith

Bibliometrics: publication history
Publication years: 2000-2009
Publication count: 11
Citation count: 195
Available for download: 1
Downloads (6 weeks): 3
Downloads (12 months): 57
Downloads (cumulative): 2,530
Average downloads per article: 2,530.00
Average citations per article: 17.73


Robert Stets

Bibliometrics: publication history
Publication years: 1996-2005
Publication count: 13
Citation count: 305
Available for download: 4
Downloads (6 weeks): 8
Downloads (12 months): 99
Downloads (cumulative): 4,961
Average downloads per article: 1,240.25
Average citations per article: 23.46


Ben Verghese

vergheseatstanfordalumni.org
Bibliometrics: publication history
Publication years: 1994-2000
Publication count: 10
Citation count: 404
Available for download: 6
Downloads (6 weeks): 15
Downloads (12 months): 179
Downloads (cumulative): 6,040
Average downloads per article: 1,006.67
Average citations per article: 40.40

CITED BY

192 Citations

PUBLICATION

· Proceeding
Title: ISCA '00 Proceedings of the 27th annual international symposium on Computer architecture
Chairmen: Alan Berenbaum (Lucent Technologies), Joel Emer (Compaq Computer Corp.)
Pages: 282-293
Publication Date: 2000-06-10 (yyyy-mm-dd)
Sponsor: SIGARCH, ACM Special Interest Group on Computer Architecture
Publisher: ACM, New York, NY, USA ©2000
ISBN: 1-58113-232-8
Order Number: 415004
doi>10.1145/339647.339696
Conference: ISCA, International Symposium on Computer Architecture
Overall Acceptance Rate: 533 of 2,983 submissions, 18%

Year       Submitted  Accepted  Rate
ISCA '99   135        26        19%
ISCA '01   163        24        15%
ISCA '02   180        27        15%
ISCA '03   184        36        20%
ISCA '04   217        31        14%
ISCA '05   194        45        23%
ISCA '06   234        31        13%
ISCA '07   204        46        23%
ISCA '08   259        37        14%
ISCA '09   210        43        20%
ISCA '10   245        44        18%
ISCA '11   208        40        19%
ISCA '12   262        47        18%
ISCA '13   288        56        19%
Overall    2,983      533       18%
· Newsletter
Title: ACM SIGARCH Computer Architecture News - Special Issue: Proceedings of the 27th annual international symposium on Computer architecture (ISCA '00)
Volume 28, Issue 2, May 2000
Chairmen: Alan Berenbaum (Lucent Technologies, Berkeley Heights, NJ), Joel Emer (Compaq Computer Corp., Palo Alto, CA)
Pages: 282-293
Publication Date: 2000-05-01 (yyyy-mm-dd)
Sponsor: SIGARCH, ACM Special Interest Group on Computer Architecture
Publisher: ACM, New York, NY, USA
ISSN: 0163-5964
doi>10.1145/342001.339696

APPEARS IN
Hardware Design
Performance

Table of Contents

Proceedings of the 27th annual international symposium on Computer architecture
A scalable approach to thread-level speculation
J. Greggory Steffan, Christopher B. Colohan, Antonia Zhai, Todd C. Mowry
Pages: 1-12
doi>10.1145/339647.339650

While architects understand how to build cost-effective parallel machines across a wide spectrum of machine sizes (ranging from within a single chip to large-scale servers), the real challenge is how to easily create parallel software ...

Architectural support for scalable speculative parallelization in shared-memory multiprocessors
Marcelo Cintra, José F. Martínez, Josep Torrellas
Pages: 13-24
doi>10.1145/339647.363382

Speculative parallelization aggressively executes in parallel codes that cannot be fully parallelized by the compiler. Past proposals of hardware schemes have mostly focused on single-chip multiprocessors (CMPs), whose effectiveness is necessarily limited ...

Transient fault detection via simultaneous multithreading
Steven K. Reinhardt, Shubhendu S. Mukherjee
Pages: 25-36
doi>10.1145/339647.339652

Smaller feature sizes, reduced voltage levels, higher transistor counts, and reduced noise margins make future generations of microprocessors increasingly prone to transient hardware faults. Most commercial fault-tolerant computers use fully ...

Trace preconstruction
Quinn Jacobson, James E. Smith
Pages: 37-46
doi>10.1145/339647.339653

Trace caches enable high bandwidth, low latency instruction supply, but have a high miss penalty and relatively large working sets. Consequently, their performance may suffer due to capacity and compulsory misses. Trace preconstruction augments a trace ...

Completion time multiple branch prediction for enhancing trace cache performance
Ryan Rakvic, Bryan Black, John Paul Shen
Pages: 47-58
doi>10.1145/339647.339654

The need for multiple branch prediction is inherent to wide instruction fetching. This paper presents a completion time multiple branch predictor called the Tree-based Multiple Branch Predictor (TMP) that builds on previous single branch prediction ...

A hardware mechanism for dynamic extraction and relayout of program hot spots
Matthew C. Merten, Andrew R. Trick, Erik M. Nystrom, Ronald D. Barnes, Wen-mei W. Hwu
Pages: 59-70
doi>10.1145/339647.339655

This paper presents a new mechanism for collecting and deploying runtime optimized code. The code-collecting component resides in the instruction retirement stage and lays out hot execution paths to improve instruction fetch rate as well as enable further ...

HLS: combining statistical and symbolic simulation to guide microprocessor designs
Mark Oskin, Frederic T. Chong, Matthew Farrens
Pages: 71-82
doi>10.1145/339647.339656

As microprocessors continue to evolve, many optimizations reach a point of diminishing returns. We introduce HLS, a hybrid processor simulator which uses statistical models and symbolic execution to evaluate design alternatives. This simulation ...

Wattch: a framework for architectural-level power analysis and optimizations
David Brooks, Vivek Tiwari, Margaret Martonosi
Pages: 83-94
doi>10.1145/339647.339657

Power dissipation and thermal issues are increasingly significant in modern processors. As a result, it is crucial that power/performance tradeoffs be made more visible to chip architects and even compiler writers, in addition to circuit designers. Most ...

Energy-driven integrated hardware-software optimizations using SimplePower
N. Vijaykrishnan, M. Kandemir, M. J. Irwin, H. S. Kim, W. Ye
Pages: 95-106
doi>10.1145/339647.339659

With the emergence of a plethora of embedded and portable applications, energy dissipation has joined throughput, area, and accuracy/precision as a major design constraint. Thus, designers must be concerned with both optimizing and estimating the energy ...

A fully associative software-managed cache design
Erik G. Hallnor, Steven K. Reinhardt
Pages: 107-116
doi>10.1145/339647.339660

As DRAM access latencies approach a thousand instruction-execution times and on-chip caches grow to multiple megabytes, it is not clear that conventional cache structures continue to be appropriate. Two key features—full associativity and software ...
Recency-based TLB preloading
Ashley Saulsbury, Fredrik Dahlgren, Per Stenström
Pages: 117-127
doi>10.1145/339647.339666

Caching and other latency tolerating techniques have been quite successful in maintaining high memory system performance for general purpose processors. However, TLB misses have become a serious bottleneck as working sets are growing beyond the ...

Memory access scheduling
Scott Rixner, William J. Dally, Ujval J. Kapasi, Peter Mattson, John D. Owens
Pages: 128-138
doi>10.1145/339647.339668

The bandwidth and latency of a memory system are strongly dependent on the manner in which accesses interact with the “3-D” structure of banks, rows, and columns characteristic of contemporary DRAM chips. There is nearly an order ...

Selective, accurate, and timely self-invalidation using last-touch prediction
An-Chow Lai, Babak Falsafi
Pages: 139-148
doi>10.1145/339647.339669

Communication in cache-coherent distributed shared memory (DSM) often requires invalidating (or writing back) cached copies of a memory block, incurring high overheads. This paper proposes Last-Touch Predictors (LTPs) that learn ...

An embedded DRAM architecture for large-scale spatial-lattice computations
Norman Margolus
Pages: 149-160
doi>10.1145/339647.339672

Spatial-lattice computations with finite-range interactions are an important class of easily parallelized computations. This class includes many simple and direct algorithms for physical simulation, virtual-reality simulation, agent-based modeling, logic ...

Smart Memories: a modular reconfigurable architecture
Ken Mai, Tim Paaske, Nuwan Jayasena, Ron Ho, William J. Dally, Mark Horowitz
Pages: 161-171
doi>10.1145/339647.339673

Trends in VLSI technology scaling demand that future computing devices be narrowly focused to achieve high performance and high efficiency, yet also target the high volumes and low costs of widely applicable general purpose designs. To address these ...

Understanding the backward slices of performance degrading instructions
Craig B. Zilles, Gurindar S. Sohi
Pages: 172-181
doi>10.1145/339647.339676

For many applications, branch mispredictions and cache misses limit a processor's performance to a level well below its peak instruction throughput. A small fraction of static instructions, whose behavior cannot be anticipated using current branch ...

On the value locality of store instructions
Kevin M. Lepak, Mikko H. Lipasti
Pages: 182-191
doi>10.1145/339647.339678

Value locality, a recently discovered program attribute that describes the likelihood of the recurrence of previously-seen program values, has been studied enthusiastically in the recent published literature. Much of the energy has focused on refining ...

Performance analysis of the Alpha 21264-based Compaq ES40 system
Zarka Cvetanovic, R. E. Kessler
Pages: 192-202
doi>10.1145/339647.339680

This paper evaluates performance characteristics of the Compaq ES40 shared memory multiprocessor. The ES40 system contains up to four Alpha 21264 CPU's together with a high-performance memory system. We qualitatively describe architectural features ...

Lx: a technology platform for customizable VLIW embedded processing
Paolo Faraboschi, Geoffrey Brown, Joseph A. Fisher, Giuseppe Desoli, Fred Homewood
Pages: 203-213
doi>10.1145/339647.339682

Lx is a scalable and customizable VLIW processor technology platform designed by Hewlett-Packard and STMicroelectronics that allows variations in instruction issue width, the number and capabilities of structures and the processor instruction set. For ...

Reconfigurable caches and their application to media processing
Parthasarathy Ranganathan, Sarita Adve, Norman P. Jouppi
Pages: 214-224
doi>10.1145/339647.339685

High performance general-purpose processors are increasingly being used for a variety of application domains - scientific, engineering, databases, and more recently, media processing. It is therefore important to ensure that architectural features ...

CHIMAERA: a high-performance architecture with a tightly-coupled reconfigurable functional unit
Zhi Alex Ye, Andreas Moshovos, Scott Hauck, Prithviraj Banerjee
Pages: 225-235
doi>10.1145/339647.339687

Reconfigurable hardware has the potential for significant performance improvements by providing support for application-specific operations. We report our experience with Chimaera, a prototype system that integrates a small and fast reconfigurable ...

Circuits for wide-window superscalar processors
Dana S. Henry, Bradley C. Kuszmaul, Gabriel H. Loh, Rahul Sami
Pages: 236-247
doi>10.1145/339647.339689

Our program benchmarks and simulations of novel circuits indicate that large-window processors are feasible. Using our redesigned superscalar components, a large-window processor implemented in today's technology can achieve an increase of 10-60% (geometric ...

Clock rate versus IPC: the end of the road for conventional microarchitectures
Vikas Agarwal, M. S. Hrishikesh, Stephen W. Keckler, Doug Burger
Pages: 248-259
doi>10.1145/339647.339691

The doubling of microprocessor performance every three years has been the result of two factors: more transistors per chip and superlinear scaling of the processor clock with technology generation. Our results show that, due to both diminishing ...

Vector instruction set support for conditional operations
J. E. Smith, Greg Faanes, Rabin Sugumar
Pages: 260-269
doi>10.1145/339647.339693

Vector instruction sets are receiving renewed interest because of their applicability to multimedia. Current multimedia instruction sets use short vectors with SIMD implementations, but long vector, pipelined implementations have a number of advantages ...

Instruction path coprocessors
Yuan Chou, John Paul Shen
Pages: 270-281
doi>10.1145/339647.339694

This paper presents the concept of an Instruction Path Coprocessor (I-COP), which is a programmable on-chip coprocessor, with its own mini-instruction set, that operates on the core processor's instructions to transform them into an internal ...

Piranha: a scalable architecture based on single-chip multiprocessing
Luiz André Barroso, Kourosh Gharachorloo, Robert McNamara, Andreas Nowatzyk, Shaz Qadeer, Barton Sano, Scott Smith, Robert Stets, Ben Verghese
Pages: 282-293
doi>10.1145/339647.339696

The microprocessor industry is currently struggling with higher development costs and longer design times that arise from exceedingly complex processors that are pushing the limits of instruction-level parallelism. Meanwhile, such designs are especially ...

Allowing for ILP in an embedded Java processor
Ramesh Radhakrishnan, Deependra Talla, Lizy Kurian John
Pages: 294-305
doi>10.1145/339647.339702

Java processors are ideal for embedded and network computing applications such as Internet TV's, set-top boxes, smart phones, and other consumer electronics applications. In this paper, we investigate cost-effective microarchitectural techniques ...

Early load address resolution via register tracking
Michael Bekerman, Adi Yoaz, Freddy Gabbay, Stephan Jourdan, Maxim Kalaev, Ronny Ronen
Pages: 306-315
doi>10.1145/339647.339705

Higher microprocessor frequencies accentuate the performance cost of memory accesses. This is especially noticeable in the Intel's IA32 architecture where lack of registers results in increased number of memory accesses. This paper presents novel, ...

Multiple-banked register file architectures
José-Lorenzo Cruz, Antonio González, Mateo Valero, Nigel P. Topham
Pages: 316-325
doi>10.1145/339647.339708

The register file access time is one of the critical delays in current superscalar processors. Its impact on processor performance is likely to increase in future processor generations, as they are expected to increase the issue width (which implies ...
