
ABSTRACT

This paper describes the system architecture of the Cray BlackWidow scalable vector multiprocessor. The BlackWidow system is a distributed shared memory (DSM) architecture that is scalable to 32K processors, each with a 4-way dispatch scalar execution unit and an 8-pipe vector unit capable of 20.8 Gflops for 64-bit operations and 41.6 Gflops for 32-bit operations at the prototype operating frequency of 1.3 GHz. Global memory is directly accessible with processor loads and stores and is globally coherent. The system supports thousands of outstanding references to hide remote memory latencies, and provides a rich suite of built-in synchronization primitives. Each BlackWidow node is implemented as a 4-way SMP with up to 128 Gbytes of DDR2 main memory capacity. The system supports common programming models such as MPI and OpenMP, as well as global address space languages such as UPC and CAF. We describe the system architecture and microarchitecture of the processor, memory controller, and router chips. We give preliminary performance results and discuss design tradeoffs.
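The peak figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below is illustrative only: it uses the numbers stated above, reads "32K" as 32 x 1024 and 1 Tbyte as 1024 Gbytes (both assumptions), and the function and constant names are hypothetical. The derived values are theoretical peaks, not measured results.

```python
# Back-of-the-envelope peak rates from the figures quoted in the abstract.
# Derived values are theoretical peaks, not measured results.

CLOCK_GHZ = 1.3               # prototype operating frequency (abstract)
PEAK_GFLOPS_64BIT = 20.8      # per-processor peak, 64-bit ops (abstract)
PEAK_GFLOPS_32BIT = 41.6      # per-processor peak, 32-bit ops (abstract)
VECTOR_PIPES = 8              # pipes in the vector unit (abstract)
PROCESSORS_PER_NODE = 4       # 4-way SMP node (abstract)
NODE_MEMORY_GBYTES = 128      # maximum DDR2 capacity per node (abstract)
MAX_PROCESSORS = 32 * 1024    # ASSUMPTION: "32K" read as 32 x 1024


def flops_per_cycle(peak_gflops: float, clock_ghz: float = CLOCK_GHZ) -> float:
    """Peak floating-point operations completed per clock cycle."""
    return round(peak_gflops / clock_ghz, 6)


def derived_peaks() -> dict:
    """Collect the derived per-pipe, per-node, and full-system figures."""
    per_cycle_64 = flops_per_cycle(PEAK_GFLOPS_64BIT)
    per_cycle_32 = flops_per_cycle(PEAK_GFLOPS_32BIT)
    nodes = MAX_PROCESSORS // PROCESSORS_PER_NODE
    return {
        "flops_per_cycle_64bit": per_cycle_64,
        "flops_per_cycle_32bit": per_cycle_32,
        "flops_per_cycle_per_pipe_64bit": per_cycle_64 / VECTOR_PIPES,
        "node_peak_gflops_64bit": PROCESSORS_PER_NODE * PEAK_GFLOPS_64BIT,
        "max_nodes": nodes,
        "system_peak_tflops_64bit": MAX_PROCESSORS * PEAK_GFLOPS_64BIT / 1000,
        # ASSUMPTION: 1 Tbyte = 1024 Gbytes
        "max_memory_tbytes": nodes * NODE_MEMORY_GBYTES / 1024,
    }


if __name__ == "__main__":
    for name, value in derived_peaks().items():
        print(f"{name}: {value:g}")
```

At 1.3 GHz, 20.8 Gflops works out to 16 floating-point operations per cycle, or 2 per vector pipe. That would be consistent with, for example, one add and one multiply per pipe each cycle, although the abstract does not state how the rate is reached.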


AUTHORS



Dennis Abts

dabts at google.com
Bibliometrics: publication history
Publication years: 1999-2014
Publication count: 21
Citation Count: 355
Available for download: 14
Downloads (6 Weeks): 74
Downloads (12 Months): 886
Downloads (cumulative): 11,099
Average downloads per article: 792.79
Average citations per article: 16.90


Abdulla Bataineh

Bibliometrics: publication history
Publication years: 1991-2012
Publication count: 9
Citation Count: 43
Available for download: 4
Downloads (6 Weeks): 10
Downloads (12 Months): 129
Downloads (cumulative): 1,239
Average downloads per article: 309.75
Average citations per article: 4.78


Steve Scott

Bibliometrics: publication history
Publication years: 1989-2012
Publication count: 77
Citation Count: 596
Available for download: 23
Downloads (6 Weeks): 50
Downloads (12 Months): 494
Downloads (cumulative): 12,774
Average downloads per article: 555.39
Average citations per article: 7.74


Greg Faanes

Bibliometrics: publication history
Publication years: 1994-2012
Publication count: 5
Citation Count: 67
Available for download: 4
Downloads (6 Weeks): 16
Downloads (12 Months): 188
Downloads (cumulative): 2,306
Average downloads per article: 576.50
Average citations per article: 13.40


Jim Schwarzmeier

Bibliometrics: publication history
Publication years: 2005-2013
Publication count: 4
Citation Count: 19
Available for download: 2
Downloads (6 Weeks): 6
Downloads (12 Months): 60
Downloads (cumulative): 1,330
Average downloads per article: 665.00
Average citations per article: 4.75


Eric Lundberg

Bibliometrics: publication history
Publication years: 2007-2007
Publication count: 1
Citation Count: 16
Available for download: 1
Downloads (6 Weeks): 3
Downloads (12 Months): 39
Downloads (cumulative): 481
Average downloads per article: 481.00
Average citations per article: 16.00


Tim Johnson

Bibliometrics: publication history
Publication years: 2007-2012
Publication count: 3
Citation Count: 38
Available for download: 2
Downloads (6 Weeks): 10
Downloads (12 Months): 128
Downloads (cumulative): 953
Average downloads per article: 476.50
Average citations per article: 12.67


Mike Bye

Bibliometrics: publication history
Publication years: 2007-2007
Publication count: 1
Citation Count: 16
Available for download: 1
Downloads (6 Weeks): 3
Downloads (12 Months): 39
Downloads (cumulative): 481
Average downloads per article: 481.00
Average citations per article: 16.00


Gerald Schwoerer

Bibliometrics: publication history
Publication years: 2007-2007
Publication count: 1
Citation Count: 16
Available for download: 1
Downloads (6 Weeks): 3
Downloads (12 Months): 39
Downloads (cumulative): 481
Average downloads per article: 481.00
Average citations per article: 16.00

CITED BY

16 Citations

PUBLICATION

Title: SC '07 Proceedings of the 2007 ACM/IEEE conference on Supercomputing
General Chair: Becky Verastegui, Oak Ridge National Laboratory
Article No.: 17
Publication Date: 2007-11-16
Sponsors: SIGARCH (ACM Special Interest Group on Computer Architecture), IEEE-CS\DATC (IEEE Computer Society)
Publisher: ACM, New York, NY, USA ©2007
ISBN: 978-1-59593-764-3
doi>10.1145/1362622.1362646
Conference: SC, The International Conference for High Performance Computing, Networking, Storage, and Analysis
Paper Acceptance Rate: 54 of 268 submissions, 20%
Overall Acceptance Rate: 1,374 of 5,604 submissions, 25%
Year                Submitted  Accepted  Rate
Supercomputing '91        215        83   39%
Supercomputing '92        220        75   34%
Supercomputing '93        300        72   24%
Supercomputing '95        241        69   29%
Supercomputing '00        179        62   35%
Supercomputing '01        240        60   25%
Supercomputing '02        230        67   29%
SC '03                    207        60   29%
SC '04                    200        60   30%
SC '05                    260        62   24%
SC '06                    239        54   23%
SC '07                    268        54   20%
SC '08                    277        59   21%
SC '09                    261        59   23%
SC '10                    253        51   20%
SC '11                    352        74   21%
SC '12                    461       100   22%
SC '13                    449        91   20%
SC '14                    394        83   21%
SC '15                    358        79   22%
Overall                 5,604     1,374   25%

APPEARS IN
Performance
Hardware Design

Table of Contents

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
SESSION: Keynote address
Programming bits and atoms
Neil Gershenfeld
Article No.: 1
doi>10.1145/1362622.1362624
SESSION: Computational biology
Ann L. Chervenak
A preliminary investigation of a neocortex model implementation on the Cray XD1
Kenneth L. Rice, Christopher N. Vutsinas, Tarek M. Taha
Article No.: 2
doi>10.1145/1362622.1362626

In this paper we study the acceleration of a new class of cognitive processing applications based on the structure of the neocortex. Specifically we examine the speedup of a visual cortex model for image recognition. We propose techniques to accelerate ...
Anatomy of a cortical simulator
Rajagopal Ananthanarayanan, Dharmendra S. Modha
Article No.: 3
doi>10.1145/1362622.1362627

Insights into brain's high-level computational principles will lead to novel cognitive systems, computing architectures, programming paradigms, and numerous practical applications. An important step towards this end is the study of large networks of ...
Large-scale maximum likelihood-based phylogenetic analysis on the IBM BlueGene/L
Michael Ott, Jaroslaw Zola, Alexandros Stamatakis, Srinivas Aluru
Article No.: 4
doi>10.1145/1362622.1362628

Phylogenetic inference is a grand challenge in Bioinformatics due to immense computational requirements. The increasing popularity of multi-gene alignments in biological studies, which typically provide a stable topological signal due to a more ...
SESSION: Network switching and routing
Keith Underwood
Age-based packet arbitration in large-radix k-ary n-cubes
Dennis Abts, Deborah Weisser
Article No.: 5
doi>10.1145/1362622.1362630

As applications scale to increasingly large processor counts, the interconnection network is frequently the limiting factor in application performance. In order to achieve application scalability, the interconnect must maintain high bandwidth while minimizing ...
Performance adaptive power-aware reconfigurable optical interconnects for high-performance computing (HPC) systems
Avinash Kodi, Ahmed Louri
Article No.: 6
doi>10.1145/1362622.1362631

As communication distances and bit rates increase, optoelectronic interconnects are being deployed for designing high-bandwidth low-latency interconnection networks for high performance computing (HPC) systems. While bandwidth scaling with efficient ...
Evaluating network information models on resource efficiency and application performance in lambda-grids
Nut Taesombut, Andrew A. Chien
Article No.: 7
doi>10.1145/1362622.1362632

A critical challenge for wide-area configurable networks is definition and widespread acceptance of Network Information Model (NIM). When a network comprises multiple domains, intelligent information sharing is required for a provider to maintain a competitive ...
SESSION: System performance
Bronis R. de Supinski
Using MPI file caching to improve parallel write performance for large-scale scientific applications
Wei-keng Liao, Avery Ching, Kenin Coloma, Arifa Nisar, Alok Choudhary, Jacqueline Chen, Ramanan Sankaran, Scott Klasky
Article No.: 8
doi>10.1145/1362622.1362634

Typical large-scale scientific applications periodically write checkpoint files to save the computational state throughout execution. Existing parallel file systems improve such write-only I/O patterns through the use of client-side file caching and ...
Virtual machine aware communication libraries for high performance computing
Wei Huang, Matthew J. Koop, Qi Gao, Dhabaleswar K. Panda
Article No.: 9
doi>10.1145/1362622.1362635

As the size and complexity of modern computing systems keep increasing to meet the demanding requirements of High Performance Computing (HPC) applications, manageability is becoming a critical concern to achieve both high performance and high productivity ...
Investigation of leading HPC I/O performance using a scientific-application derived benchmark
Julian Borrill, Leonid Oliker, John Shalf, Hongzhang Shan
Article No.: 10
doi>10.1145/1362622.1362636

With the exponential growth of high-fidelity sensor and simulated data, the scientific community is increasingly reliant on ultrascale HPC resources to handle their data analysis requirements. However, to utilize such extreme computing power effectively, ...
SESSION: Grid scheduling
Satoshi Matsuoka
Automatic resource specification generation for resource selection
Richard Huang, Henri Casanova, Andrew A. Chien
Article No.: 11
doi>10.1145/1362622.1362638

With an increasing number of available resources in large-scale distributed environments, a key challenge is resource selection. Fortunately, several middleware systems provide resource selection services. However, a user is still faced with a difficult ...
Performance and cost optimization for multiple large-scale grid workflow applications
Rubing Duan, Radu Prodan, Thomas Fahringer
Article No.: 12
doi>10.1145/1362622.1362639

Scheduling large-scale applications on the Grid is a fundamental challenge and is critical to application performance and cost. Large-scale applications typically contain a large number of homogeneous and concurrent activities which are main bottlenecks, ...
Inter-operating grids through delegated matchmaking
Alexandru Iosup, Dick H. J. Epema, Todd Tannenbaum, Matthew Farrellee, Miron Livny
Article No.: 13
doi>10.1145/1362622.1362640

The grid vision of a single computing utility has yet to materialize: while many grids with thousands of processors each exist, most work in isolation. An important obstacle for the effective and efficient inter-operation of grids is the problem ...
SESSION: Security and fault tolerance
Karsten Schwan
Automatic software interference detection in parallel applications
Vahid Tabatabaee, Jeffrey K. Hollingsworth
Article No.: 14
doi>10.1145/1362622.1362642

We present an automated software interference detection methodology for Single Program, Multiple Data (SPMD) parallel applications. Interference comes from the system and unexpected processes. If not detected and corrected such interference may result ...
DMTracker: finding bugs in large-scale parallel programs by detecting anomaly in data movements
Qi Gao, Feng Qin, Dhabaleswar K. Panda
Article No.: 15
doi>10.1145/1362622.1362643

While software reliability in large-scale systems becomes increasingly important, debugging in large-scale parallel systems remains a daunting task. This paper proposes an innovative technique to find hard-to-detect software bugs that can cause ...
Scalable security for petascale parallel file systems
Andrew W. Leung, Ethan L. Miller, Stephanie Jones
Article No.: 16
doi>10.1145/1362622.1362644

Petascale, high-performance file systems often hold sensitive data and thus require security, but authentication and authorization can dramatically reduce performance. Existing security solutions perform poorly in these environments because they cannot ...
SESSION: System architecture
John B. Carter
The Cray BlackWidow: a highly scalable vector multiprocessor
Dennis Abts, Abdulla Bataineh, Steve Scott, Greg Faanes, Jim Schwarzmeier, Eric Lundberg, Tim Johnson, Mike Bye, Gerald Schwoerer
Article No.: 17
doi>10.1145/1362622.1362646

This paper describes the system architecture of the Cray BlackWidow scalable vector multiprocessor. The BlackWidow system is a distributed shared memory (DSM) architecture that is scalable to 32K processors, each with a 4-way dispatch scalar execution ...
GRAPE-DR: 2-Pflops massively-parallel computer with 512-core, 512-Gflops processor chips for scientific computing
Junichiro Makino, Kei Hiraki, Mary Inaba
Article No.: 18
doi>10.1145/1362622.1362647

We describe the GRAPE-DR (Greatly Reduced Array of Processor Elements with Data Reduction) system, which will consist of 4096 processor chips each with 512 cores operating at the clock frequency of 500 MHz. The peak speed of a processor chip is 512Gflops ...
A case for low-complexity MP architectures
Håkan Zeffer, Erik Hagersten
Article No.: 19
doi>10.1145/1362622.1362648

Advances in semiconductor technology have driven shared-memory servers toward processors with multiple cores per die and multiple threads per core. This paper presents simple hardware primitives enabling flexible and low-complexity multi-chip designs ...
SESSION: Microarchitecture
Dennis Abts
Variable latency caches for nanoscale processor
Serkan Ozdemir, Arindam Mallik, Ja Chun Ku, Gokhan Memik, Yehea Ismail
Article No.: 20
doi>10.1145/1362622.1362650

Variability is one of the important issues in nanoscale processors. Due to increasing importance of interconnect structures in submicron technologies, the physical location and phenomena such as coupling have an increasing impact on the latency of operations. ...
Data access history cache and associated data prefetching mechanisms
Yong Chen, Surendra Byna, Xian-He Sun
Article No.: 21
doi>10.1145/1362622.1362651

Data prefetching is an effective way to bridge the increasing performance gap between processor and memory. As computing power is increasing much faster than memory performance, we suggest that it is time to have a dedicated cache to store data access ...
Scaling performance of interior-point method on large-scale chip multiprocessor system
Mikhail Smelyanskiy, Victor W Lee, Daehyun Kim, Anthony D Nguyen, Pradeep Dubey
Article No.: 22
doi>10.1145/1362622.1362652

In this paper we describe parallelization of interior-point method (IPM) aimed at achieving high scalability on large-scale chip-multiprocessors (CMPs). IPM is an important computational technique used to solve optimization problems in many areas of ...
SESSION: PDE applications
Omar Ghattas
Data exploration of turbulence simulations using a database cluster
Eric Perlman, Randal Burns, Yi Li, Charles Meneveau
Article No.: 23
doi>10.1145/1362622.1362654

We describe a new environment for the exploration of turbulent flows that uses a cluster of databases to store complete histories of Direct Numerical Simulation (DNS) results. This allows for spatial and temporal exploration of high-resolution data that ...
Parallel hierarchical visualization of large time-varying 3D vector fields
Hongfeng Yu, Chaoli Wang, Kwan-Liu Ma
Article No.: 24
doi>10.1145/1362622.1362655

We present the design of a scalable parallel pathline construction method for visualizing large time-varying 3D vector fields. A 4D (i.e., time and the 3D spatial domain) representation of the vector field is introduced to make a time-accurate depiction ...
Low-constant parallel algorithms for finite element simulations using linear octrees
Hari Sundar, Rahul S. Sampath, Santi S. Adavani, Christos Davatzikos, George Biros
Article No.: 25
doi>10.1145/1362622.1362656

In this article we propose parallel algorithms for the construction of conforming finite-element discretization on linear octrees. Existing octree-based discretizations scale to billions of elements, but the complexity constants can be high. In our approach ...
SESSION: File systems
Frank Mueller
Noncontiguous locking techniques for parallel file systems
Avery Ching, Wei-keng Liao, Alok Choudhary, Robert Ross, Lee Ward
Article No.: 26
doi>10.1145/1362622.1362658

Many parallel scientific applications use high-level I/O APIs that offer atomic I/O capabilities. Atomic I/O in current parallel file systems is often slow when multiple processes simultaneously access interleaved, shared files. Current atomic I/O solutions ...
Integrating parallel file systems with object-based storage devices
Ananth Devulapalli, Dennis Dalessandro, Pete Wyckoff, Nawab Ali, P. Sadayappan
Article No.: 27
doi>10.1145/1362622.1362659

As storage systems evolve, the block-based design of today's disks is becoming inadequate. As an alternative, object-based storage devices (OSDs) offer a view where the disk manages data layout and keeps track of various attributes about data objects. ...
Evaluation of active storage strategies for the lustre parallel file system
Juan Piernas, Jarek Nieplocha, Evan J. Felix
Article No.: 28
doi>10.1145/1362622.1362660

Active Storage provides an opportunity for reducing the amount of data movement between storage and compute nodes of a parallel filesystem such as Lustre, and PVFS. It allows certain types of data processing operations to be performed directly on the ...
SESSION: Performance tools and methods
Bernd Mohr
The ghost in the machine: observing the effects of kernel operation on parallel application performance
Aroon Nataraj, Alan Morris, Allen D. Malony, Matthew Sottile, Pete Beckman
Article No.: 29
doi>10.1145/1362622.1362662

The performance of a parallel application on a scalable HPC system is determined by user-level execution of the application code and system-level (OS kernel) operations. To understand the influences of system-level factors on application performance, ...
PNMPI tools: a whole lot greater than the sum of their parts
Martin Schulz, Bronis R. de Supinski
Article No.: 30
doi>10.1145/1362622.1362663

PNMPI extends the PMPI profiling interface to support multiple concurrent PMPI-based tools by enabling users to assemble tool stacks. We extend this basic concept to include new services for tool interoperability and to switch between ...
Multi-threading and one-sided communication in parallel LU factorization
Parry Husbands, Katherine Yelick
Article No.: 31
doi>10.1145/1362622.1362664

Dense LU factorization has a high ratio of computation to communication and, as evidenced by the High Performance Linpack (HPL) benchmark, this property makes it scale well on most parallel machines. Nevertheless, the standard algorithm for this problem ...
SESSION: Grid management
Philip M. Papadopoulos
Workstation capacity tuning using reinforcement learning
Aharon Bar-Hillel, Amir Di-Nur, Liat Ein-Dor, Ran Gilad-Bachrach, Yossi Ittach
Article No.: 32
doi>10.1145/1362622.1362666

Computer grids are complex, heterogeneous, and dynamic systems, whose behavior is governed by hundreds of manually-tuned parameters. As the complexity of these systems grows, automating the procedure of parameter tuning becomes indispensable. In this ...
Anomaly detection and diagnosis in grid environments
Lingyun Yang, Chuang Liu, Jennifer M. Schopf, Ian Foster
Article No.: 33
doi>10.1145/1362622.1362667

Identifying and diagnosing anomalies in application behavior is critical to delivering reliable application-level performance. In this paper we introduce a strategy to detect anomalies and diagnose the possible reasons behind them. Our approach extends ...
User-friendly and reliable grid computing based on imperfect middleware
Rob V. van Nieuwpoort, Thilo Kielmann, Henri E. Bal
Article No.: 34
doi>10.1145/1362622.1362668

Writing grid applications is hard. First, interfaces to existing grid middleware often are too low-level for application programmers who are domain experts rather than computer scientists. Second, grid APIs tend to evolve too quickly for applications ...
SESSION: Network interfaces
Scott Pakin
Analyzing the impact of supporting out-of-order communication on in-order performance with iWARP
P. Balaji, W. Feng, S. Bhagvat, D. K. Panda, R. Thakur, W. Gropp
Article No.: 35
doi>10.1145/1362622.1362670

Due to the growing need to tolerate network faults and congestion in high-end computing systems, supporting multiple network communication paths is becoming increasingly important. However, multi-path communication comes with the disadvantage of out-of-order ...
Evaluating NIC hardware requirements to achieve high message rate PGAS support on multi-core processors
Keith D. Underwood, Michael J. Levenhagen, Ron Brightwell
Article No.: 36
doi>10.1145/1362622.1362671

Partitioned global address space (PGAS) programming models have been identified as one of the few viable approaches for dealing with emerging many-core systems. These models tend to generate many small messages, which requires specific support from the ...
High-performance ethernet-based communications for future multi-core processors
Michael Schlansker, Nagabhushan Chitlur, Erwin Oertli, Paul M. Stillwell, Jr, Linda Rankin, Dennis Bradford, Richard J. Carter, Jayaram Mudigonda, Nathan Binkert, Norman P. Jouppi
Article No.: 37
doi>10.1145/1362622.1362672

Data centers and HPC clusters often incorporate specialized networking fabrics to satisfy system requirements. However, Ethernet's low cost and high performance are causing a shift from specialized fabrics toward standard Ethernet. Although Ethernet's ...
SESSION: Benchmarking
Allan Snavely
Optimization of sparse matrix-vector multiplication on emerging multicore platforms
Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel
Article No.: 38
doi>10.1145/1362622.1362674

We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, ...
Cray XT4: an early evaluation for petascale scientific simulation
Sadaf R. Alam, Jeffery A. Kuehn, Richard F. Barrett, Jeff M. Larkin, Mark R. Fahey, Ramanan Sankaran, Patrick H. Worley
Article No.: 39
doi>10.1145/1362622.1362675

The scientific simulation capabilities of next generation high-end computing technology will depend on striking a balance among memory, processor, I/O, and local and global network performance across the breadth of the scientific simulation space. The ...
An adaptive mesh refinement benchmark for modern parallel programming languages
Tong Wen, Jimmy Su, Phillip Colella, Katherine Yelick, Noel Keen
Article No.: 40
doi>10.1145/1362622.1362676

We present an Adaptive Mesh Refinement benchmark for evaluating programmability and performance of modern parallel programming languages. Benchmarks employed today by language developing teams, originally designed for performance evaluation of computer ...
SESSION: Grid performance
Daniel S. Katz
Exploring event correlation for failure prediction in coalitions of clusters
Song Fu, Cheng-Zhong Xu
Article No.: 41
doi>10.1145/1362622.1362678

In large-scale networked computing systems, component failures become norms instead of exceptions. Failure prediction is a crucial technique for self-managing resource burdens. Failure events in coalition systems exhibit strong correlations in time and ...
Advanced data flow support for scientific grid workflow applications
Jun Qin, Thomas Fahringer
Article No.: 42
doi>10.1145/1362622.1362679

Existing work does not provide a flexible dataset-oriented data flow mechanism to meet the complex requirements of scientific Grid workflow applications. In this paper we present a sophisticated approach to this problem by introducing a data collection ...
Falkon: a Fast and Light-weight tasK executiON framework
Ioan Raicu, Yong Zhao, Catalin Dumitrescu, Ian Foster, Mike Wilde
Article No.: 43
doi>10.1145/1362622.1362680

To enable the rapid execution of many tasks on compute clusters, we have developed Falkon, a Fast and Light-weight tasK executiON framework. Falkon integrates (1) multi-level scheduling to separate resource acquisition (via, e.g., requests to batch schedulers) ...
SESSION: Storage, file systems, and GPU hashing
Brett M. Bode
RobuSTore: a distributed storage architecture with robust and high performance
Huaxia Xia, Andrew A. Chien
Article No.: 44
doi>10.1145/1362622.1362682

Emerging large-scale scientific applications require to access large data objects in high and robust performance. We propose RobuSTore, a storage architecture that combines erasure codes and speculative access mechanisms for parallel write and read in ...
A user-level secure grid file system
Ming Zhao, Renato J. Figueiredo
Article No.: 45
doi>10.1145/1362622.1362683

A grid-wide distributed file system provides convenient data access interfaces that facilitate fine-grained cross-domain data sharing and collaboration. However, existing widely-adopted distributed file systems do not meet the security requirements for ...
Efficient gather and scatter operations on graphics processors
Bingsheng He, Naga K. Govindaraju, Qiong Luo, Burton Smith
Article No.: 46
doi>10.1145/1362622.1362684

Gather and scatter are two fundamental data-parallel operations, where a large number of data items are read (gathered) from or are written (scattered) to given locations. In this paper, we study these two operations on graphics processing units (GPUs). ...
SESSION: Modeling in action
Vladimir Getov
A genetic algorithms approach to modeling the performance of memory-bound computations
Mustafa M Tikir, Laura Carrington, Erich Strohmaier, Allan Snavely
Article No.: 47
doi>10.1145/1362622.1362686

Benchmarks that measure memory bandwidth, such as STREAM, Apex-MAPS and MultiMAPS, are increasingly popular due to the "Von Neumann" bottleneck of modern processors which causes many calculations to be memory-bound. We present a scheme for predicting ...
Performance under failures of high-end computing
Ming Wu, Xian-He Sun, Hui Jin
Article No.: 48
doi>10.1145/1362622.1362687

Modern high-end computers are unprecedentedly complex. Occurrence of faults is an inevitable fact in solving large-scale applications on future Petaflop machines. Many methods have been proposed in recent years to mask faults. These methods, however, ...
Bounding energy consumption in large-scale MPI programs
Barry Rountree, David K. Lowenthal, Shelby Funk, Vincent W. Freeh, Bronis R. de Supinski, Martin Schulz
Article No.: 49
doi>10.1145/1362622.1362688

Power is now a first-order design constraint in large-scale parallel computing. Used carefully, dynamic voltage scaling can execute parts of a program at a slower CPU speed to achieve energy savings with a relatively small (possibly zero) time delay. ...
SESSION: Performance optimization
Derek Chiou
Application development on hybrid systems
Roger D. Chamberlain, Mark A. Franklin, Eric J. Tyson, Jeremy Buhler, Saurabh Gayen, Patrick Crowley, James H. Buckley
Article No.: 50
doi>10.1145/1362622.1362690

Hybrid systems consisting of a multitude of different computing device types are interesting targets for high-performance applications. Chip multiprocessors, FPGAs, DSPs, and GPUs can be readily put together into a hybrid system; however, it is not at ...
Multi-level tiling: M for the price of one
DaeGon Kim, Lakshminarayanan Renganarayanan, Dave Rostron, Sanjay Rajopadhye, Michelle Mills Strout
Article No.: 51
doi>10.1145/1362622.1362691

Tiling is a widely used loop transformation for exposing/exploiting parallelism and data locality. High-performance implementations use multiple levels of tiling to exploit the hierarchy of parallelism and cache/register locality. Efficient generation ...
Implementation and performance analysis of non-blocking collective operations for MPI
Torsten Hoefler, Andrew Lumsdaine, Wolfgang Rehm
Article No.: 52
doi>10.1145/1362622.1362692

Collective operations and non-blocking point-to-point operations have always been part of MPI. Although non-blocking collective operations are an obvious extension to MPI, there have been no comprehensive studies of this functionality. In this paper ...
SESSION: Scheduling
Greg Bronevetsky
Efficient operating system scheduling for performance-asymmetric multi-core architectures
Tong Li, Dan Baumberger, David A. Koufaty, Scott Hahn
Article No.: 53
doi>10.1145/1362622.1362694
Full text: PDF

Recent research advocates asymmetric multi-core architectures, where cores in the same processor can have different performance. These architectures support single-threaded performance and multithreaded throughput at lower costs (e.g., die size and power). ...
A job scheduling framework for large computing farms
Gabriele Capannini, Ranieri Baraglia, Diego Puppin, Laura Ricci, Marco Pasquali
Article No.: 54
doi>10.1145/1362622.1362695
Full text: PDF

In this paper, we propose a new method, called Convergent Scheduling, for scheduling a continuous stream of batch jobs on the machines of large-scale computing farms. This method exploits a set of heuristics that guide the scheduler in making decisions. ...
Optimizing center performance through coordinated data staging, scheduling and recovery
Zhe Zhang, Chao Wang, Sudharshan S. Vazhkudai, Xiaosong Ma, Gregory G. Pike, John W. Cobb, Frank Mueller
Article No.: 55
doi>10.1145/1362622.1362696
Full text: PDF

Procurement and the optimized utilization of Petascale supercomputers and centers is a renewed national priority. Sustained performance and availability of such large centers is a key technical challenge significantly impacting their usability. Storage ...
SESSION: Gordon Bell prize finalists
David H. Bailey
A 281 Tflops calculation for X-ray protein structure analysis with special-purpose computers MDGRAPE-3
Yousuke Ohno, Eiji Nishibori, Tetsu Narumi, Takahiro Koishi, Tahir H. Tahirov, Hideo Ago, Masashi Miyano, Ryutaro Himeno, Toshikazu Ebisuzaki, Makoto Sakata, Makoto Taiji
Article No.: 56
doi>10.1145/1362622.1362698
Full text: PDF

We have achieved a sustained calculation speed of 281 Tflops for the optimization of the 3-D structures of proteins from the X-ray experimental data by the Genetic Algorithm - Direct Space (GA-DS) method. In this calculation we used MDGRAPE-3, special-purpose ...
First-principles calculations of large-scale semiconductor systems on the Earth Simulator
Takahisa Ohno, Takenori Yamamoto, Tatsunobu Kokubo, Akira Azami, Yuta Sakaguchi, Tsuyoshi Uda, Takahiro Yamasaki, Daisuke Fukata, Junichiro Koga
Article No.: 57
doi>10.1145/1362622.1362699
Full text: PDF

First-principles simulations of large-scale semiconductor systems using the PHASE code on the Earth Simulator (ES) demonstrate high performance with respect to the theoretical peak performance. PHASE, designed for vector-parallel systems like the ES, ...
Extending stability beyond CPU millennium: a micron-scale atomistic simulation of Kelvin-Helmholtz instability
J. N. Glosli, D. F. Richards, K. J. Caspersen, R. E. Rudd, J. A. Gunnels, F. H. Streitz
Article No.: 58
doi>10.1145/1362622.1362700
Full text: PDF

We report the computational advances that have enabled the first micron-scale simulation of a Kelvin-Helmholtz (KH) instability using molecular dynamics (MD). The advances are in three key areas for massively parallel computation such as on BlueGene/L ...
WRF nature run
John Michalakes, Josh Hacker, Richard Loft, Michael O. McCracken, Allan Snavely, Nicholas J. Wright, Tom Spelce, Brent Gorda, Robert Walkup
Article No.: 59
doi>10.1145/1362622.1362701
Full text: PDF

The Weather Research and Forecast (WRF) model is a limited-area model of the atmosphere for mesoscale research and operational numerical weather prediction (NWP). A petascale problem is a WRF nature run that provides very high-resolution "truth" against ...

The ACM Digital Library is published by the Association for Computing Machinery. Copyright © 2016 ACM, Inc.