SESSION: Keynote address

Programming bits and atoms
Neil Gershenfeld
Article No.: 1
DOI: 10.1145/1362622.1362624

SESSION: Computational biology
Session chair: Ann L. Chervenak

A preliminary investigation of a neocortex model implementation on the Cray XD1
Kenneth L. Rice, Christopher N. Vutsinas, Tarek M. Taha
Article No.: 2
DOI: 10.1145/1362622.1362626
In this paper we study the acceleration of a new class of cognitive processing applications based on the structure of the neocortex. Specifically, we examine the speedup of a visual cortex model for image recognition. We propose techniques to accelerate the application on general-purpose processors and on reconfigurable logic. We present implementations of our approach on a Cray XD1 and compare the performance potential of scaling the design utilizing reconfigurable-logic-based acceleration to a software-only design. Our results indicate that acceleration using reconfigurable logic can provide a significant speedup over a software-only implementation.

Anatomy of a cortical simulator
Rajagopal Ananthanarayanan, Dharmendra S. Modha
Article No.: 3
DOI: 10.1145/1362622.1362627
Insights into the brain's high-level computational principles will lead to novel cognitive systems, computing architectures, programming paradigms, and numerous practical applications. An important step towards this end is the study of large networks of cortical spiking neurons. We have built a cortical simulator, C2, incorporating several algorithmic enhancements to optimize simulation scale and time: computationally efficient simulation of neurons in a clock-driven and synapses in an event-driven fashion; memory-efficient representation of simulation state; and communication-efficient message exchanges. Using phenomenological, single-compartment models of spiking neurons and synapses with spike-timing dependent plasticity, we represented a rat-scale cortical model (55 million neurons, 442 billion synapses) in 8 TB memory of a 32,768-processor BlueGene/L. With 1 millisecond resolution for neuronal dynamics and 1--20 millisecond axonal delays, C2 can simulate 1 second of model time in 9 seconds per Hertz of average neuronal firing rate. In summary, by combining state-of-the-art hardware with innovative algorithms and software design, we simultaneously achieved unprecedented time-to-solution on an unprecedented problem size.
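The clock-driven-neuron / event-driven-synapse split mentioned in the abstract can be sketched in a few lines. This toy leaky integrate-and-fire loop is purely illustrative: the dynamics, constants, and names are assumptions for the sketch, not C2's actual models.

```python
import heapq

DT = 1.0  # 1 ms clock step for the neuronal dynamics

class Neuron:
    def __init__(self):
        self.v = 0.0
        self.targets = []  # (target_index, synaptic_weight, axonal_delay_ms)

    def step(self):
        """Clock-driven update: leak, then fire if over threshold."""
        self.v *= 0.9
        if self.v >= 1.0:
            self.v = 0.0
            return True
        return False

def simulate(neurons, t_end):
    events = []  # min-heap of (delivery_time, target_index, weight)
    t = 0.0
    while t < t_end:
        # Event-driven synapses: only spikes due at this step are touched.
        while events and events[0][0] <= t:
            _, tgt, w = heapq.heappop(events)
            neurons[tgt].v += w
        # Clock-driven neurons: every neuron is updated every step.
        for n in neurons:
            if n.step():
                for tgt, w, delay in n.targets:
                    heapq.heappush(events, (t + delay, tgt, w))
        t += DT

# Two neurons: neuron 0 starts above threshold and excites neuron 1
# through a synapse with a 2 ms axonal delay.
ns = [Neuron(), Neuron()]
ns[0].targets = [(1, 0.8, 2.0)]
ns[0].v = 1.5
simulate(ns, 10.0)
print(round(ns[1].v, 3))  # prints 0.344
```

The point of the split is cost: neurons are few and dense in activity, so they are stepped every tick, while synaptic deliveries are sparse and handled only when a spike is actually due.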

Large-scale maximum likelihood-based phylogenetic analysis on the IBM BlueGene/L
Michael Ott, Jaroslaw Zola, Alexandros Stamatakis, Srinivas Aluru
Article No.: 4
DOI: 10.1145/1362622.1362628
Phylogenetic inference is a grand challenge in Bioinformatics due to immense computational requirements. The increasing popularity of multi-gene alignments in biological studies, which typically provide a stable topological signal due to a more favorable ratio of the number of base pairs to the number of sequences, coupled with rapid accumulation of sequence data in general, poses new challenges for high performance computing. In this paper, we demonstrate how state-of-the-art Maximum Likelihood (ML) programs can be efficiently scaled to the IBM BlueGene/L (BG/L) architecture, by porting RAxML, which is currently among the fastest and most accurate programs for phylogenetic inference under the ML criterion. We simultaneously exploit coarse-grained and fine-grained parallelism that is inherent in every ML-based biological analysis. Performance is assessed using datasets consisting of 212 sequences and 566,470 base pairs, and 2,182 sequences and 51,089 base pairs, respectively. To the best of our knowledge, these are the largest datasets analyzed under ML to date. The capability to analyze such datasets will help to address novel biological questions via phylogenetic analyses. Our experimental results indicate that the fine-grained parallelization scales well up to 1,024 processors. Moreover, a larger number of processors can be efficiently exploited by a combination of coarse-grained and fine-grained parallelism. Finally, we demonstrate that our parallelization scales equally well on an AMD Opteron cluster with a less favorable network latency to processor speed ratio. We recorded super-linear speedups in several cases due to increased cache efficiency.

SESSION: Network switching and routing
Session chair: Keith Underwood

Age-based packet arbitration in large-radix k-ary n-cubes
Dennis Abts, Deborah Weisser
Article No.: 5
DOI: 10.1145/1362622.1362630
As applications scale to increasingly large processor counts, the interconnection network is frequently the limiting factor in application performance. In order to achieve application scalability, the interconnect must maintain high bandwidth while minimizing variation in packet latency. As the offered load in the network increases with growing problem sizes and processor counts, so does the expected maximum packet latency in the network, directly impacting performance of applications with any synchronized communication. Age-based packet arbitration reduces the variance in packet latency as well as average latency. This paper describes the Cray XT router packet aging algorithm which allows globally fair arbitration by incorporating "age" in the packet output arbitration. We describe the parameters of the aging algorithm and how to arrive at appropriate settings. We show that an efficient aging algorithm reduces both the average packet latency and the variance in packet latency on communication-intensive benchmarks.
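The core idea of age-based output arbitration can be shown in a minimal sketch: among the packets competing for an output port, the one injected earliest (the oldest) wins. This is an illustration of the general principle only; the Cray XT's actual algorithm additionally bounds and encodes age in the packet header, which the sketch does not model.

```python
from dataclasses import dataclass, field
import itertools

_seq = itertools.count()

@dataclass(order=True)
class Packet:
    injected_at: int  # cycle at which the packet entered the network
    seq: int = field(default_factory=lambda: next(_seq))  # FIFO tie-break
    payload: str = field(default="", compare=False)

def arbitrate(candidates):
    """Grant the output port to the oldest competing packet."""
    return min(candidates)  # smallest injection timestamp == greatest age

inputs = [Packet(30, payload="C"), Packet(5, payload="A"), Packet(12, payload="B")]
winner = arbitrate(inputs)
print(winner.payload)  # prints A  (oldest packet, injected at cycle 5)
```

Compared with round-robin, this global criterion keeps a packet's wait bounded by the ages of the packets ahead of it, which is why it shrinks the tail of the latency distribution as well as the mean.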

Performance adaptive power-aware reconfigurable optical interconnects for high-performance computing (HPC) systems
Avinash Kodi, Ahmed Louri
Article No.: 6
DOI: 10.1145/1362622.1362631
As communication distances and bit rates increase, optoelectronic interconnects are being deployed to build high-bandwidth, low-latency interconnection networks for high performance computing (HPC) systems. While bandwidth scaling with efficient multiplexing techniques (wavelength, time, and space) is available, static assignment of wavelengths can be detrimental to network performance for non-uniform (adversarial) workloads. Dynamic bandwidth re-allocation based on the actual traffic pattern can improve network performance by utilizing idle resources. While dynamic bandwidth re-allocation (DBR) techniques can alleviate interconnection bottlenecks, power consumption also increases considerably. In this paper, we propose to improve the performance of optical interconnects using DBR techniques while simultaneously optimizing power consumption using Dynamic Power Management (DPM) techniques: DBR re-allocates idle channels (wavelengths) to busy channels to improve throughput, and DPM regulates the bit rates and supply voltages of the individual channels. A reconfigurable opto-electronic architecture and a performance-adaptive algorithm for implementing DBR and DPM are proposed in this paper. Our proposed reconfiguration algorithm achieves a significant reduction in power consumption and a considerable improvement in throughput with a marginal increase in latency for various traffic patterns.

Evaluating network information models on resource efficiency and application performance in lambda-grids
Nut Taesombut, Andrew A. Chien
Article No.: 7
DOI: 10.1145/1362622.1362632
A critical challenge for wide-area configurable networks is the definition and widespread acceptance of a Network Information Model (NIM). When a network comprises multiple domains, intelligent information sharing is required for a provider to maintain a competitive advantage and for customers to use a provider's network and make good resource selection decisions. We characterize the information that can be shared between domains and propose a spectrum of network information models. To evaluate the impact of the proposed models, we use trace-driven simulation over a range of real providers' networks and assess how the available information affects applications' and providers' ability to utilize network resources. We find that domain topology information is crucial for achieving good resource efficiency, low application latency, and low network configuration cost, while domain link-state information contributes to better resource utilization and system throughput. These results suggest that collaboration between service providers can provide better overall network productivity.

SESSION: System performance
Session chair: Bronis R. de Supinski

Using MPI file caching to improve parallel write performance for large-scale scientific applications
Wei-keng Liao, Avery Ching, Kenin Coloma, Arifa Nisar, Alok Choudhary, Jacqueline Chen, Ramanan Sankaran, Scott Klasky
Article No.: 8
DOI: 10.1145/1362622.1362634
Typical large-scale scientific applications periodically write checkpoint files to save the computational state throughout execution. Existing parallel file systems improve such write-only I/O patterns through the use of client-side file caching and write-behind strategies. In distributed environments where files are rarely accessed by more than one client concurrently, file caching has achieved significant success; however, in parallel applications where multiple clients manipulate a shared file, cache coherence control can serialize I/O. We have designed a thread-based caching layer for the MPI I/O library, which adds a portable caching system closer to user applications so that more information about the application's I/O patterns is available for better coherence control. We demonstrate the impact of our caching solution on parallel write performance with a comprehensive evaluation that includes a set of widely used I/O benchmarks and production application I/O kernels.

Virtual machine aware communication libraries for high performance computing
Wei Huang, Matthew J. Koop, Qi Gao, Dhabaleswar K. Panda
Article No.: 9
DOI: 10.1145/1362622.1362635
As the size and complexity of modern computing systems keep increasing to meet the demanding requirements of High Performance Computing (HPC) applications, manageability is becoming a critical concern for achieving both high performance and high productivity computing. Meanwhile, virtual machine (VM) technologies have become popular in both industry and academia due to various features designed to ease system management and administration. While a VM-based environment can greatly help manageability on large-scale computing systems, concerns over performance have largely blocked the HPC community from embracing VM technologies. In this paper, we follow three steps to demonstrate the ability to achieve near-native performance in a VM-based environment for HPC. First, we propose Inter-VM Communication (IVC), a VM-aware communication library to support efficient shared-memory communication among computing processes on the same physical host, even though they may be in different VMs. This is critical for multi-core systems, especially when individual computing processes are hosted on different VMs to achieve fine-grained control. Second, we design a VM-aware MPI library based on MVAPICH2 (a popular MPI library), called MVAPICH2-ivc, which allows HPC MPI applications to transparently benefit from IVC. Finally, we evaluate MVAPICH2-ivc on clusters featuring multi-core systems and high-performance InfiniBand interconnects. Our evaluation demonstrates that MVAPICH2-ivc can improve NAS Parallel Benchmark performance by up to 11% in a VM-based environment on eight-core Intel Clovertown systems, where each compute process is in a separate VM. A detailed performance evaluation for up to 128 processes (64 nodes with dual-socket single-core systems) shows only a marginal performance overhead of MVAPICH2-ivc compared with MVAPICH2 running in a native environment. This study indicates that performance should no longer be a barrier preventing HPC environments from taking advantage of the various features available through VM technologies.

Investigation of leading HPC I/O performance using a scientific-application derived benchmark
Julian Borrill, Leonid Oliker, John Shalf, Hongzhang Shan
Article No.: 10
DOI: 10.1145/1362622.1362636
With the exponential growth of high-fidelity sensor and simulated data, the scientific community is increasingly reliant on ultrascale HPC resources to handle their data analysis requirements. However, to utilize such extreme computing power effectively, the I/O components must be designed in a balanced fashion, as any architectural bottleneck will quickly render the platform intolerably inefficient. To understand I/O performance of data-intensive applications in realistic computational settings, we develop a lightweight, portable benchmark called MADbench2, which is derived directly from a large-scale Cosmic Microwave Background (CMB) data analysis package. Our study represents one of the most comprehensive I/O analyses of modern parallel filesystems, examining a broad range of system architectures and configurations, including Lustre on the Cray XT3 and Intel Itanium2 cluster; GPFS on IBM Power5 and AMD Opteron platforms; two BlueGene/L installations utilizing GPFS and PVFS2 filesystems; and CXFS on the SGI Altix3700. We present extensive synchronous I/O performance data comparing a number of key parameters including concurrency, POSIX- versus MPI-IO, and unique- versus shared-file accesses, using both the default environment as well as highly-tuned I/O parameters. Finally, we explore the potential of asynchronous I/O and quantify the volume of computation required to hide a given volume of I/O. Overall our study quantifies the vast differences in performance and functionality of parallel filesystems across state-of-the-art platforms, while providing system designers and computational scientists a lightweight tool for conducting further analyses.

SESSION: Grid scheduling
Session chair: Satoshi Matsuoka

Automatic resource specification generation for resource selection
Richard Huang, Henri Casanova, Andrew A. Chien
Article No.: 11
DOI: 10.1145/1362622.1362638
With an increasing number of available resources in large-scale distributed environments, a key challenge is resource selection. Fortunately, several middleware systems provide resource selection services. However, a user is still faced with a difficult question: "What should I ask for?" Since most users end up using naïve and suboptimal resource specifications, we propose an automated way to answer this question. We present an empirical model that given a workflow application (DAG-structured) generates an appropriate resource specification, including number of resources, the range of clock rates among the resources, and network connectivity. The model employs application structure information as well as an optional utility function that trades off cost and performance. With extensive simulation experiments for different types of applications, resource conditions, and scheduling heuristics, we show that our model leads consistently to close to optimal application performance and often reduces resource usage.

Performance and cost optimization for multiple large-scale grid workflow applications
Rubing Duan, Radu Prodan, Thomas Fahringer
Article No.: 12
DOI: 10.1145/1362622.1362639
Scheduling large-scale applications on the Grid is a fundamental challenge and is critical to application performance and cost. Large-scale applications typically contain a large number of homogeneous and concurrent activities, which are the main bottlenecks but open great potential for optimization. This paper presents a new formulation of these well-known NP-complete problems and two novel algorithms that address them. The optimization problems are formulated as sequential cooperative games among workflow managers. Experimental results indicate that we have devised and implemented a group of effective, efficient, and feasible approaches that produce solutions of significantly better performance and cost than traditional algorithms. Our algorithms have considerably low time complexity and can assign 1,000,000 activities to 10,000 processors within 0.4 seconds on one Opteron processor. Moreover, the solutions can be practically executed by workflow managers, and violations of QoS can be easily detected, which is critical to fault tolerance.

Inter-operating grids through delegated matchmaking
Alexandru Iosup, Dick H. J. Epema, Todd Tannenbaum, Matthew Farrellee, Miron Livny
Article No.: 13
DOI: 10.1145/1362622.1362640
The grid vision of a single computing utility has yet to materialize: while many grids with thousands of processors each exist, most work in isolation. An important obstacle for the effective and efficient inter-operation of grids is the problem of resource selection. In this paper we propose a solution to this problem that combines the hierarchical and decentralized approaches for interconnecting grids. In our solution, a hierarchy of grid sites is augmented with peer-to-peer connections between sites under the same administrative control. To operate this architecture, we employ the key concept of delegated matchmaking, which temporarily binds resources from remote sites to the local environment. With trace-based simulations we evaluate our solution under various infrastructural and load conditions, and we show that it outperforms other approaches to inter-operating grids. Specifically, we show that delegated matchmaking achieves up to 60% more goodput and completes 26% more jobs than its best alternative.

SESSION: Security and fault tolerance
Session chair: Karsten Schwan

Automatic software interference detection in parallel applications
Vahid Tabatabaee, Jeffrey K. Hollingsworth
Article No.: 14
DOI: 10.1145/1362622.1362642
We present an automated software interference detection methodology for Single Program, Multiple Data (SPMD) parallel applications. Interference comes from the system and from unexpected processes; if not detected and corrected, such interference may result in performance degradation. Our goal is to provide a reliable metric for software interference that can be used in soft-failure protection and recovery systems. A unique feature of our algorithm is that we measure the relative timing of application events (i.e., time between MPI calls) rather than system-level events such as CPU utilization. This approach lets our system automatically accommodate natural variations in an application's utilization of resources. We use performance irregularities and degradation as signs of software interference. However, instead of relying on temporal changes in performance, our system detects spatial performance degradation across multiple processors. We also include a case study that demonstrates our technique's effectiveness, resilience, and robustness.

DMTracker: finding bugs in large-scale parallel programs by detecting anomaly in data movements
Qi Gao, Feng Qin, Dhabaleswar K. Panda
Article No.: 15
DOI: 10.1145/1362622.1362643
While software reliability in large-scale systems becomes increasingly important, debugging in large-scale parallel systems remains a daunting task. This paper proposes an innovative technique that automatically finds hard-to-detect software bugs, which can cause severe problems such as data corruption and deadlocks in parallel programs, by detecting their abnormal behaviors in data movements. Based on the observation that data movements in parallel programs typically follow certain patterns, our idea is to extract data movement (DM)-based invariants at program runtime and check for violations of these invariants. These violations indicate potential bugs such as data races and memory corruption bugs that manifest themselves in data movements. We have built a tool, called DMTracker, based on the above idea: it automatically extracts DM-based invariants and detects violations of them. Our experiments with two real-world bug cases in MVAPICH/MVAPICH2, a popular MPI library, have shown that DMTracker can effectively detect them and report abnormal data movements to help programmers quickly diagnose the root causes of bugs. In addition, DMTracker incurs very low runtime overhead, from 0.9% to 6.0%, in our experiments with High Performance Linpack (HPL) and the NAS Parallel Benchmarks (NPB), which indicates that DMTracker can be deployed in production runs.
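The invariant-extract-then-check idea can be illustrated with a toy: record which (source rank, destination rank) transfers recur during known-good iterations, then flag transfers that fall outside the learned set. DMTracker's real invariants (frequent-chain and frequent-distribution over data movements) are considerably richer; everything below is a simplified assumption for illustration.

```python
from collections import Counter

def learn_invariants(trace, min_support=2):
    """Learn the set of (src, dst) pairs that recur at least min_support times."""
    counts = Counter((src, dst) for src, dst, _size in trace)
    return {pair for pair, n in counts.items() if n >= min_support}

def detect_anomalies(trace, invariants):
    """Report transfers whose (src, dst) pair violates the learned invariants."""
    return [(src, dst, size) for src, dst, size in trace
            if (src, dst) not in invariants]

# Training trace: ranks 0..3 send to their right neighbor every iteration.
training = [(r, (r + 1) % 4, 1024) for r in range(4) for _ in range(3)]
invariants = learn_invariants(training)

# A later run where one transfer breaks the neighbor-exchange pattern.
run = [(0, 1, 1024), (1, 2, 1024), (2, 0, 1024)]
print(detect_anomalies(run, invariants))  # -> [(2, 0, 1024)]
```

The anomalous transfer is not itself the bug; as in the paper, it is a symptom that points the programmer at the code path (e.g. a race or memory corruption) that produced the irregular movement.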

Scalable security for petascale parallel file systems
Andrew W. Leung, Ethan L. Miller, Stephanie Jones
Article No.: 16
DOI: 10.1145/1362622.1362644
Petascale, high-performance file systems often hold sensitive data and thus require security, but authentication and authorization can dramatically reduce performance. Existing security solutions perform poorly in these environments because they cannot scale with the number of nodes, highly distributed data, and demanding workloads. To address these issues, we developed Maat, a security protocol designed to provide strong, scalable security to these systems. Maat introduces three new techniques. Extended capabilities limit the number of capabilities needed by allowing a capability to authorize I/O for any number of client-file pairs. Automatic revocation uses short capability lifetimes to allow capability expiration to act as global revocation, while supporting non-revoked capability renewal. Secure delegation allows clients to securely act on behalf of a group to open files and distribute access, facilitating secure joint computations. Experiments on the Maat prototype in the Ceph petascale file system show an overhead of as little as 6--7%.

SESSION: System architecture
Session chair: John B. Carter

The Cray BlackWidow: a highly scalable vector multiprocessor
Dennis Abts, Abdulla Bataineh, Steve Scott, Greg Faanes, Jim Schwarzmeier, Eric Lundberg, Tim Johnson, Mike Bye, Gerald Schwoerer
Article No.: 17
DOI: 10.1145/1362622.1362646
This paper describes the system architecture of the Cray BlackWidow scalable vector multiprocessor. The BlackWidow system is a distributed shared memory (DSM) architecture that is scalable to 32K processors, each with a 4-way dispatch scalar execution unit and an 8-pipe vector unit capable of 20.8 Gflops for 64-bit operations and 41.6 Gflops for 32-bit operations at the prototype operating frequency of 1.3 GHz. Global memory is directly accessible with processor loads and stores and is globally coherent. The system supports thousands of outstanding references to hide remote memory latencies, and provides a rich suite of built-in synchronization primitives. Each BlackWidow node is implemented as a 4-way SMP with up to 128 Gbytes of DDR2 main memory capacity. The system supports common programming models such as MPI and OpenMP, as well as global address space languages such as UPC and CAF. We describe the system architecture and microarchitecture of the processor, memory controller, and router chips. We give preliminary performance results and discuss design tradeoffs.

GRAPE-DR: 2-Pflops massively-parallel computer with 512-core, 512-Gflops processor chips for scientific computing
Junichiro Makino, Kei Hiraki, Mary Inaba
Article No.: 18
DOI: 10.1145/1362622.1362647
We describe the GRAPE-DR (Greatly Reduced Array of Processor Elements with Data Reduction) system, which will consist of 4096 processor chips, each with 512 cores operating at a clock frequency of 500 MHz. The peak speed of a processor chip is 512 Gflops (single precision) or 256 Gflops (double precision). The GRAPE-DR chip works as an attached processor to standard PCs. Currently, a PCI-X board with a single GRAPE-DR chip is in operation. We are developing a 4-chip board with a PCI-Express interface, which will have a peak performance of 1 Tflops. The final system will be a cluster of 512 PCs, each with two GRAPE-DR boards. We plan to complete the final system by early 2009. The application area of GRAPE-DR covers particle-based simulations such as astrophysical many-body simulations and molecular-dynamics simulations, quantum chemistry calculations, various applications which require dense matrix operations, and many other compute-intensive applications.

A case for low-complexity MP architectures
Håkan Zeffer, Erik Hagersten
Article No.: 19
DOI: 10.1145/1362622.1362648
Advances in semiconductor technology have driven shared-memory servers toward processors with multiple cores per die and multiple threads per core. This paper presents simple hardware primitives enabling flexible and low-complexity multi-chip designs supporting an efficient inter-node coherence protocol implemented in software. We argue that our primitives and the example design presented in this paper have lower hardware overhead, have easier (and later) verification requirements, and provide the opportunity for flexible coherence protocols and simpler protocol bug corrections than traditional designs. Our evaluation is based on detailed full-system simulations of modern chip-multiprocessors and both commercial and HPC workloads. We compare a low-complexity system based on the proposed primitives with aggressive hardware multi-chip shared-memory systems and show that the performance is competitive across a large design space.

SESSION: Microarchitecture
Session chair: Dennis Abts

Variable latency caches for nanoscale processor
Serkan Ozdemir, Arindam Mallik, Ja Chun Ku, Gokhan Memik, Yehea Ismail
Article No.: 20
DOI: 10.1145/1362622.1362650
Variability is one of the important issues in nanoscale processors. Due to the increasing importance of interconnect structures in submicron technologies, physical location and phenomena such as coupling have an increasing impact on the latency of operations. Therefore, the traditional view of rigid access latencies to components will result in suboptimal architectures. In this paper, we devise a cache architecture with variable access latency. Particularly, we a) develop a non-uniform access level 1 data cache, b) study the impact of coupling and physical location on level 1 data cache access latencies, and c) develop and study an architecture where the variable latency cache can be accessed while the rest of the pipeline remains synchronous. To find the access latency under different input address transitions and environmental conditions, we first build a SPICE model at a 45nm technology for a cache similar to the level 1 data cache of the Intel Prescott architecture. Motivated by the large difference between the worst- and best-case latencies and the shape of the distribution curve, we change the cache architecture to allow variable latency accesses. Since the latency of the cache is not known at the time of instruction scheduling, we also modify the functional units with the addition of special queues that temporarily store dependent instructions and allow data to be forwarded from the cache to the functional units correctly. Simulations based on SPEC2000 benchmarks show that our variable access latency cache structure can reduce execution time by as much as 19.4%, and by 10.7% on average, compared to a conventional cache architecture.
|
|
|
Data access history cache and associated data prefetching mechanisms |
| |
Yong Chen,
Surendra Byna,
Xian-He Sun
|
|
Article No.: 21 |
|
doi>10.1145/1362622.1362651 |
|
Full text: PDF
|
|
Data prefetching is an effective way to bridge the increasing performance gap between processor and memory. As computing power is increasing much faster than memory performance, we suggest that it is time to have a dedicated cache to store data access histories and to serve prefetching to mask data access latency effectively. We thus propose a new cache structure, named Data Access History Cache (DAHC), and study its associated prefetching mechanisms. The DAHC behaves as a cache for recent reference information instead of as a traditional cache for instructions or data. Theoretically, it is capable of supporting many well known history-based prefetching algorithms, especially adaptive and aggressive approaches. We have carried out simulation experiments to validate DAHC design and DAHC-based data prefetching methodologies and to demonstrate performance gains. The DAHC provides a practical approach to reaping data prefetching benefits and its associated prefetching mechanisms are proven more effective than traditional approaches.
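A flavor of the history-based prefetching algorithms a DAHC is meant to serve can be sketched as a per-instruction stride predictor. This is a generic illustration, not the paper's DAHC design; the class and method names are invented for the example:

```python
# Minimal sketch of history-based stride prefetching (illustrative only).

class StridePrefetcher:
    """Tracks the last address and stride seen per instruction (PC)."""

    def __init__(self):
        self.history = {}  # pc -> (last_addr, last_stride)

    def access(self, pc, addr):
        """Record an access; return a predicted prefetch address or None."""
        last = self.history.get(pc)
        prediction = None
        if last is not None:
            last_addr, last_stride = last
            stride = addr - last_addr
            # Issue a prefetch only when the stride repeats (confidence check).
            if stride == last_stride and stride != 0:
                prediction = addr + stride
            self.history[pc] = (addr, stride)
        else:
            self.history[pc] = (addr, 0)
        return prediction
```

A real history cache would bound the table size and evict old entries; the dictionary here grows without limit.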
|
|
|
Scaling performance of interior-point method on large-scale chip multiprocessor system |
| |
Mikhail Smelyanskiy,
Victor W Lee,
Daehyun Kim,
Anthony D Nguyen,
Pradeep Dubey
|
|
Article No.: 22 |
|
doi>10.1145/1362622.1362652 |
|
Full text: PDF
|
|
In this paper we describe parallelization of interior-point method (IPM) aimed at achieving high scalability on large-scale chip-multiprocessors (CMPs). IPM is an important computational technique used to solve optimization problems in many areas of science, engineering and finance. IPM spends most of its computation time in a few sparse linear algebra kernels. While each of these kernels contains a large amount of parallelism, sparse irregular datasets seen in many optimization problems make parallelism difficult to exploit. As a result, most researchers have shown only a relatively low scalability of 4X-12X on medium to large scale parallel machines. This paper proposes and evaluates several algorithmic and hardware features to improve IPM parallel performance on large-scale CMPs. Through detailed simulations, we demonstrate how exploring multiple levels of parallelism with hardware support for low overhead task queues and parallel reduction enables IPM to achieve up to 48X parallel speedup on a 64-core CMP.
|
|
|
SESSION: PDE applications |
| |
Omar Ghattas
|
|
|
|
|
Data exploration of turbulence simulations using a database cluster |
| |
Eric Perlman,
Randal Burns,
Yi Li,
Charles Meneveau
|
|
Article No.: 23 |
|
doi>10.1145/1362622.1362654 |
|
Full text: PDF
|
|
We describe a new environment for the exploration of turbulent flows that uses a cluster of databases to store complete histories of Direct Numerical Simulation (DNS) results. This allows for spatial and temporal exploration of high-resolution data that were traditionally too large to store and too computationally expensive to produce on demand. We perform analysis of these data directly on the database nodes, which minimizes the volume of network traffic. The low network demands enable us to provide public access to this experimental platform and its datasets through Web services. This paper details the system design and implementation. Specifically, we focus on hierarchical spatial indexing, cache-sensitive spatial scheduling of batch workloads, localizing computation through data partitioning, and load balancing techniques that minimize data movement. We provide real examples of how scientists use the system to perform high-resolution turbulence research from standard desktop computing environments.
|
|
|
Parallel hierarchical visualization of large time-varying 3D vector fields |
| |
Hongfeng Yu,
Chaoli Wang,
Kwan-Liu Ma
|
|
Article No.: 24 |
|
doi>10.1145/1362622.1362655 |
|
Full text: PDF
|
|
We present the design of a scalable parallel pathline construction method for visualizing large time-varying 3D vector fields. A 4D (i.e., time and the 3D spatial domain) representation of the vector field is introduced to make a time-accurate depiction of the flow field. This representation also allows us to obtain pathlines through streamline tracing in the 4D space. Furthermore, a hierarchical representation of the 4D vector field, constructed by clustering the 4D field, makes possible interactive visualization of the flow field at different levels of abstraction. Based on this hierarchical representation, a data partitioning scheme is designed to achieve high parallel efficiency. We demonstrate the performance of parallel pathline visualization using data sets obtained from terascale flow simulations. This new capability will enable scientists to study their time-varying vector fields at the resolution and interactivity previously unavailable to them.
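Pathline construction itself reduces to numerically integrating particle positions through the time-varying velocity field. A minimal serial sketch with explicit Euler stepping; the paper's 4D hierarchical representation and parallelization are not shown, and `velocity` is a hypothetical callback:

```python
def trace_pathline(velocity, seed, t0, t1, dt):
    """Integrate a pathline with explicit Euler steps.

    `velocity(p, t)` returns the velocity vector at position p and time t;
    production tracers would use a higher-order integrator (e.g. RK4).
    """
    p = list(seed)
    t = t0
    path = [(t, tuple(p))]
    while t < t1:
        v = velocity(p, t)
        p = [pi + vi * dt for pi, vi in zip(p, v)]
        t += dt
        path.append((t, tuple(p)))
    return path
```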
|
|
|
Low-constant parallel algorithms for finite element simulations using linear octrees |
| |
Hari Sundar,
Rahul S. Sampath,
Santi S. Adavani,
Christos Davatzikos,
George Biros
|
|
Article No.: 25 |
|
doi>10.1145/1362622.1362656 |
|
Full text: PDF
|
|
In this article we propose parallel algorithms for the construction of conforming finite-element discretizations on linear octrees. Existing octree-based discretizations scale to billions of elements, but the complexity constants can be high. In our approach we use several techniques to minimize overhead: a novel bottom-up tree-construction and 2:1 balance constraint enforcement; a Golomb-Rice encoding for compression by representing the octree and element connectivity as a Uniquely Decodable Code (UDC); overlapping communication and computation; and byte alignment for cache efficiency. The cost of applying the Laplacian is comparable to that of applying it using a direct indexing regular grid discretization with the same number of elements. Our algorithm has scaled up to four billion octants on 4096 processors on a Cray XT3 at the Pittsburgh Supercomputing Center. The overall tree construction time is under a minute in contrast to previous implementations that required several minutes; the evaluation of the discretization of a variable-coefficient Laplacian takes only a few seconds.
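A linear octree stores only the leaf octants, sorted by their Morton (Z-order) index, which is obtained by interleaving the bits of the octant coordinates. A minimal sketch of that encoding; the paper's compression, balancing, and parallel construction steps are not shown:

```python
def morton_encode(x, y, z, depth):
    """Interleave the low `depth` bits of (x, y, z) into a Morton index.

    Sorting octants by this index yields the linear-octree ordering, which
    keeps spatially nearby octants close together in memory.
    """
    code = 0
    for i in range(depth):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code
```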
|
|
|
SESSION: File systems |
| |
Frank Mueller
|
|
|
|
|
Noncontiguous locking techniques for parallel file systems |
| |
Avery Ching,
Wei-keng Liao,
Alok Choudhary,
Robert Ross,
Lee Ward
|
|
Article No.: 26 |
|
doi>10.1145/1362622.1362658 |
|
Full text: PDF
|
|
Many parallel scientific applications use high-level I/O APIs that offer atomic I/O capabilities. Atomic I/O in current parallel file systems is often slow when multiple processes simultaneously access interleaved, shared files. Current atomic I/O solutions are not optimized for handling noncontiguous access patterns because current locking systems have a fixed file system block-based granularity and do not leverage high-level access pattern information. In this paper we present a hybrid lock protocol that takes advantage of new list and datatype byte-range lock description techniques to enable high performance atomic I/O operations for these challenging access patterns. We implement our scalable distributed lock manager (DLM) in the PVFS parallel file system and show that these techniques improve locking throughput over a naive noncontiguous locking approach by several orders of magnitude in an array of lock-only tests. Additionally, in two scientific I/O benchmarks, we show the benefits of avoiding false sharing with our byte-range granular DLM when compared against a block-based lock system implementation.
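The advantage of byte-range granularity is that conflicts are tested against the exact bytes accessed rather than whole file-system blocks, which eliminates false sharing. A naive sketch of that conflict test, with noncontiguous accesses as hypothetical lists of `(offset, length)` pairs:

```python
def ranges_conflict(a, b):
    """Return True if any byte range in list `a` overlaps one in `b`.

    Ranges are (offset, length) pairs, as in a list-style noncontiguous
    I/O description. Two ranges overlap iff each starts before the other
    ends. A real DLM would index ranges (e.g. interval tree), not scan.
    """
    for off_a, len_a in a:
        for off_b, len_b in b:
            if off_a < off_b + len_b and off_b < off_a + len_a:
                return True
    return False
```

With block-based locks, two processes touching different bytes of the same block would conflict; here `[(0, 4)]` and `[(4, 4)]` do not.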
|
|
|
Integrating parallel file systems with object-based storage devices |
| |
Ananth Devulapalli,
Dennis Dalessandro,
Pete Wyckoff,
Nawab Ali,
P. Sadayappan
|
|
Article No.: 27 |
|
doi>10.1145/1362622.1362659 |
|
Full text: PDF
|
|
As storage systems evolve, the block-based design of today's disks is becoming inadequate. As an alternative, object-based storage devices (OSDs) offer a view where the disk manages data layout and keeps track of various attributes about data objects. By moving functionality that is traditionally the responsibility of the host OS to the disk, it is possible to improve overall performance and simplify management of a storage system. The capabilities of OSDs will also permit performance improvements in parallel file systems, such as further decoupling metadata operations and thus reducing metadata server bottlenecks. In this work we present an implementation of the Parallel Virtual File System (PVFS) integrated with a software emulator of an OSD and describe an infrastructure for client access. Even with the overhead of emulation, performance is comparable to a traditional server-fronted implementation, demonstrating that serverless parallel file systems using OSDs are an achievable goal.
|
|
|
Evaluation of active storage strategies for the Lustre parallel file system |
| |
Juan Piernas,
Jarek Nieplocha,
Evan J. Felix
|
|
Article No.: 28 |
|
doi>10.1145/1362622.1362660 |
|
Full text: PDF
|
|
Active Storage provides an opportunity for reducing the amount of data movement between storage and compute nodes of parallel filesystems such as Lustre and PVFS. It allows certain types of data processing operations to be performed directly on the storage nodes of modern parallel filesystems, near the data they manage. This is possible by exploiting the underutilized processor and memory resources of storage nodes that are implemented using general purpose servers and operating systems. In this paper, we present a novel user-space implementation of Active Storage for Lustre, and compare it to the traditional kernel-based implementation. Based on microbenchmark and application level evaluation, we show that both approaches can reduce the network traffic, and take advantage of the extra computing capacity offered by the storage nodes at the same time. However, our user-space approach has proved to be faster, more flexible, portable, and readily deployable than the kernel-space version.
|
|
|
SESSION: Performance tools and methods |
| |
Bernd Mohr
|
|
|
|
|
The ghost in the machine: observing the effects of kernel operation on parallel application performance |
| |
Aroon Nataraj,
Alan Morris,
Allen D. Malony,
Matthew Sottile,
Pete Beckman
|
|
Article No.: 29 |
|
doi>10.1145/1362622.1362662 |
|
Full text: PDF
|
|
The performance of a parallel application on a scalable HPC system is determined by user-level execution of the application code and system-level (OS kernel) operations. To understand the influences of system-level factors on application performance, the measurement of OS kernel activities is key. We describe a technology to observe kernel actions and make this information available to application-level performance measurement tools. The benefits of merged application and OS performance information and its use in parallel performance analysis are demonstrated, both for profiling and tracing methodologies. In particular, we focus on the problem of kernel noise assessment as a stress test of the approach. We show new results for characterizing noise and introduce new techniques for evaluating noise interference and its effects on application execution. Our kernel measurement and noise analysis technologies are being developed as part of Linux OS environments for scalable parallel systems.
|
|
|
PNMPI tools: a whole lot greater than the sum of their parts |
| |
Martin Schulz,
Bronis R. de Supinski
|
|
Article No.: 30 |
|
doi>10.1145/1362622.1362663 |
|
Full text: PDF
|
|
PNMPI extends the PMPI profiling interface to support multiple concurrent PMPI-based tools by enabling users to assemble tool stacks. We extend this basic concept to include new services for tool interoperability and to switch between tool stacks dynamically. This allows PNMPI to support modules that virtualize MPI execution environments within an MPI job or that restrict the application of existing, unmodified tools to a dynamic subset of MPI calls or even call sites. Further, we extend PNMPI to platforms without dynamic linking, such as BlueGene/L, and we introduce an extended performance model along with experimental data from microbenchmarks to show that the performance overhead on any platform is negligible. More importantly, we provide significant new MPI tool components that are sufficient to compose interesting MPI tools. We present three detailed PNMPI usage scenarios that demonstrate that it significantly simplifies the creation of application-specific tools.
|
|
|
Multi-threading and one-sided communication in parallel LU factorization |
| |
Parry Husbands,
Katherine Yelick
|
|
Article No.: 31 |
|
doi>10.1145/1362622.1362664 |
|
Full text: PDF
|
|
Dense LU factorization has a high ratio of computation to communication and, as evidenced by the High Performance Linpack (HPL) benchmark, this property makes it scale well on most parallel machines. Nevertheless, the standard algorithm for this problem has non-trivial dependence patterns which limit parallelism, and local computations require large matrices in order to achieve good single processor performance. We present an alternative programming model for this type of problem, which combines UPC's global address space with lightweight multithreading. We introduce the concept of memory-constrained lookahead where the amount of concurrency managed by each processor is controlled by the amount of memory available. We implement novel techniques for steering the computation to optimize for high performance and demonstrate the scalability and portability of UPC with Teraflop level performance on some machines, comparing favourably to other state-of-the-art MPI codes.
|
|
|
SESSION: Grid management |
| |
Philip M. Papadopoulos
|
|
|
|
|
Workstation capacity tuning using reinforcement learning |
| |
Aharon Bar-Hillel,
Amir Di-Nur,
Liat Ein-Dor,
Ran Gilad-Bachrach,
Yossi Ittach
|
|
Article No.: 32 |
|
doi>10.1145/1362622.1362666 |
|
Full text: PDF
|
|
Computer grids are complex, heterogeneous, and dynamic systems, whose behavior is governed by hundreds of manually-tuned parameters. As the complexity of these systems grows, automating the procedure of parameter tuning becomes indispensable. In this paper, we consider the problem of auto-tuning server capacity, i.e. the number of jobs a server runs in parallel. We present three different reinforcement learning algorithms, which generate a dynamic policy by changing the number of concurrent running jobs according to the job types and machine state. The algorithms outperform manually-tuned policies for the entire range of checked workloads, with average throughput improvement greater than 20%. On multi-core servers, the average throughput improvement is approximately 40%, which hints at the enormous improvement potential of such a tuning mechanism with the gradual transition to multi-core machines.
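One simple way to frame such capacity tuning is as a bandit-style learner that estimates throughput per candidate capacity and exploits the best one, occasionally exploring. This sketch is far simpler than the paper's three algorithms (it ignores job types and machine state), and all names are illustrative:

```python
import random

class CapacityTuner:
    """Epsilon-greedy learner over candidate server capacities (a sketch)."""

    def __init__(self, capacities, epsilon=0.1):
        self.q = {c: 0.0 for c in capacities}  # estimated throughput per capacity
        self.n = {c: 0 for c in capacities}    # observation counts
        self.epsilon = epsilon

    def choose(self):
        """Mostly pick the best-known capacity; sometimes explore."""
        if random.random() < self.epsilon:
            return random.choice(list(self.q))
        return max(self.q, key=self.q.get)

    def update(self, capacity, throughput):
        """Fold an observed throughput into the incremental mean estimate."""
        self.n[capacity] += 1
        self.q[capacity] += (throughput - self.q[capacity]) / self.n[capacity]
```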
|
|
|
Anomaly detection and diagnosis in grid environments |
| |
Lingyun Yang,
Chuang Liu,
Jennifer M. Schopf,
Ian Foster
|
|
Article No.: 33 |
|
doi>10.1145/1362622.1362667 |
|
Full text: PDF
|
|
Identifying and diagnosing anomalies in application behavior is critical to delivering reliable application-level performance. In this paper we introduce a strategy to detect anomalies and diagnose the possible reasons behind them. Our approach extends the traditional window-based strategy by using signal-processing techniques to filter out recurring, background fluctuations in resource behavior. In addition, we have developed a diagnosis technique that uses standard monitoring data to determine which related changes in behavior may cause anomalies. We evaluate our anomaly detection and diagnosis technique by applying it in three contexts when we insert anomalies into the system at random intervals. The experimental results show that our strategy detects up to 96% of anomalies while reducing the false positive rate by up to 90% compared to the traditional window average strategy. In addition, our strategy can diagnose the reason for the anomaly approximately 75% of the time.
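The window-based baseline that this work extends can be sketched as a sliding-window deviation test; the signal-processing filter for recurring background fluctuations is the paper's addition and is omitted here:

```python
from collections import deque

def detect_anomalies(series, window=5, threshold=3.0):
    """Flag points that deviate from the sliding-window mean by more than
    `threshold` standard deviations (plain window-based detection)."""
    buf = deque(maxlen=window)
    flags = []
    for x in series:
        if len(buf) == window:
            mean = sum(buf) / window
            std = (sum((v - mean) ** 2 for v in buf) / window) ** 0.5
            flags.append(std > 0 and abs(x - mean) > threshold * std)
        else:
            flags.append(False)  # not enough history yet
        buf.append(x)
    return flags
```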
|
|
|
User-friendly and reliable grid computing based on imperfect middleware |
| |
Rob V. van Nieuwpoort,
Thilo Kielmann,
Henri E. Bal
|
|
Article No.: 34 |
|
doi>10.1145/1362622.1362668 |
|
Full text: PDF
|
|
Writing grid applications is hard. First, interfaces to existing grid middleware often are too low-level for application programmers who are domain experts rather than computer scientists. Second, grid APIs tend to evolve too quickly for applications to follow. Third, failures and configuration incompatibilities require applications to use different solutions to the same problem, depending on the actual sites in use. This paper describes the Java Grid Application Toolkit (Java-GAT) that provides a high-level, middleware-independent and site-independent interface to the grid. The JavaGAT uses nested exceptions and intelligent dispatching of method invocations to handle errors and to automatically select suitable grid middleware implementations for requested operations. The JavaGAT's adaptor writing framework simplifies the implementation of interfaces to new middleware releases by combining nested exceptions and intelligent dispatching with rich default functionality. The many applications and middleware adaptors that have been provided by third-party developers indicate the viability of our approach.
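The intelligent-dispatching idea, trying middleware adaptors in turn and collecting the individual failures into one nested error, can be sketched as follows. This is a generic illustration of the pattern, not the JavaGAT API:

```python
def dispatch_with_fallback(adaptors, operation, *args):
    """Invoke `operation` on each (name, adaptor) pair until one succeeds.

    Failures are accumulated so the caller sees why every adaptor was
    rejected, instead of only the first error.
    """
    errors = []
    for name, adaptor in adaptors:
        try:
            return getattr(adaptor, operation)(*args)
        except Exception as exc:
            errors.append((name, str(exc)))
    raise RuntimeError(f"all adaptors failed: {errors}")
```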
|
|
|
SESSION: Network interfaces |
| |
Scott Pakin
|
|
|
|
|
Analyzing the impact of supporting out-of-order communication on in-order performance with iWARP |
| |
P. Balaji,
W. Feng,
S. Bhagvat,
D. K. Panda,
R. Thakur,
W. Gropp
|
|
Article No.: 35 |
|
doi>10.1145/1362622.1362670 |
|
Full text: PDF
|
|
Due to the growing need to tolerate network faults and congestion in high-end computing systems, supporting multiple network communication paths is becoming increasingly important. However, multi-path communication comes with the disadvantage of out-of-order arrival of packets (because packets may traverse different paths). While modern networking stacks such as the Internet Wide-Area RDMA Protocol (iWARP) over 10-Gigabit Ethernet (10GE) support multi-path communication, their current implementations do not handle out-of-order packets, primarily owing to the overhead this support adds to in-order communication. Specifically, in iWARP, supporting out-of-order packets requires every packet to carry additional information, causing significant overhead for packets that arrive in order. Thus, in this paper, we analyze the trade-offs in designing a feature-complete iWARP stack, i.e., one that provides support for out-of-order arriving packets, and thus multi-path systems, while focusing on the performance of in-order communication. We propose three feature-complete designs of iWARP and analyze the pros and cons of each of these designs using performance experiments based on several micro-benchmarks as well as an iso-surface visual rendering application. Our analysis reveals that the iWARP design providing the best overall performance depends on the particular characteristics of the upper layers, and that different designs are optimal based on the metric of interest.
|
|
|
Evaluating NIC hardware requirements to achieve high message rate PGAS support on multi-core processors |
| |
Keith D. Underwood,
Michael J. Levenhagen,
Ron Brightwell
|
|
Article No.: 36 |
|
doi>10.1145/1362622.1362671 |
|
Full text: PDF
|
|
Partitioned global address space (PGAS) programming models have been identified as one of the few viable approaches for dealing with emerging many-core systems. These models tend to generate many small messages, which requires specific support from the network interface hardware to enable efficient execution. In the past, Cray included E-registers on the Cray T3E to support the SHMEM API; however, with the advent of multi-core processors, the balance of computation to communication capabilities has shifted toward computation. This paper explores the message rates that are achievable with multi-core processors and simplified PGAS support on a more conventional network interface. For message rate tests, we find that simple network interface hardware is more than sufficient. We also find that even typical data distributions, such as cyclic or block-cyclic, do not need specialized hardware support. Finally, we assess the impact of such support on the well known RandomAccess benchmark.
|
|
|
High-performance ethernet-based communications for future multi-core processors |
| |
Michael Schlansker,
Nagabhushan Chitlur,
Erwin Oertli,
Paul M. Stillwell, Jr,
Linda Rankin,
Dennis Bradford,
Richard J. Carter,
Jayaram Mudigonda,
Nathan Binkert,
Norman P. Jouppi
|
|
Article No.: 37 |
|
doi>10.1145/1362622.1362672 |
|
Full text: PDF
|
|
Data centers and HPC clusters often incorporate specialized networking fabrics to satisfy system requirements. However, Ethernet's low cost and high performance are causing a shift from specialized fabrics toward standard Ethernet. Although Ethernet's low-level performance approaches that of specialized fabrics, the features that these fabrics provide such as reliable in-order delivery and flow control are implemented, in the case of Ethernet, by endpoint hardware and software. Unfortunately, current Ethernet endpoints are either slow (commodity NICs with generic TCP/IP stacks) or costly (offload engines). To address these issues, the JNIC project developed a novel Ethernet endpoint. JNIC's hardware and software were specifically designed for the requirements of high-performance communications within future data-centers and compute clusters. The architecture combines capabilities already seen in advanced network architectures with new innovations to create a comprehensive solution for scalable and high-performance Ethernet. We envision a JNIC architecture that is suitable for most in-data-center communication needs.
|
|
|
SESSION: Benchmarking |
| |
Allan Snavely
|
|
|
|
|
Optimization of sparse matrix-vector multiplication on emerging multicore platforms |
| |
Samuel Williams,
Leonid Oliker,
Richard Vuduc,
John Shalf,
Katherine Yelick,
James Demmel
|
|
Article No.: 38 |
|
doi>10.1145/1362622.1362674 |
|
Full text: PDF
|
|
We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore specific optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV) - one of the most heavily used kernels in scientific computing - across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD dual-core and Intel quad-core designs, the heterogeneous STI Cell, as well as the first scientific study of the highly multithreaded Sun Niagara2. We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural tradeoffs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms.
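The serial baseline that such optimization work starts from is SpMV over the Compressed Sparse Row (CSR) format. A minimal reference version, for orientation; the paper's contributions are the multicore-specific optimizations layered on top of this kernel:

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A*x for a sparse matrix A stored in CSR format.

    `values` holds the nonzeros row by row, `col_idx` their column
    indices, and `row_ptr[r]:row_ptr[r+1]` delimits row r's nonzeros.
    """
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y
```

The irregular, indirect access `x[col_idx[k]]` is exactly what makes SpMV memory-bound and hard to optimize.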
|
|
|
Cray XT4: an early evaluation for petascale scientific simulation |
| |
Sadaf R. Alam,
Jeffery A. Kuehn,
Richard F. Barrett,
Jeff M. Larkin,
Mark R. Fahey,
Ramanan Sankaran,
Patrick H. Worley
|
|
Article No.: 39 |
|
doi>10.1145/1362622.1362675 |
|
Full text: PDF
|
|
The scientific simulation capabilities of next generation high-end computing technology will depend on striking a balance among memory, processor, I/O, and local and global network performance across the breadth of the scientific simulation space. The Cray XT4 combines commodity AMD dual core Opteron processor technology with the second generation of Cray's custom communication accelerator in a system design whose balance is claimed to be driven by the demands of scientific simulation. This paper presents an evaluation of the Cray XT4 using micro-benchmarks to develop a controlled understanding of individual system components, providing the context for analyzing and comprehending the performance of several petascale-ready applications. Results gathered from several strategic application domains are compared with observations on the previous generation Cray XT3 and other high-end computing systems, demonstrating performance improvements across a wide variety of application benchmark problems.
|
|
|
An adaptive mesh refinement benchmark for modern parallel programming languages |
| |
Tong Wen,
Jimmy Su,
Phillip Colella,
Katherine Yelick,
Noel Keen
|
|
Article No.: 40 |
|
doi>10.1145/1362622.1362676 |
|
Full text: PDF
|
|
We present an Adaptive Mesh Refinement benchmark for evaluating the programmability and performance of modern parallel programming languages. The benchmarks employed today by language development teams, originally designed for performance evaluation of computer architectures, do not fully capture the complexity of state-of-the-art computational software systems running on today's parallel machines or on emerging ones, from multi-core processors to peta-scale High Productivity Computing Systems. This benchmark, extracted from a real application framework, presents challenges for a programming language in both expressiveness and performance. It consists of an infrastructure for finite difference calculations on block-structured adaptive meshes and a solver for elliptic Partial Differential Equations built on this infrastructure. Adaptive Mesh Refinement algorithms are challenging to implement due to the irregularity introduced by local mesh refinement. We describe the challenges posed by this benchmark through two reference implementations (C++/Fortran/MPI and Titanium) and in the context of three programming models.
|
|
|
SESSION: Grid performance |
| |
Daniel S. Katz
|
|
|
|
|
Exploring event correlation for failure prediction in coalitions of clusters |
| |
Song Fu,
Cheng-Zhong Xu
|
|
Article No.: 41 |
|
doi>10.1145/1362622.1362678 |
|
Full text: PDF
|
|
In large-scale networked computing systems, component failures become norms instead of exceptions. Failure prediction is a crucial technique for self-managing resource burdens. Failure events in coalition systems exhibit strong correlations in the time and space domains. In this paper, we develop a spherical covariance model with an adjustable timescale parameter to quantify the temporal correlation and a stochastic model to describe the spatial correlation. We further utilize the information of application allocation to discover more correlations among failure instances. We cluster failure events based on their correlations and predict their future occurrences. We implemented a failure prediction framework, called PREdictor of Failure Events Correlated Temporal-Spatially (hPREFECTs), which explores correlations among failures and forecasts the time-between-failure of future instances. We evaluate the performance of hPREFECTs both in offline failure prediction using the Los Alamos HPC traces and in online prediction in an institute-wide coalition of clusters. Experimental results show the system achieves more than 76% accuracy in offline prediction and more than 70% accuracy in online prediction during the time from May 2006 to April 2007.
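For orientation, the textbook spherical covariance model decays smoothly from a sill at lag zero to exactly zero beyond a cutoff range, which here plays the role of the adjustable timescale. This sketch uses the standard geostatistical form; the paper's exact parameterization may differ:

```python
def spherical_covariance(h, sill, timescale):
    """Spherical covariance at time lag h (standard form).

    Returns `sill` at h = 0, decays as 1 - 1.5r + 0.5r^3 with
    r = h / timescale, and is exactly 0 for lags beyond `timescale`.
    """
    if h >= timescale:
        return 0.0
    r = h / timescale
    return sill * (1.0 - 1.5 * r + 0.5 * r ** 3)
```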

Advanced data flow support for scientific grid workflow applications
Jun Qin, Thomas Fahringer
Article No.: 42 | doi>10.1145/1362622.1362679
Existing work does not provide a flexible dataset-oriented data flow mechanism to meet the complex requirements of scientific Grid workflow applications. In this paper we present an approach to this problem that introduces a data collection concept and corresponding collection distribution constructs, inspired by HPF but applied to Grid workflow applications. Based on these constructs, more fine-grained data flows can be specified at the abstract workflow language level, such as mapping a portion of a dataset to an activity, or independently distributing multiple datasets, not necessarily with the same number of data elements, onto loop iterations. Our approach reduces data duplication, optimizes data transfers, and simplifies the effort of porting workflow applications to the Grid. We have extended AGWL with these concepts and implemented the corresponding runtime support in ASKALON. We apply our approach to some real-world scientific workflow applications and report performance results.
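The HPF-style BLOCK distribution that inspired these constructs can be sketched as follows; this illustrates the distribution idea only, not AGWL's actual syntax or runtime:

```python
# HPF-style BLOCK distribution: partition a dataset's element indices
# across loop iterations as evenly as possible, earlier blocks taking
# the remainder. Datasets of different sizes can be distributed
# independently onto the same iterations.
def block_distribute(n_elems, n_iters):
    """Return, per iteration, the list of dataset indices mapped to it."""
    base, rem = divmod(n_elems, n_iters)
    parts, start = [], 0
    for i in range(n_iters):
        size = base + (1 if i < rem else 0)
        parts.append(list(range(start, start + size)))
        start += size
    return parts
```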

Falkon: a Fast and Light-weight tasK executiON framework
Ioan Raicu, Yong Zhao, Catalin Dumitrescu, Ian Foster, Mike Wilde
Article No.: 43 | doi>10.1145/1362622.1362680
To enable the rapid execution of many tasks on compute clusters, we have developed Falkon, a Fast and Light-weight tasK executiON framework. Falkon integrates (1) multi-level scheduling to separate resource acquisition (via, e.g., requests to batch schedulers) from task dispatch, and (2) a streamlined dispatcher. Falkon's integration of multi-level scheduling and streamlined dispatchers delivers performance not provided by any other system. We describe the Falkon architecture and implementation, and present performance results for both microbenchmarks and applications. Microbenchmarks show that Falkon's throughput (487 tasks/sec) and scalability (to 54,000 executors and 2,000,000 tasks processed in just 112 minutes) are one to two orders of magnitude better than other systems used in production Grids. Large-scale astronomy and medical applications executed under Falkon by the Swift parallel programming system achieve up to a 90% reduction in end-to-end run time, relative to versions that execute tasks via separate scheduler submissions.

SESSION: Storage, file systems, and GPU hashing
Brett M. Bode

RobuSTore: a distributed storage architecture with robust and high performance
Huaxia Xia, Andrew A. Chien
Article No.: 44 | doi>10.1145/1362622.1362682
Emerging large-scale scientific applications require access to large data objects with high and robust performance. We propose RobuSTore, a storage architecture that combines erasure codes and speculative access mechanisms for parallel writes and reads in distributed environments. These mechanisms can effectively aggregate the bandwidth of a large number of distributed disks and statistically tolerate per-disk performance variation. Our simulation results confirm the high and robust performance of RobuSTore for both write and read operations compared to traditional parallel storage systems. For example, for a 1GB data access using 64 disks, RobuSTore achieves an average bandwidth of 186MBps for writes and 400MBps for reads, nearly 6x and 15x that achieved by a RAID-0 system. The standard deviation of access latency is only 0.5 second, about 9% of the write latency and 20% of the read latency, a 5-fold improvement over RAID-0. These improvements come at moderate cost: about a 40% increase in I/O operations and a 2x-3x increase in storage capacity utilization.
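The speculative-access idea can be sketched with an (n, k) erasure-coded read: the client issues all n fragment requests and the read completes as soon as any k arrive, so a slow disk cannot stall it (a simplified latency model, not RobuSTore's implementation):

```python
# Speculative k-of-n read under an (n, k) erasure code: the object is
# decodable from any k of the n fragments, so completion latency is the
# k-th smallest per-disk latency -- stragglers beyond rank k are masked.
def read_latency(fragment_latencies, k):
    """Completion time = k-th order statistic of per-disk latencies."""
    return sorted(fragment_latencies)[k - 1]
```

With k = 3 of 5 fragments, one disk taking 100x longer than the rest has no effect on the read, which is the source of the latency-variance reduction the abstract reports.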

A user-level secure grid file system
Ming Zhao, Renato J. Figueiredo
Article No.: 45 | doi>10.1145/1362622.1362683
A grid-wide distributed file system provides convenient data access interfaces that facilitate fine-grained cross-domain data sharing and collaboration. However, existing widely-adopted distributed file systems do not meet the security requirements of grid systems. This paper presents a Secure Grid File System (SGFS) which supports GSI-based authentication and access control, end-to-end message privacy, and integrity. It employs user-level virtualization of NFS to provide transparent grid data access, leveraging existing, unmodified clients and servers. It supports user- and application-tailored security customization per SGFS session, and leverages secure management services to control and configure the sessions. The system conforms to the GSI grid security infrastructure and allows for seamless integration with other grid middleware. An SGFS prototype is evaluated with both file system benchmarks and typical applications, demonstrating that it can achieve strong security with acceptable overhead and substantially outperform native NFS in wide-area environments by using disk caching.

Efficient gather and scatter operations on graphics processors
Bingsheng He, Naga K. Govindaraju, Qiong Luo, Burton Smith
Article No.: 46 | doi>10.1145/1362622.1362684
Gather and scatter are two fundamental data-parallel operations, in which a large number of data items are read (gathered) from or written (scattered) to given locations. In this paper, we study these two operations on graphics processing units (GPUs). With superior computing power and high memory bandwidth, GPUs have become a commodity multiprocessor platform for general-purpose high-performance computing. However, due to the random-access nature of gather and scatter, a naive implementation of the two operations suffers from low utilization of the memory bandwidth and, consequently, long, unhidden memory latency. Additionally, the architectural details of GPUs, in particular the memory hierarchy design, are unclear to programmers. Therefore, we design multi-pass gather and scatter operations to improve their data access locality, and develop a performance model to help understand and optimize these two operations. We have evaluated our algorithms on sorting, hashing, and sparse matrix-vector multiplication in comparison with their optimized CPU counterparts. Our results show that these optimizations yield a 2--4X improvement in GPU bandwidth utilization and a 30--50% improvement in response time. Overall, our optimized GPU implementations are 2--7X faster than their optimized CPU counterparts.
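A CPU-side sketch of the multi-pass idea: rather than scattering to arbitrary locations in one pass, each pass handles only one destination region, so writes stay local (the pass count and region partitioning here are illustrative, not the paper's GPU kernels):

```python
# Multi-pass scatter: partition the output array into `passes` contiguous
# regions and, in each pass, perform only the writes whose destination
# falls in that pass's region. Each pass touches a small, cache-friendly
# window of the output instead of the whole array.
def multipass_scatter(values, dests, out_size, passes=4):
    out = [None] * out_size
    region = (out_size + passes - 1) // passes
    for p in range(passes):
        lo, hi = p * region, min((p + 1) * region, out_size)
        for v, d in zip(values, dests):
            if lo <= d < hi:      # only writes landing in this region
                out[d] = v
    return out
```

The result is identical to a single-pass scatter; the trade-off is re-reading the input once per pass in exchange for localized writes.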

SESSION: Modeling in action
Vladimir Getov

A genetic algorithms approach to modeling the performance of memory-bound computations
Mustafa M. Tikir, Laura Carrington, Erich Strohmaier, Allan Snavely
Article No.: 47 | doi>10.1145/1362622.1362686
Benchmarks that measure memory bandwidth, such as STREAM, Apex-MAPS, and MultiMAPS, are increasingly popular due to the "Von Neumann" bottleneck of modern processors, which causes many calculations to be memory-bound. We present a scheme for predicting the performance of HPC applications based on the results of such benchmarks. A Genetic Algorithm approach is used to "learn" bandwidth as a function of cache hit rates per machine, with MultiMAPS as the fitness test. The specific results are 56 individual performance predictions, including 3 full-scale parallel applications run on 5 different modern HPC architectures with various CPU counts and inputs, predicted within 10% average difference with respect to independently verified runtimes.
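One common closed form such a scheme could learn is a hit-rate-weighted blend of per-level bandwidths; the level count and bandwidth values below are illustrative assumptions, not the paper's fitted model:

```python
# Memory-bound time model: effective bandwidth as a harmonic blend of
# per-level bandwidths, weighted by cache hit rates. A GA would fit the
# bw tuple per machine against benchmark curves such as MultiMAPS.
def effective_bandwidth(h1, h2, bw=(40e9, 10e9, 2e9)):
    """h1 = L1 hit rate; h2 = L2 hit rate among L1 misses. bw = (L1, L2, DRAM)."""
    inv = h1 / bw[0] + (1 - h1) * (h2 / bw[1] + (1 - h2) / bw[2])
    return 1.0 / inv

def mem_time(bytes_moved, h1, h2):
    """Predicted memory time for an application phase."""
    return bytes_moved / effective_bandwidth(h1, h2)
```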

Performance under failures of high-end computing
Ming Wu, Xian-He Sun, Hui Jin
Article No.: 48 | doi>10.1145/1362622.1362687
Modern high-end computers are unprecedentedly complex, and the occurrence of faults is inevitable when solving large-scale applications on future Petaflop machines. Many methods have been proposed in recent years to mask faults. These methods, however, impose various performance and production costs. A better understanding of the influence of faults on application performance is necessary to use existing fault-tolerant methods wisely. In this study, we first introduce practical and effective performance models to predict application completion time under system failures. These models separate the influence of failure rate, failure repair, checkpointing period, checkpointing cost, and parallel task allocation on parallel and sequential execution times. To benefit the end users of a given computing platform, we then develop effective fault-aware task scheduling algorithms to optimize application performance under system failures. Finally, extensive simulations and experiments are conducted to evaluate our prediction models and scheduling strategies with actual failure traces.
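The paper's own models are not reproduced here, but the classic first-order checkpointing model (Young's approximation) shows the kind of trade-off such models capture between checkpoint overhead and rework after a failure:

```python
import math

# First-order checkpointing model (Young, 1974) -- a standard baseline,
# not necessarily the models used in the paper. With checkpoint cost C
# and mean time between failures M, the optimal interval is sqrt(2*C*M).
def optimal_interval(C, M):
    return math.sqrt(2 * C * M)

def expected_slowdown(tau, C, M):
    """Expected wall-clock inflation for checkpoint interval tau:
    checkpoint overhead C/tau plus expected rework (tau + C) / (2*M)."""
    return 1 + C / tau + (tau + C) / (2 * M)
```

Checkpointing too often wastes time writing state; too rarely, a failure discards a long stretch of work, and the minimum sits at the square-root balance point.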

Bounding energy consumption in large-scale MPI programs
Barry Rountree, David K. Lowenthal, Shelby Funk, Vincent W. Freeh, Bronis R. de Supinski, Martin Schulz
Article No.: 49 | doi>10.1145/1362622.1362688
Power is now a first-order design constraint in large-scale parallel computing. Used carefully, dynamic voltage scaling can execute parts of a program at a slower CPU speed to achieve energy savings with a relatively small (possibly zero) time delay. However, the problem of when to change frequencies in order to optimize energy savings is NP-complete, which has led to many heuristic energy-saving algorithms. To determine how closely these algorithms approach optimal savings, we developed a system that determines a bound on the energy savings for an application. Our system uses a linear programming solver that takes the application communication trace and the cluster power characteristics as inputs, and outputs a schedule that realizes this bound. We apply our system to three scientific programs, two of which---particle simulation and UMT2K---exhibit load imbalance. Results from our bounding technique show that particle simulation is more amenable to energy savings than UMT2K.
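The bound can be illustrated at toy scale: choose a CPU frequency per program phase to minimize energy subject to a deadline. The paper uses a linear program over a communication trace; exhaustive search below shows the same formulation (all numbers illustrative):

```python
from itertools import product

# Toy frequency-schedule bound: for each phase pick one (speed, power)
# point; minimize total energy subject to a wall-clock deadline. The LP
# in the paper scales this idea to real traces; here brute force suffices.
def min_energy(phases, freqs, deadline):
    """phases: list of work amounts; freqs: list of (speed, power) pairs.
    Returns the minimum feasible energy, or None if no schedule meets
    the deadline."""
    best = None
    for choice in product(freqs, repeat=len(phases)):
        t = sum(w / s for w, (s, p) in zip(phases, choice))
        if t <= deadline:
            e = sum((w / s) * p for w, (s, p) in zip(phases, choice))
            best = e if best is None else min(best, e)
    return best
```

Slack in the deadline lets some phases run at the low-power point for free, which is exactly the savings a heuristic DVFS algorithm tries to find.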

SESSION: Performance optimization
Derek Chiou

Application development on hybrid systems
Roger D. Chamberlain, Mark A. Franklin, Eric J. Tyson, Jeremy Buhler, Saurabh Gayen, Patrick Crowley, James H. Buckley
Article No.: 50 | doi>10.1145/1362622.1362690
Hybrid systems consisting of a multitude of different computing device types are interesting targets for high-performance applications. Chip multiprocessors, FPGAs, DSPs, and GPUs can readily be put together into a hybrid system; however, it is not at all clear that one can effectively deploy applications on such a system. Coordinating multiple languages, especially very different ones like hardware and software languages, is awkward and error-prone, and implementing communication mechanisms between different device types unnecessarily increases development time. This is compounded by the fact that the application developer, to be effective, needs performance data about the application early in the design cycle. We describe an application development environment specifically targeted at hybrid systems, supporting data-flow semantics between application kernels deployed on a variety of device types. A specific feature of the development environment is the availability of performance estimates (via simulation) prior to actual deployment on a physical system.

Multi-level tiling: M for the price of one
DaeGon Kim, Lakshminarayanan Renganarayanan, Dave Rostron, Sanjay Rajopadhye, Michelle Mills Strout
Article No.: 51 | doi>10.1145/1362622.1362691
Tiling is a widely used loop transformation for exposing and exploiting parallelism and data locality. High-performance implementations use multiple levels of tiling to exploit the hierarchy of parallelism and cache/register locality, so efficient generation of multi-level tiled code is essential for effective use of multi-level tiling. Parameterized tiled code, where tile sizes are not fixed but left as symbolic parameters, can enable several dynamic and run-time optimizations. Previous solutions to multi-level tiled loop generation are limited to the case where tile sizes are fixed at compile time. We present an algorithm that can generate multi-level parameterized tiled loops at the same cost as generating single-level tiled loops. The efficiency of our method is demonstrated on several benchmarks. We also present a method, useful in register tiling, for separating partial and full tiles at any arbitrary level of tiling. The code generator we have implemented is available as an open-source tool.
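A two-level parameterized tiled loop in miniature, with tile sizes T1 and T2 left as run-time parameters rather than compile-time constants (a sketch of the concept, not the paper's generated code):

```python
# Two-level parameterized tiling of a 1-D loop: T1 (outer) and T2 (inner)
# are symbolic parameters chosen at run time. The iteration space is
# covered exactly once regardless of whether the tile sizes divide n,
# which is the bounds bookkeeping multi-level tiled code generation handles.
def tiled_sum(a, T1, T2):
    n, total = len(a), 0
    for ii in range(0, n, T1):                        # level-1 tiles
        end1 = min(ii + T1, n)
        for jj in range(ii, end1, T2):                # level-2 tiles
            for j in range(jj, min(jj + T2, end1)):   # points in tile
                total += a[j]
    return total
```

Because T1 and T2 are ordinary variables, the same code serves any tile-size choice, enabling the run-time tuning the abstract describes.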

Implementation and performance analysis of non-blocking collective operations for MPI
Torsten Hoefler, Andrew Lumsdaine, Wolfgang Rehm
Article No.: 52 | doi>10.1145/1362622.1362692
Collective operations and non-blocking point-to-point operations have always been part of MPI. Although non-blocking collective operations are an obvious extension to MPI, there have been no comprehensive studies of this functionality. In this paper we present LibNBC, a portable high-performance library implementing non-blocking collective MPI communication operations. LibNBC provides non-blocking versions of all MPI collective operations, is layered on top of MPI-1, and is portable to nearly all parallel architectures. To measure the performance characteristics of our implementation, we also present a microbenchmark for measuring both latency and the overlap of computation and communication. Experimental results demonstrate that the blocking performance of the collective operations in our library is comparable to that of collective operations in other high-performance MPI implementations. Our library introduces very low overhead between the application and the underlying MPI and thus, in conjunction with the ability to overlap communication with computation, offers the potential for optimizing real-world applications.
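The interface a non-blocking collective adds can be sketched as a handle that the caller tests or waits on; here a background thread stands in for MPI progress (an illustration of the semantics, not LibNBC's implementation):

```python
import threading

# Handle-based non-blocking operation: start returns immediately; the
# caller overlaps computation and later calls wait(), in the spirit of
# LibNBC's NBC_Wait / MPI's request objects.
class Handle:
    def __init__(self, fn, args):
        self.result = None
        self._t = threading.Thread(target=self._run, args=(fn, args))
        self._t.start()

    def _run(self, fn, args):
        self.result = fn(*args)

    def wait(self):
        """Block until the operation completes; return its result."""
        self._t.join()
        return self.result

def iallreduce(values):
    """Non-blocking 'allreduce' (sum) on local data -- returns a handle."""
    return Handle(sum, (values,))
```

Usage mirrors the overlap pattern the microbenchmark measures: start the collective, do independent computation, then wait.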

SESSION: Scheduling
Greg Bronevetsky

Efficient operating system scheduling for performance-asymmetric multi-core architectures
Tong Li, Dan Baumberger, David A. Koufaty, Scott Hahn
Article No.: 53 | doi>10.1145/1362622.1362694
Recent research advocates asymmetric multi-core architectures, where cores in the same processor can have different performance. These architectures can deliver both single-threaded performance and multithreaded throughput at lower cost (e.g., die size and power). However, they also pose unique challenges to operating systems, which traditionally assume homogeneous hardware. This paper presents AMPS, an operating system scheduler that efficiently supports both SMP- and NUMA-style performance-asymmetric architectures. AMPS contains three components: asymmetry-aware load balancing, faster-core-first scheduling, and NUMA-aware migration. We have implemented AMPS in Linux kernel 2.6.16 and used CPU clock modulation to emulate performance asymmetry on SMP and NUMA systems. For various workloads, we show that AMPS achieves a median speedup of 1.16, with a maximum of 1.44, over stock Linux on the SMP system, and a median of 1.07, with a maximum of 2.61, on the NUMA system. Our results also show that AMPS improves the fairness and repeatability of application performance measurements.
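Faster-core-first scheduling combined with asymmetry-aware load balancing can be sketched as speed-scaled greedy placement (the actual AMPS policy is more involved; this is an assumption-laden miniature):

```python
# Speed-scaled placement: a core's load is its assigned work divided by
# its speed, so faster cores naturally absorb more work and, all else
# equal, a lone thread lands on the fastest core ("faster-core-first").
def assign(threads, core_speeds):
    """threads: work amounts. Returns (work, core_index) pairs in
    descending-work order."""
    load = [0.0] * len(core_speeds)
    placement = []
    for w in sorted(threads, reverse=True):
        i = min(range(len(core_speeds)),
                key=lambda c: (load[c] + w) / core_speeds[c])
        load[i] += w
        placement.append((w, i))
    return placement
```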

A job scheduling framework for large computing farms
Gabriele Capannini, Ranieri Baraglia, Diego Puppin, Laura Ricci, Marco Pasquali
Article No.: 54 | doi>10.1145/1362622.1362695
In this paper, we propose a new method, called Convergent Scheduling, for scheduling a continuous stream of batch jobs on the machines of large-scale computing farms. This method exploits a set of heuristics that guide the scheduler in making decisions. Each heuristic manages a specific problem constraint and contributes to a value that measures the degree of matching between a job and a machine. Scheduling choices are made to meet the QoS requested by the submitted jobs while optimizing the usage of hardware and software resources. We compared it with some of the most common job scheduling algorithms, i.e., Backfilling and Earliest Deadline First. Convergent Scheduling is able to compute good assignments while remaining a simple and modular algorithm.
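The weighted-heuristics matching can be sketched directly; the example heuristics and weights below are invented for illustration, not taken from the paper:

```python
# Convergent-Scheduling-style matching: each heuristic scores one
# constraint for a (job, machine) pair; the scheduler sums weighted
# scores and assigns the job to the best-matching machine.
def match_score(job, machine, heuristics, weights):
    return sum(w * h(job, machine) for h, w in zip(heuristics, weights))

def best_machine(job, machines, heuristics, weights):
    return max(machines, key=lambda m: match_score(job, m, heuristics, weights))

# Illustrative heuristics: one enforces a memory constraint, one prefers speed.
fits_memory = lambda j, m: 1.0 if m["mem"] >= j["mem"] else 0.0
prefer_fast = lambda j, m: m["speed"] / 10.0
```

Modularity comes from the fact that adding a new constraint means adding one scoring function and a weight, leaving the rest of the scheduler unchanged.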

Optimizing center performance through coordinated data staging, scheduling and recovery
Zhe Zhang, Chao Wang, Sudharshan S. Vazhkudai, Xiaosong Ma, Gregory G. Pike, John W. Cobb, Frank Mueller
Article No.: 55 | doi>10.1145/1362622.1362696
Procurement and the optimized utilization of Petascale supercomputers and centers is a renewed national priority. Sustained performance and availability of such large centers is a key technical challenge significantly impacting their usability. Storage systems are known to be the primary fault source leading to data unavailability and job resubmissions, which reduces center performance, partially due to the lack of coordination between I/O activities and job scheduling. In this work, we propose coordinating job scheduling with data staging/offloading and on-demand reconstruction of staged data to address the availability of job input data and to improve center-wide performance. Fundamental to both mechanisms is the efficient management of transient data, in the way it is scheduled and recovered. Collectively, from the center's standpoint, these techniques optimize resource usage and increase data/service availability. From the user's standpoint, they reduce job turnaround time and optimize the usage of allocated time.

SESSION: Gordon Bell prize finalists
David H. Bailey

A 281 Tflops calculation for X-ray protein structure analysis with special-purpose computers MDGRAPE-3
Yousuke Ohno, Eiji Nishibori, Tetsu Narumi, Takahiro Koishi, Tahir H. Tahirov, Hideo Ago, Masashi Miyano, Ryutaro Himeno, Toshikazu Ebisuzaki, Makoto Sakata, Makoto Taiji
Article No.: 56 | doi>10.1145/1362622.1362698
We have achieved a sustained calculation speed of 281 Tflops for the optimization of the 3-D structures of proteins from X-ray experimental data by the Genetic Algorithm - Direct Space (GA-DS) method. In this calculation we used MDGRAPE-3, a special-purpose computer for molecular simulations with a peak performance of 752 Tflops. In the GA-DS method, a set of selected parameters defining the crystal structures of proteins is optimized by a Genetic Algorithm. As the criterion for evaluating the model parameters, we used the reliability factor R1, which indicates the statistical difference between the calculated and the measured diffraction data. Evaluating this factor requires reconstructing the diffraction patterns of the model structures every time the model is updated; the nonequispaced Discrete Fourier Transformation (DFT) used to calculate the diffraction patterns therefore dominates the computation time. To accelerate the DFT calculations, we used the special-purpose computer MDGRAPE-3. We investigated the molecule Carbamoyl-Phosphate Synthetase. The final reliability factors were much smaller than the typical values obtained with other methods such as the Molecular Replacement (MR) method. Our results demonstrate that high-performance computing with the GA-DS method on special-purpose computers is effective for the structure determination of biological molecules, and that the method has the potential to be widely used in the near future.
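The reliability factor R1 has a standard crystallographic definition, sketched below (the abstract does not spell out the paper's exact variant):

```python
# Crystallographic reliability factor R1: normalized disagreement between
# observed and calculated structure-factor amplitudes. Lower is better;
# GA-DS minimizes this over candidate structures.
def r1_factor(f_obs, f_calc):
    num = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    den = sum(abs(o) for o in f_obs)
    return num / den
```

Every GA update changes the model, which changes f_calc, which requires recomputing the diffraction pattern via the nonequispaced DFT — the step offloaded to MDGRAPE-3.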

First-principles calculations of large-scale semiconductor systems on the Earth Simulator
Takahisa Ohno, Takenori Yamamoto, Tatsunobu Kokubo, Akira Azami, Yuta Sakaguchi, Tsuyoshi Uda, Takahiro Yamasaki, Daisuke Fukata, Junichiro Koga
Article No.: 57 | doi>10.1145/1362622.1362699
First-principles simulations of large-scale semiconductor systems using the PHASE code on the Earth Simulator (ES) achieve high performance relative to the theoretical peak. PHASE, designed for vector-parallel systems like the ES, demonstrates excellent parallel efficiency. We simulated an arsenic donor in silicon using unit cells of up to 8,000 atoms. A sustained performance of 14.6 TFlop/s was measured on 3,072 processing elements, which corresponds to 59% of the theoretical peak performance. Preliminary results using a 10,648-atom unit cell are also presented.

Extending stability beyond CPU millennium: a micron-scale atomistic simulation of Kelvin-Helmholtz instability
J. N. Glosli, D. F. Richards, K. J. Caspersen, R. E. Rudd, J. A. Gunnels, F. H. Streitz
Article No.: 58 | doi>10.1145/1362622.1362700
We report the computational advances that have enabled the first micron-scale simulation of a Kelvin-Helmholtz (KH) instability using molecular dynamics (MD). The advances are in three key areas for massively parallel computation such as on BlueGene/L (BG/L): fault tolerance, application kernel optimization, and highly efficient parallel I/O. In particular, we have developed novel capabilities for handling hardware parity errors and improving the speed of interatomic force calculations, while achieving near optimal I/O speeds on BG/L, allowing us to achieve excellent scalability and improve overall application performance. As a result we have successfully conducted a 2-billion atom KH simulation amounting to 2.8 CPU-millennia of run time, including a single, continuous simulation run in excess of 1.5 CPU-millennia. We have also conducted 9-billion and 62.5-billion atom KH simulations. The current optimized ddcMD code is benchmarked at 115.1 TFlop/s in our scaling study and 103.9 TFlop/s in a sustained science run, with additional improvements ongoing. These improvements enabled us to run the first MD simulations of micron-scale systems developing the KH instability.

WRF nature run
John Michalakes, Josh Hacker, Richard Loft, Michael O. McCracken, Allan Snavely, Nicholas J. Wright, Tom Spelce, Brent Gorda, Robert Walkup
Article No.: 59 | doi>10.1145/1362622.1362701
The Weather Research and Forecast (WRF) model is a limited-area model of the atmosphere for mesoscale research and operational numerical weather prediction (NWP). A petascale problem is a WRF nature run that provides a very high-resolution "truth" against which coarser simulations or perturbation runs may be compared for purposes of studying predictability, stochastic parameterization, and fundamental dynamics. We carried out a nature run involving an idealized high-resolution rotating fluid on the hemisphere to investigate scales that span the k^-3 to k^-5/3 kinetic energy spectral transition of the observed atmosphere, using 65,536 processors of the BG/L machine at LLNL. We worked through issues of parallel I/O and scalability. The primary result is not just the scalability and high Tflops number, but an important step toward understanding weather predictability at high resolution.