Proceedings of the 1990 ACM/IEEE conference on Supercomputing
LAPACK: a portable linear algebra library for high-performance computers
E. Anderson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney, J. Du Croz, S. Hammarling, J. Demmel, C. Bischof, D. Sorensen
The goal of the LAPACK project is to design and implement a portable linear algebra library for efficient use on a variety of high-performance computers. The library is based on the widely used LINPACK and EISPACK packages for solving linear equations, ...
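To show the flavor of interface LAPACK settled on, here is a minimal sketch that solves a dense linear system through the modern LAPACKE C binding; the 1990 library itself was Fortran, so the C wrapper and the specific matrix data are illustrative assumptions, not the paper's code.

```c
/* Minimal sketch: solve A x = b with LAPACK's dgesv via the LAPACKE
 * C interface (link with -llapacke -llapack).  The original library
 * described in the paper exposed Fortran routines; this modern wrapper
 * and the sample data are illustrative. */
#include <stdio.h>
#include <lapacke.h>

int main(void) {
    double a[9] = { 4, 1, 2,     /* row-major 3x3 matrix A */
                    1, 5, 3,
                    2, 3, 6 };
    double b[3] = { 1, 2, 3 };   /* right-hand side, overwritten with x */
    lapack_int ipiv[3];          /* pivot indices from the LU factorization */

    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 3, 1,
                                    a, 3, ipiv, b, 1);
    if (info != 0) {             /* info > 0 means U is exactly singular */
        fprintf(stderr, "dgesv failed: info = %d\n", (int)info);
        return 1;
    }
    printf("x = %f %f %f\n", b[0], b[1], b[2]);
    return 0;
}
```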
Hierarchical blocking and data flow analysis for numerical linear algebra
The optimization of BLAS2 and BLAS3 for linear algebra on computers with hierarchical memory systems is discussed. A new blocking strategy called hierarchical blocking and data flow analysis is proposed and its applications are given. Numerical results ...
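The blocking idea in this abstract can be made concrete with one level of tiling for matrix multiplication; a minimal sketch, with the tile size BS an illustrative assumption rather than the paper's multi-level hierarchy:

```c
/* One level of blocking for C = C + A*B (all n x n, row-major).
 * Working on BS x BS tiles keeps the active data in a fast memory
 * level; a hierarchical scheme applies the same idea recursively
 * across several levels.  BS = 64 is an illustrative choice. */
#include <stddef.h>

#define BS 64   /* tile edge; tune so ~3*BS*BS doubles fit in cache */

void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += BS)
      for (size_t kk = 0; kk < n; kk += BS)
        for (size_t jj = 0; jj < n; jj += BS)
          /* multiply one pair of tiles; the && bounds handle edges */
          for (size_t i = ii; i < ii + BS && i < n; i++)
            for (size_t k = kk; k < kk + BS && k < n; k++) {
              double a = A[i*n + k];
              for (size_t j = jj; j < jj + BS && j < n; j++)
                  C[i*n + j] += a * B[k*n + j];
            }
}
```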
Multilinear algebra and parallel programming
We report on preliminary results of a joint project of the Center for Large Scale Computation at the City University of New York and the Department of Computer and Information Sciences at The Ohio State University to study the use of multilinear algebra ...
The impact of memory organization on the performance of matrix multiplication
Matrix multiplication may be considered as a model problem for analyzing the performance of more complex algorithms. On CRAY and IBM computer systems, there are library routines for this task that operate at high megaflop rates. Other programs from ...
A linear array of processors with partially shared memory for parallel solution of PDE
We propose a multiprocessor architecture with partially shared memory blocks, which is, we think, best suited for the successive approximation of scientific computing problems, such as matrix operations, partial differential equations, etc. The topology ...
On randomly interleaved memories
Memory address interleaving, where an address k generated by a processor is mapped into the memory bank k (mod m), is a basic technique for increasing memory bandwidth. However, the access conflicts that can occur in interleaved memories sometimes ...
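To see why the k (mod m) mapping conflicts on strided streams, a minimal sketch that counts the distinct banks an address stream touches (the bank count and strides are illustrative assumptions):

```c
/* Count distinct banks hit by a strided stream under the k mod m
 * interleaving described above.  When the stride shares a factor with
 * the bank count M, only M / gcd(stride, M) banks are used, so the
 * effective bandwidth collapses -- the motivation for randomized
 * interleavings. */
#include <stdio.h>
#include <string.h>

#define M 16   /* number of banks (illustrative) */

int main(void) {
    for (unsigned stride = 1; stride <= 8; stride++) {
        int hit[M];
        memset(hit, 0, sizeof hit);
        for (unsigned i = 0; i < 1024; i++)
            hit[(i * stride) % M] = 1;   /* bank = address mod M */
        int banks = 0;
        for (int b = 0; b < M; b++) banks += hit[b];
        printf("stride %u -> %d of %d banks used\n", stride, banks, M);
    }
    return 0;
}
```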
Tracing application program execution on the Cray X-MP and Cray 2
Important insights into program operation can be gained by observing dynamic execution behavior. Unfortunately, many high-performance machines provide execution profile summaries as the only tool for performance investigation. We have developed a ...
Parallel program debugging with on-the-fly anomaly detection
We describe an approach for parallel debugging that coordinates static analysis with efficient on-the-fly access anomaly detection. We are developing on-the-fly instrumentation mechanisms for the structured synchronization primitives of Parallel ...
Improving instruction cache behavior by reducing cache pollution
In this paper we describe compiler techniques for improving instruction cache performance. Through repositioning of the code in main memory, leaving memory locations unused, code duplication, and code propagation, the effectiveness of the cache can be ...
A parallel Monte Carlo search algorithm for the conformational analysis of proteins
In recent years several approaches have been proposed to overcome the multiple minima problem associated with non-linear optimization techniques used in the analysis of molecular conformations. One such technique based on a parallel Monte Carlo search ...
Folding RNA on the Cray-2
Predicting RNA folding is a very computationally intensive task that depends heavily on the assumptions of the folding model. The 'stem list method' provides a flexible framework to change the assumptions of the model, but the price for this ...
A parallel computational approach using a cluster of IBM ES/3090 600Js for physical mapping of chromosomes
A standard technique for mapping a chromosome is to randomly select pieces, to use restriction enzymes to cut these pieces into fragments, and then to use the fragments for estimating the probability of overlap of these pieces. We describe a ...
Experience with a performance analyzer for multithreaded applications
Determining the effectiveness of parallelization requires performance data about elapsed process time and total CPU time. Furthermore, it is desirable not to have to run a parallel application in a stand-alone environment in order to obtain the profile. ...
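The two quantities named here, elapsed time and total CPU time, can be sampled directly on a modern POSIX system; a minimal sketch, where workload() is a hypothetical stand-in for the parallel region being profiled:

```c
/* Sample elapsed (wall-clock) time and total process CPU time around a
 * region of interest.  CPU time summed over all threads divided by
 * elapsed time gives a rough effective-parallelism figure, in the
 * spirit of the abstract.  workload() is a placeholder. */
#include <stdio.h>
#include <time.h>

static double secs(struct timespec t) { return t.tv_sec + t.tv_nsec * 1e-9; }

static void workload(void) {          /* stand-in for the parallel region */
    volatile double s = 0;
    for (long i = 0; i < 100000000L; i++) s += i * 1e-9;
}

int main(void) {
    struct timespec w0, w1, c0, c1;
    clock_gettime(CLOCK_MONOTONIC, &w0);          /* elapsed time */
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c0); /* CPU time, all threads */
    workload();
    clock_gettime(CLOCK_MONOTONIC, &w1);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c1);

    double wall = secs(w1) - secs(w0);
    double cpu  = secs(c1) - secs(c0);
    printf("elapsed %.3fs, cpu %.3fs, effective parallelism %.2f\n",
           wall, cpu, cpu / wall);
    return 0;
}
```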
Performance evaluation of the IBM RISC System/6000: comparison of an optimized scalar processor with two vector processors
RISC System/6000 computers are workstations with a reduced instruction set processor recently developed by IBM. This report details the performance of the 6000-series computers as measured using a set of portable, standard-Fortran, computationally-...
The characterization of two scientific workloads using the CRAY X-MP performance monitor
The weekend production period on a CRAY X-MP was monitored for several months at each of two supercomputing sites. The hardware performance monitor available on the X-MP was used to collect the data at each site. Various metrics are computed using the ...
Supercomputer network selection: a case study
With the purchase of a Cray-2 supercomputer, Eli Lilly and Company (Lilly) needed a high performance network to provide communications with this computer. At the time of installation of the Cray, this network had to provide access from VAX/VMS computers ...
Very high performance networking for supercomputing
NASA Ames has installed a very high bandwidth, 1 gigabit/second, local area network provided by Ultra Network Technologies to study the feasibility and performance of networking supercomputers with minisupercomputers and workstations at the effective ...
Cost-performance analysis of heterogeneity in supercomputer architectures
Heterogeneity has appeared as a cost-effective approach to designing high-performance computers. This paper analyzes the cost-performance of heterogeneity in supercomputer architectures. Queueing models are used to study the performance of homogeneous and ...
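One textbook instance of such a queueing comparison is one fast server versus m slow servers of equal total capacity; the sketch below uses M/M/1 and M/M/m (Erlang C) mean response times, with the model choice and parameters as illustrative assumptions, not the paper's.

```c
/* Compare mean response time of one fast server (M/M/1 at rate m*mu)
 * with m slow servers (M/M/m via the Erlang-C formula) at equal total
 * capacity -- a textbook version of the homogeneity tradeoff the
 * abstract studies.  Rates are illustrative. */
#include <stdio.h>

static double erlang_c(int m, double a) {   /* a = lambda/mu, offered load */
    double term = 1.0, sum = 1.0;           /* running a^k/k!, k = 0 term */
    for (int k = 1; k < m; k++) { term *= a / k; sum += term; }
    double am  = term * a / m;              /* a^m / m! */
    double rho = a / m;
    return am / ((1.0 - rho) * sum + am);   /* P(arrival must wait) */
}

int main(void) {
    double lambda = 3.0, mu = 1.0;          /* illustrative rates */
    for (int m = 1; m <= 8; m *= 2) {
        if (lambda >= m * mu) continue;     /* skip unstable configurations */
        double fast = 1.0 / (m * mu - lambda);                   /* M/M/1 */
        double slow = 1.0 / mu +
                      erlang_c(m, lambda / mu) / (m * mu - lambda); /* M/M/m */
        printf("m=%d  one fast server: %.3f  m slow servers: %.3f\n",
               m, fast, slow);
    }
    return 0;
}
```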
Fast barrier synchronization hardware
Many recent studies have considered the importance of barrier synchronization overhead on parallel loop performance, especially for large-scale parallel machines. This paper describes a hardware scheme for supporting fast barrier synchronization. It ...
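As a software point of reference for the overhead such hardware removes, a minimal sketch of a centralized sense-reversing barrier using C11 atomics (illustrative only; the paper's scheme is in hardware):

```c
/* Centralized sense-reversing barrier with C11 atomics.  Each thread
 * spins on a shared "sense" flag; the last arrival resets the counter
 * and flips the flag.  The contention on one shared counter is exactly
 * the overhead that motivates hardware barrier support. */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int  count;     /* threads still to arrive this episode */
    atomic_bool sense;     /* flips once per barrier episode */
    int         nthreads;
} barrier_t;

void barrier_init(barrier_t *b, int n) {
    atomic_init(&b->count, n);
    atomic_init(&b->sense, false);
    b->nthreads = n;
}

/* local_sense is per-thread state, initialized to false by the caller
 * and passed back in on every call. */
void barrier_wait(barrier_t *b, bool *local_sense) {
    *local_sense = !*local_sense;                 /* this episode's target */
    if (atomic_fetch_sub(&b->count, 1) == 1) {    /* last thread to arrive */
        atomic_store(&b->count, b->nthreads);     /* reset for next episode */
        atomic_store(&b->sense, *local_sense);    /* release the others */
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                                     /* spin until released */
    }
}
```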
Switch-stacks: a scheme for microtasking nested parallel loops
This paper discusses run-time microtasking support for executing nested parallel loops on a shared memory multiprocessor system, and presents a new scheme called switch-stacks for implementing such support. We first discuss current approaches to flat ...
Parallelization of loops with exits on pipelined architectures
Modulo scheduling theory can be applied successfully to overlap Fortran DO loops on pipelined computers issuing multiple operations per cycle both with and without special loop architectural support [1, 2, 3]. This paper shows that a broader class of ...
Computation of large-scale constrained matrix problems: the splitting equilibration algorithm
The constrained matrix problem is a core problem in numerous applications in the social and economic sciences, including the estimation of input-output tables, trade tables, and social/national accounts, the projection of migration flows over space and ...
High performance preconditioning on supercomputers for the 3D device simulator MINIMOS
Discretization and iterative solution of the semiconductor equations in a three-dimensional rectangular region lead to very large sparse linear systems. Nevertheless, design engineers and scientists of device physics need reliable results in short time ...
Techniques for improving the performance of sparse matrix factorization on multiprocessor workstations
In this paper we study the problem of factoring large sparse systems of equations on high-performance multiprocessor workstations. While these multiprocessor workstations are capable of very high peak floating point computation rates, most existing ...
Fault-tolerant routing in MIN-based supercomputers
In this paper we study methods for routing data in supercomputers that use multistage interconnection networks (MINs), in the presence of faulty components in the network. These methods are applicable to existing multiprocessors like IBM GF11 and RP3. ...
Uni-directional hypercubes
Uni-directional hypercubes are hypercube interconnection topologies with simplex uni-directional links. While accommodating a large number of nodes, uni-directional hypercubes require less complicated communication hardware than conventional bi-...
Design and analysis of buffered crossbars and banyans with cut-through switching
The design and approximate analyses of discrete-time buffered crossbars and banyans with cut-through switching are presented. The crossbar switches can contain either (1) input FIFO queueing, (2) input “bypass” queueing where the FIFO discipline is ...
A parallel object-oriented total architecture: A–NET
A-NET is a parallel object-oriented total architecture for highly parallel computation. Starting with a computation model, this paper describes parallel constructs of the designed language, called A-NETL; the A-NETL oriented machine instruction set ...
A parallel computer model supporting procedure-based communication
Procedure-based communication can convert varying communication patterns of parallel computation into a simple data sending and receiving process. This paper describes a general-purpose MIMD parallel architecture that effectively supports the procedure-...
A high-performance, memory-based interconnection system for multicomputer environments
The objective of this paper is to outline the design and operation of a very high-performance, memory-mapped interconnection system, called Merlin. The design can be effectively utilized to interconnect processors in a wide variety of environments, ...
Acceptance Rates
| Year | Submitted | Accepted | Rate |
|---|---|---|---|
| SC '17 | 327 | 61 | 19% |
| SC '16 | 442 | 81 | 18% |
| SC '15 | 358 | 79 | 22% |
| SC '14 | 394 | 83 | 21% |
| SC '13 | 449 | 91 | 20% |
| SC '12 | 461 | 100 | 22% |
| SC '11 | 352 | 74 | 21% |
| SC '10 | 253 | 51 | 20% |
| SC '09 | 261 | 59 | 23% |
| SC '08 | 277 | 59 | 21% |
| SC '07 | 268 | 54 | 20% |
| SC '06 | 239 | 54 | 23% |
| SC '05 | 260 | 62 | 24% |
| SC '04 | 200 | 60 | 30% |
| SC '03 | 207 | 60 | 29% |
| SC '02 | 230 | 67 | 29% |
| SC '01 | 240 | 60 | 25% |
| SC '00 | 179 | 62 | 35% |
| Supercomputing '95 | 241 | 69 | 29% |
| Supercomputing '93 | 300 | 72 | 24% |
| Supercomputing '92 | 220 | 75 | 34% |
| Supercomputing '91 | 215 | 83 | 39% |
| Overall | 6,373 | 1,516 | 24% |


