Wen-mei Hwu

ACM Fellow
Bibliometrics: publication history
Publication years: 1985–2012
Publication count: 109
Citation count: 2,465
Average citations per article: 22.61
Available for download: 69
Downloads (cumulative): 45,007
Average downloads per article: 652.28
Downloads (12 months): 1,439
Downloads (6 weeks): 171





109 results found

Result 1 – 20 of 109


1
August 2012 Computer: Volume 45 Issue 8, August 2012
Publisher: IEEE Computer Society Press
Bibliometrics:
Citation Count: 3

A study of the implementation patterns among massively threaded applications for many-core GPUs reveals that each of the seven most commonly used algorithm and data optimization techniques can enhance the performance of applicable kernels by 2 to 10× in current processors while also improving future scalability. The featured Web extra ...

2 published by ACM
February 2012 PPoPP '12: Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Publisher: ACM
Bibliometrics:
Citation Count: 10
Downloads (6 Weeks): 3,   Downloads (12 Months): 43,   Downloads (Overall): 1,019

Full text available: PDF
With the emergence of highly multithreaded architectures, performance monitoring techniques face new challenges in efficiently locating sources of performance discrepancies in the program source code. For example, the state-of-the-art performance counters in highly multithreaded graphics processing units (GPUs) report only the overall occurrences of microarchitecture events at the end of ...
Keywords: memory hierarchy, gpu, performance evaluation
Also published in:
August 2012  ACM SIGPLAN Notices - PPoPP '12: Volume 47 Issue 8, August 2012

3
March 2011 Computing in Science and Engineering: Volume 13 Issue 2, March 2011
Publisher: IEEE Educational Activities Department
Bibliometrics:
Citation Count: 3

Researchers built the EcoG GPU-based cluster to show that a system can be designed around GPU computing and still be power efficient.
Keywords: scientific computing, CUDA, GPUs, graphics processing, Nvidia

4 published by ACM
September 2010 PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Publisher: ACM
Bibliometrics:
Citation Count: 25
Downloads (6 Weeks): 8,   Downloads (12 Months): 43,   Downloads (Overall): 766

Full text available: PDF
We present automatic data layout transformation as an effective compiler performance optimization for memory-bound structured grid applications. Structured grid applications include stencil codes and other code structures using a dense, regular grid as the primary data structure. Fluid dynamics and heat distribution, which both solve partial differential equations on a ...
Keywords: GPU, data layout transformation, parallel programming
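As an illustration of the kind of transformation this abstract describes, the sketch below permutes a structured-grid buffer from array-of-structures (AoS) to structure-of-arrays (SoA) order, so that consecutive threads touch contiguous memory. This is a hypothetical minimal example, not the paper's compiler implementation; all names are illustrative.

```python
def aos_index(cell, field, num_fields):
    """Offset of `field` of grid cell `cell` in an AoS layout."""
    return cell * num_fields + field

def soa_index(cell, field, num_cells):
    """Offset of the same element after transformation to SoA."""
    return field * num_cells + cell

def aos_to_soa(flat, num_cells, num_fields):
    """Permute a flat AoS buffer into SoA order."""
    out = [0] * len(flat)
    for c in range(num_cells):
        for f in range(num_fields):
            out[soa_index(c, f, num_cells)] = flat[aos_index(c, f, num_fields)]
    return out
```

For a 3-cell grid with 2 fields per cell, `aos_to_soa([1, 2, 3, 4, 5, 6], 3, 2)` groups each field's values together, the access pattern that favors coalesced GPU memory loads.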

5
June 2010 ISCA'10: Proceedings of the 2010 international conference on Computer Architecture
Publisher: Springer-Verlag
Bibliometrics:
Citation Count: 0

Parallel codes are written primarily for the purpose of performance. It is highly desirable that parallel codes be portable between parallel architectures without significant performance degradation or code rewrites. While performance portability and its limits have been studied thoroughly on single processor systems, this goal has been less extensively studied ...

6 published by ACM
April 2010 CGO '10: Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
Publisher: ACM
Bibliometrics:
Citation Count: 21
Downloads (6 Weeks): 1,   Downloads (12 Months): 26,   Downloads (Overall): 653

Full text available: PDF
In this paper we describe techniques for compiling fine-grained SPMD-threaded programs, expressed in programming models such as OpenCL or CUDA, to multicore execution platforms. Programs developed for manycore processors typically express finer thread-level parallelism than is appropriate for multicore platforms. We describe options for implementing fine-grained threading in software, and ...
Keywords: CPU, SPMD, multicore, CUDA
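The core idea this abstract points at, executing fine-grained SPMD threads on a multicore CPU, can be sketched as "thread-loop serialization": one CPU thread runs an entire logical thread block by looping over thread indices. This is a toy illustration under stated assumptions (no barrier inside the looped region), not the compiler technique as published.

```python
def run_block_serialized(kernel_body, block_dim, shared):
    """Execute kernel_body once per logical thread id, in order.
    Only valid for kernel regions containing no barrier; a real
    compiler splits the kernel into such regions at each barrier."""
    for tid in range(block_dim):
        kernel_body(tid, shared)

def example_kernel(tid, shared):
    """Trivial SPMD body: each logical thread writes tid * 2."""
    shared[tid] = tid * 2
```

For example, `run_block_serialized(example_kernel, 4, buf)` fills a 4-element buffer exactly as 4 concurrent GPU threads would, at the cost of serial execution within the block.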

7 published by ACM
March 2010 ACM SIGARCH Computer Architecture News - ASPLOS '10: Volume 38 Issue 1, March 2010
Publisher: ACM
Bibliometrics:
Citation Count: 55
Downloads (6 Weeks): 9,   Downloads (12 Months): 97,   Downloads (Overall): 2,326

Full text available: PDF
Heterogeneous computing combines general purpose CPUs with accelerators to efficiently execute both sequential control-intensive and data-parallel phases of applications. Existing programming models for heterogeneous computing rely on programmers to explicitly manage data transfers between the CPU system memory and accelerator memory. This paper presents a new programming model for heterogeneous ...
Keywords: data-centric programming models, heterogeneous systems, asymmetric distributed shared memory
Also published in:
March 2010  ACM SIGPLAN Notices - ASPLOS '10: Volume 45 Issue 3, March 2010
March 2010  ASPLOS XV: Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems

8 published by ACM
January 2010 PPoPP '10: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Publisher: ACM
Bibliometrics:
Citation Count: 72
Downloads (6 Weeks): 13,   Downloads (12 Months): 107,   Downloads (Overall): 2,159

Full text available: PDF
This paper presents an analytical model to predict the performance of general-purpose applications on a GPU architecture. The model is designed to provide performance information to an auto-tuning compiler and assist it in narrowing down the search to the more promising implementations. It can also be incorporated into a tool ...
Keywords: gpu, analytical model, parallel programming, performance estimation
Also published in:
May 2010  ACM SIGPLAN Notices - PPoPP '10: Volume 45 Issue 5, May 2010
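In the spirit of the analytical model this entry describes, the toy estimator below predicts kernel cycles from compute work, memory requests, and the number of warps available to hide memory latency. The paper's actual MWP/CWP formulation is considerably more detailed; every parameter and formula here is an illustrative assumption.

```python
def estimate_cycles(compute_cycles, mem_requests, mem_latency, warps_per_sm):
    """Toy cycle estimate: memory latency is divided by the number of
    warps that can overlap it (bounded by warps_per_sm and by how much
    compute work exists to overlap with)."""
    hidden = max(1.0, min(warps_per_sm, mem_latency / max(compute_cycles, 1)))
    effective_latency = mem_latency / hidden
    return compute_cycles + mem_requests * effective_latency
```

Such a closed-form estimate is what lets an auto-tuning compiler rank candidate implementations cheaply instead of benchmarking each one.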

9
June 2009 ISBI '09: Proceedings of the Sixth IEEE International Symposium on Biomedical Imaging: From Nano to Macro
Publisher: IEEE Press
Bibliometrics:
Citation Count: 1

With the explosive development of advanced image reconstruction algorithms, there is an urgent need for acceleration of these algorithms to facilitate their use in practical applications. This paper describes our experience using graphics processing units (GPUs) for advanced MR image reconstruction from non-Cartesian data. We show that implementation of MR ...

10 published by ACM
June 2009 ICS '09: Proceedings of the 23rd international conference on Supercomputing
Publisher: ACM
Bibliometrics:
Citation Count: 1
Downloads (6 Weeks): 1,   Downloads (12 Months): 15,   Downloads (Overall): 746

Full text available: PDF
In this work, we propose a new FPGA design flow that combines the CUDA programming model from Nvidia with the state of the art high-level synthesis tool AutoPilot from AutoESL, to efficiently map the exposed parallelism in CUDA kernels onto reconfigurable devices. The use of the CUDA programming model offers ...
Keywords: coarse grained parallelism, cuda programming model, fpga, gpu, high level synthesis, high performance computing

11
June 2009 Microprocessors & Microsystems: Volume 33 Issue 4, June, 2009
Publisher: Elsevier Science Publishers B. V.
Bibliometrics:
Citation Count: 0

To efficiently accommodate standards changes and algorithmic improvements, functional reconfigurability is increasingly desired for media processing. Such adaptability, however, generally comes at significant power cost. This work suggests that another dimension of adaptation can be beneficial - power adaptation. Through a unique compiler-hardware approach, we (1) demonstrate an extension to the ...
Keywords: Compilers, Data management, Adaptation, Architecture, Power

12
May 2009 IPDPS '09: Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing
Publisher: IEEE Computer Society
Bibliometrics:
Citation Count: 1

Modern GPUs such as the NVIDIA GeForce GTX280, ATI Radeon 4860, and the upcoming Intel Larrabee are massively parallel, many-core processors. Today, application developers for these many-core chips are reporting 10X–100X speedup over sequential code on traditional microprocessors. According to the semiconductor industry roadmap, these processors could scale up to ...

13
May 2009 IPDPS '09: Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing
Publisher: IEEE Computer Society
Bibliometrics:
Citation Count: 5

To address the problem of performing long time simulations of biochemical pathways under in vivo cellular conditions, we have developed a lattice-based, reaction-diffusion model that uses the graphics processing unit (GPU) as a computational co-processor. The method has been specifically designed from the beginning to take advantage of the GPU's ...

14 published by ACM
March 2009 GPGPU-2: Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units
Publisher: ACM
Bibliometrics:
Citation Count: 0
Downloads (6 Weeks): 1,   Downloads (12 Months): 2,   Downloads (Overall): 182

Full text available: PDF
As computational power increases, tele-immersive applications are an emerging trend. These applications make extensive demands on computational resources through their heavy use of real-time 3D reconstruction algorithms. Since computer vision developers do not necessarily have parallel programming expertise, it is important to give them the tools and capabilities to naturally ...
Keywords: tele-immersion codes, program optimization

15 published by ACM
March 2009 GPGPU-2: Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units
Publisher: ACM
Bibliometrics:
Citation Count: 9
Downloads (6 Weeks): 6,   Downloads (12 Months): 25,   Downloads (Overall): 660

Full text available: PDF
The visualization of molecular orbitals (MOs) is important for analyzing the results of quantum chemistry simulations. The functions describing the MOs are computed on a three-dimensional lattice, and the resulting data can then be used for plotting isocontours or isosurfaces for visualization as well as for other types of analyses. ...
Keywords: molecular orbital, GPGPU, CUDA, GPU computing

16
November 2008 Languages and Compilers for Parallel Computing: 21st International Workshop, LCPC 2008, Edmonton, Canada, July 31 - August 2, 2008, Revised Selected Papers
Publisher: Springer-Verlag
Bibliometrics:
Citation Count: 66

The computer industry has transitioned into multi-core and many-core parallel systems. The CUDA programming environment from NVIDIA is an attempt to make programming many-core GPUs more accessible to programmers. However, there are still many burdens placed upon the programmer to maximize performance when using CUDA. One such burden is dealing ...

17
November 2008 Languages and Compilers for Parallel Computing: 21st International Workshop, LCPC 2008, Edmonton, Canada, July 31 - August 2, 2008, Revised Selected Papers
Publisher: Springer-Verlag
Bibliometrics:
Citation Count: 58

CUDA is a data parallel programming model that supports several key abstractions - thread blocks, hierarchical memory and barrier synchronization - for writing applications. This model has proven effective in programming GPUs. In this paper we describe a framework called MCUDA, which allows CUDA programs to be executed efficiently on ...

18
October 2008 Journal of Parallel and Distributed Computing: Volume 68 Issue 10, October, 2008
Publisher: Academic Press, Inc.
Bibliometrics:
Citation Count: 34

Contemporary many-core processors such as the GeForce 8800 GTX enable application developers to utilize various levels of parallelism to enhance the performance of their applications. However, iterative optimization for such a system may lead to a local performance maximum, due to the complexity of the system. We propose program optimization ...
Keywords: GPU computing, Optimization space exploration, Parallel computing

19
July 2008 IEEE Micro: Volume 28 Issue 4, July 2008
Publisher: IEEE Computer Society Press
Bibliometrics:
Citation Count: 3

The one invited paper and four selected papers in this special issue provide a good sampling of advances in the broad space of accelerator applications and architectures.

20 published by ACM
June 2008 ICS '08: Proceedings of the 22nd annual international conference on Supercomputing
Publisher: ACM
Bibliometrics:
Citation Count: 11
Downloads (6 Weeks): 2,   Downloads (12 Months): 12,   Downloads (Overall): 607

Full text available: PDF
Data-parallel co-processors have the potential to improve performance in highly parallel regions of code when coupled to a general-purpose CPU. However, applications often have to be modified in non-intuitive and complicated ways to mitigate the cost of data marshalling between the CPU and the co-processor. In some applications the overheads ...
Keywords: co-processors



The ACM Digital Library is published by the Association for Computing Machinery. Copyright © 2018 ACM, Inc.
Terms of Usage   Privacy Policy   Code of Ethics   Contact Us