 Ninghui Sun

Bibliometrics: publication history
Average citations per article: 3.29
Citation Count: 207
Publication count: 63
Publication years: 2004-2017
Available for download: 21
Average downloads per article: 734.19
Downloads (cumulative): 15,418
Downloads (12 Months): 4,142
Downloads (6 Weeks): 390
63 results found

Result 1 – 20 of 63

1 published by ACM
June 2017 ICS '17: Proceedings of the International Conference on Supercomputing
Publisher: ACM
Bibliometrics:
Citation Count: 0
Downloads (6 Weeks): 31,   Downloads (12 Months): 245,   Downloads (Overall): 245

Full text available: PDF
GPUs are widely used in accelerating deep neural networks (DNNs) for their high bandwidth and parallelism. But tuning the performance of DNN computations is challenging, as it requires a thorough understanding of both underlying architectures and algorithm implementations. Traditional research, which focused on analyzing performance by CUDA C language or ...
Keywords: GPUs, GEMM, convolution, performance model

2 published by ACM
October 2016 Communications of the ACM: Volume 59 Issue 11, November 2016
Publisher: ACM
Bibliometrics:
Citation Count: 3
Downloads (6 Weeks): 110,   Downloads (12 Months): 1,375,   Downloads (Overall): 2,018

Full text available: HTML, PDF
Machine Learning (ML) tasks are becoming pervasive in a broad range of applications, and in a broad range of systems (from embedded systems to data centers). As computer architectures evolve toward heterogeneous multi-cores composed of a mix of cores and hardware accelerators, designing hardware accelerators for ML techniques can simultaneously ...

3 published by ACM
June 2015 ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing
Publisher: ACM
Bibliometrics:
Citation Count: 4
Downloads (6 Weeks): 3,   Downloads (12 Months): 63,   Downloads (Overall): 333

Full text available: PDF
Stencil computations comprise an important class of kernels in many scientific computing applications. As the diversity of both architectures and programming models grows, autotuning is emerging as a critical strategy for achieving portable performance across a broad range of execution contexts for stencil computations. However, costly tuning overhead is a ...
Keywords: oss, autotuning, stencil

4 published by ACM
May 2015 ACM Transactions on Computer Systems (TOCS): Volume 33 Issue 2, June 2015
Publisher: ACM
Bibliometrics:
Citation Count: 0
Downloads (6 Weeks): 32,   Downloads (12 Months): 328,   Downloads (Overall): 1,018

Full text available: PDF
Machine-learning tasks are becoming pervasive in a broad range of domains, and in a broad range of systems (from embedded systems to data centers). At the same time, a small set of machine-learning algorithms (especially Convolutional and Deep Neural Networks, i.e., CNNs and DNNs) are proving to be state-of-the-art across ...
Keywords: Hardware accelerator, convolutional neural network, deep neural network, deep learning

5 published by ACM
March 2015 ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems
Publisher: ACM
Bibliometrics:
Citation Count: 0
Downloads (6 Weeks): 5,   Downloads (12 Months): 106,   Downloads (Overall): 605

Full text available: PDF
This paper presents PARD, a programmable architecture for resourcing-on-demand that provides a new programming interface to convey an application's high-level information like quality-of-service requirements to the hardware. PARD enables new functionalities like fully hardware-supported virtualization and differentiated services in computers. PARD is inspired by the observation that a computer is ...
Keywords: QoS, data center, hardware/software interface
Also published in:
May 2015 ACM SIGPLAN Notices - ASPLOS '15: Volume 50 Issue 4, April 2015
May 2015 ACM SIGARCH Computer Architecture News - ASPLOS '15: Volume 43 Issue 1, March 2015

6
September 2014 The Journal of Supercomputing: Volume 69 Issue 3, September 2014
Publisher: Kluwer Academic Publishers
Bibliometrics:
Citation Count: 0

Traversal is a fundamental procedure in most parallel graph algorithms. To explore the massive fine-grained parallelism in graph traversal, the fine-grained data synchronization is critical. On commodity multi-core processors, the widely adopted solution is fine-grained locks (i.e., one lock per vertex). However, in emerging graph analytics of massive irregular graphs ...
Keywords: Parallel programming, Data synchronization, Fine-grained parallelism, Graph traversal

7 published by ACM
June 2014 ICS '14: Proceedings of the 28th ACM international conference on Supercomputing
Publisher: ACM
Bibliometrics:
Citation Count: 3
Downloads (6 Weeks): 2,   Downloads (12 Months): 33,   Downloads (Overall): 319

Full text available: PDF
Phase change memory (PCM) is promising to become an alternative main memory thanks to its better scalability and lower leakage than DRAM. However, the long write latency of PCM puts it at a severe disadvantage against DRAM. In this paper, we propose a Dynamic Write Consolidation (DWC) scheme to improve ...
Keywords: phase change memory, write consolidation, performance optimization

8 published by ACM
February 2014 ASPLOS '14: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Publisher: ACM
Bibliometrics:
Citation Count: 95
Downloads (6 Weeks): 189,   Downloads (12 Months): 1,675,   Downloads (Overall): 4,694

Full text available: PDF
Machine-Learning tasks are becoming pervasive in a broad range of domains, and in a broad range of systems (from embedded systems to data centers). At the same time, a small set of machine-learning algorithms (especially Convolutional and Deep Neural Networks, i.e., CNNs and DNNs) are proving to be state-of-the-art across ...
Keywords: accelerator, memory, neural networks
Also published in:
April 2014 ACM SIGPLAN Notices - ASPLOS '14: Volume 49 Issue 4, April 2014
April 2014 ACM SIGARCH Computer Architecture News - ASPLOS '14: Volume 42 Issue 1, March 2014

9 published by ACM
February 2014 ACM Transactions on Architecture and Code Optimization (TACO): Volume 11 Issue 1, February 2014
Publisher: ACM
Bibliometrics:
Citation Count: 0
Downloads (6 Weeks): 4,   Downloads (12 Months): 22,   Downloads (Overall): 225

Full text available: PDF
DRAM access traces (i.e., off-chip memory references) can be extremely valuable for the design of memory subsystems and performance tuning of software. Hardware snooping on the off-chip memory interface is an effective and nonintrusive approach to monitoring and collecting real-life DRAM accesses. However, compared with software-based approaches, hardware snooping approaches ...
Keywords: DRAM access trace, high-level event, lock, semantic gap, Hybrid tracing mechanism, object, function

10
September 2013 ISLPED '13: Proceedings of the 2013 International Symposium on Low Power Electronics and Design
Publisher: IEEE Press
Bibliometrics:
Citation Count: 1
Downloads (6 Weeks): 0,   Downloads (12 Months): 6,   Downloads (Overall): 32

Full text available: PDF
Simulation is an important method to evaluate future computer systems. However, the increasing complexity of the target systems has made the development of simulators very difficult. Furthermore, detailed simulation of large-scale parallel architecture is so slow that full evaluation of real application becomes a great challenge. This paper presents SimICT, ...
Keywords: framework, power evaluation, parallel simulation

11 published by ACM
June 2013 PLDI '13: Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation
Publisher: ACM
Bibliometrics:
Citation Count: 8
Downloads (6 Weeks): 3,   Downloads (12 Months): 89,   Downloads (Overall): 487

Full text available: PDF
Sparse Matrix Vector multiplication (SpMV) is an important kernel in both traditional high performance computing and emerging data-intensive applications. So far, SpMV libraries have been optimized by either application-specific or architecture-specific approaches, making the libraries too complicated to be used extensively in real applications. In this work we develop a ...
Keywords: data mining, algebraic multi-grid, sparse matrix-vector multiplication, SpMV, auto-tuning
Also published in:
June 2013  ACM SIGPLAN Notices - PLDI '13: Volume 48 Issue 6, June 2013

12
February 2013 CGO '13: Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
Publisher: IEEE Computer Society
Bibliometrics:
Citation Count: 2
Downloads (6 Weeks): 0,   Downloads (12 Months): 17,   Downloads (Overall): 32

Full text available: PDF
For graph traversal applications, fine synchronization is required to exploit massive fine parallelism. However, in the conventional solution using fine-grained locks, the locks themselves suffer huge memory cost as well as poor locality due to inherently irregular access to vertices. In this paper, we propose a novel fine lock solution, vLock. The key ...
Keywords: vLock, Graph Algorithms, Fine Synchronization

13
August 2012 Euro-Par'12: Proceedings of the 18th international conference on Parallel Processing
Publisher: Springer-Verlag
Bibliometrics:
Citation Count: 0

This paper addresses the workload partition strategies in the simulation of manycore architectures. The key observation behind this paper is that, compared to traditional multicores, manycores feature more non-uniform memory access and unpredictable network traffic; these features degrade the simulation speed and accuracy of Parallel Discrete Event Simulators (PDES) when one ...
Keywords: manycore, multicore, parallel simulation, workload partition

14 published by ACM
June 2012 ICS '12: Proceedings of the 26th ACM international conference on Supercomputing
Publisher: ACM
Bibliometrics:
Citation Count: 1
Downloads (6 Weeks): 2,   Downloads (12 Months): 15,   Downloads (Overall): 230

Full text available: PDF
In heterogeneous systems that include CPUs and GPUs, the data transfers between these components play a critical role in determining the performance of applications. Software pipelining is a common approach to mitigate the overheads of those transfers. In this paper we investigate advanced software-pipelining optimizations for the double-precision general matrix ...
Keywords: dgemm, gpu, heterogeneous architecture, high performance computing

15
May 2012 IPDPSW '12: Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum
Publisher: IEEE Computer Society
Bibliometrics:
Citation Count: 3

For the first time, this paper systematically identifies three categories of throughput oriented workloads in data centers: services, data processing applications, and interactive real-time applications, whose targets are to increase the volume of throughput in terms of processed requests or data, or supported maximum number of simultaneous subscribers, respectively, and ...
Keywords: High volume throughput computing, Throughput oriented workloads, Data center systems, Metrics, Benchmarks

16
May 2012 IPDPSW '12: Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum
Publisher: IEEE Computer Society
Bibliometrics:
Citation Count: 0

Next Generation Sequencing (NGS) is gaining interest due to increased requirements and decreased sequencing cost. The important and prerequisite step of most NGS applications is the mapping of short sequences, called reads, to the template reference sequences. Both the explosion of NGS data with over billions of reads ...
Keywords: memory optimization, next-generation sequencing

17
May 2012 IPDPS '12: Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium
Publisher: IEEE Computer Society
Bibliometrics:
Citation Count: 2

Fault tolerance overhead of high performance computing (HPC) applications is becoming critical to the efficient utilization of HPC systems at large scale. Today's HPC applications typically tolerate fail-stop failures by checkpointing. However, checkpointing will lose its efficiency when the system becomes very large. An alternative method is algorithm-based fault ...
Keywords: Exascale, Algorithm-Based Fault Tolerance, High Performance Linpack

18
April 2012 FCCM '12: Proceedings of the 2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines
Publisher: IEEE Computer Society
Bibliometrics:
Citation Count: 7

The explosion of Next Generation Sequencing (NGS) data with over one billion reads per day poses a great challenge to the capability of current computing systems. In this paper, we proposed a CPU-FPGA heterogeneous architecture for accelerating a short reads mapping algorithm, which was built upon the concept of hash-index. ...
Keywords: short reads mapping, FPGA, hash, accelerator

19
March 2012 IEEE Micro: Volume 32 Issue 2, March 2012
Publisher: IEEE Computer Society Press
Bibliometrics:
Citation Count: 1

Godson-T is a research many-core processor designed for parallel scientific computing that delivers efficient performance and flexible programmability simultaneously. It also has many features to achieve high efficiency for on-chip resource utilization, such as a region-based cache coherence protocol, data transfer agents, and hardware-supported synchronization mechanisms. Finally, it also features ...
Keywords: many-core processor, parallel computing, microarchitecture, Godson-T, Pthreads

20 published by ACM
February 2012 FPGA '12: Proceedings of the ACM/SIGDA international symposium on Field Programmable Gate Arrays
Publisher: ACM
Bibliometrics:
Citation Count: 2
Downloads (6 Weeks): 0,   Downloads (12 Months): 12,   Downloads (Overall): 209

Full text available: PDF
The wide acceptance of bioinformatics, medical imaging, and multimedia applications, which have a data-centric flavor to them, requires more efficient and application-specific systems to be built. Due to recent advances in modern FPGA technologies, there has been a resurgence in research aimed at accelerator design that leverages FPGAs to ...
Keywords: stream processing, FFT, FPGA, cryo-electron microscopy, memory access patterns



The ACM Digital Library is published by the Association for Computing Machinery. Copyright © 2018 ACM, Inc.