1 published by ACM
April 2019 ICPE '19: Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering
Publisher: ACM
Citation Count: 0
Downloads (6 Weeks): 15,   Downloads (12 Months): 94,   Downloads (Overall): 94

Full text available: PDF
Detection of software bottlenecks that hinder the utilization of hardware resources is a classic but complex problem due to the layered structure of such bottlenecks. However, model-based approaches require a given performance model, which is impractical to maintain in today's agile development environments, and profile-based approaches do not handle the layered ...
Keywords: layered bottlenecks, thread dependency graph, wake-up profile

2 published by ACM
October 2017 MEMSYS '17: Proceedings of the International Symposium on Memory Systems
Publisher: ACM
Citation Count: 0
Downloads (6 Weeks): 4,   Downloads (12 Months): 52,   Downloads (Overall): 175

Full text available: PDF
While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to stay at the forefront of big data analytics as a unified framework for both batch and stream data processing. There is also a renewed interest in Near Data Processing (NDP) due ...
Keywords: apache spark, in-storage processing, processing in memory

3 published by ACM
February 2016 CGO '16: Proceedings of the 2016 International Symposium on Code Generation and Optimization
Publisher: ACM
Citation Count: 0
Downloads (6 Weeks): 3,   Downloads (12 Months): 27,   Downloads (Overall): 231

Full text available: PDF
In this paper, we show that a binary optimizer can achieve competitive performance relative to a state-of-the-art source code compiler by reconstructing high-level information (HLI) from binaries. Recent advances in compiler technologies have resulted in a large performance gap between binaries compiled with old compilers and those compiled with the latest ones. ...
Keywords: binary optimizer, compiler, performance

November 2014 Proceedings of the VLDB Endowment: Volume 8 Issue 3, November 2014
Publisher: VLDB Endowment
Citation Count: 15
Downloads (6 Weeks): 8,   Downloads (12 Months): 33,   Downloads (Overall): 112

Full text available: PDF
Set intersection is one of the most important operations for many applications such as Web search engines or database management systems. This paper describes our new algorithm to efficiently find set intersections with sorted arrays on modern processors with SIMD instructions and high branch misprediction penalties. Our algorithm efficiently exploits ...

5 published by ACM
May 2013 CF '13: Proceedings of the ACM International Conference on Computing Frontiers
Publisher: ACM
Citation Count: 0
Downloads (6 Weeks): 2,   Downloads (12 Months): 7,   Downloads (Overall): 130

Full text available: PDF
OpenCL is an open standard for heterogeneous parallel programming, exploiting multi-core CPUs, GPUs, or other accelerators as parallel computing resources. Recent work has extended the OpenCL parallel programming model for distributed heterogeneous clusters. For such loosely coupled acceleration architectures, the design of OpenCL programs to maximize performance is quite different ...
Keywords: data chunking, pipelining, OpenCL, heterogeneous cluster, linear regression, parallel execution, two-step cluster

6 published by ACM
June 2012 SYSTOR '12: Proceedings of the 5th Annual International Systems and Storage Conference
Publisher: ACM
Citation Count: 4
Downloads (6 Weeks): 0,   Downloads (12 Months): 6,   Downloads (Overall): 190

Full text available: PDF
A dynamic binary translator (DBT) is a runtime system that translates binary code on the fly, for example to emulate the execution of the binary code on a processor with a different instruction set. One of the major sources of overhead is the resolution of the branch target addresses ...
Keywords: dynamic binary translation, indirect branch

January 2006 IBM Systems Journal: Volume 45 Issue 1, January 2006
Publisher: IBM Corp.
Citation Count: 34

The Cell Broadband Engine™ processor employs multiple accelerators, called synergistic processing elements (SPEs), for high performance. Each SPE has a high-speed local store attached to the main memory through direct memory access (DMA), but a drawback of this design is that the local store is not large enough for ...

June 1996
Citation Count: 1

Due to large remote-memory latencies, reducing the impact of cache misses is critical for large scale shared-memory multiprocessors. This thesis quantitatively compares two classes of software-controlled prefetch schemes for reducing the impact: consumer-oriented and producer-oriented schemes. Examining the behavior of these schemes leads us to characterize the communication behavior of ...


10 published by ACM
May 1995 ISCA '95: Proceedings of the 22nd annual international symposium on Computer architecture
Publisher: ACM
Citation Count: 1,293
Downloads (6 Weeks): 13,   Downloads (12 Months): 316,   Downloads (Overall): 3,927

Full text available: PDF
The SPLASH-2 suite of parallel applications has recently been released to facilitate the study of centralized and distributed shared-address-space multiprocessors. In this context, this paper has two goals. One is to quantitatively characterize the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understand them ...
Also published in:
May 1995 ACM SIGARCH Computer Architecture News - Special Issue: Proceedings of the 22nd annual international symposium on Computer architecture (ISCA '95): Volume 23 Issue 2, May 1995

