 Thomas Hérault

Bibliometrics: publication history
Average citations per article: 8.16
Citation Count: 571
Publication count: 70
Publication years: 2001-2017
Available for download: 16
Average downloads per article: 236.81
Downloads (cumulative): 3,789
Downloads (12 Months): 263
Downloads (6 Weeks): 39
71 results found

Result 1 – 20 of 71
1
January 2018 International Journal of High Performance Computing Applications: Volume 32 Issue 1, January 2018
Publisher: Sage Publications, Inc.
Bibliometrics:
Citation Count: 0

Building an infrastructure for exascale applications requires, in addition to many other key components, a stable and efficient failure detector. This article describes the design and evaluation of a robust failure detector that can maintain and distribute the correct list of alive resources within proven and scalable bounds. The detection ...
Keywords: MPI, failure detection, fault tolerance

2
November 2017 ScalA '17: Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
Publisher: ACM
Bibliometrics:
Citation Count: 0
Downloads (6 Weeks): 11,   Downloads (12 Months): 50,   Downloads (Overall): 50

Full text available: PDF
Successfully exploiting distributed collections of heterogeneous many-cores architectures with complex memory hierarchy through a portable programming model is a challenge for application developers. The literature is not short of proposals addressing this problem, including many evolutionary solutions that seek to extend the capabilities of current message passing paradigms with intra-node ...
Keywords: data-flow, dynamic task-graph, PaRSEC, task-based runtime

3
November 2016 SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: IEEE Press
Bibliometrics:
Citation Count: 0
Downloads (6 Weeks): 4,   Downloads (12 Months): 56,   Downloads (Overall): 187

Full text available: PDF
Building an infrastructure for Exascale applications requires, in addition to many other key components, a stable and efficient failure detector. This paper describes the design and evaluation of a robust failure detector, able to maintain and distribute the correct list of alive resources within proven and scalable bounds. The detection ...
Keywords: fault-tolerance, MPI, failure detection

4
February 2016 Parallel Computing: Volume 52 Issue C, February 2016
Publisher: Elsevier Science Publishers B. V.
Bibliometrics:
Citation Count: 0

Algorithms for finding the optimal distribution compatible with a given data partition. Analysis of the algorithms for different cost metrics. NP-completeness proof for the redistribution problem followed by a computational kernel. Experimental results for the 1D-stencil kernel and the QR factorization algorithm. The classical redistribution problem aims at optimally scheduling communications when reshuffling ...
Keywords: Parsec, Redistribution, Stencil, QR factorization, Linear algebra, Data partition

5
November 2015 SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: ACM
Bibliometrics:
Citation Count: 2
Downloads (6 Weeks): 2,   Downloads (12 Months): 18,   Downloads (Overall): 159

Full text available: PDF
The ability to consistently handle faults in a distributed environment requires, among a small set of basic routines, an agreement algorithm allowing surviving entities to reach a consensual decision between a bounded set of volatile resources. This paper presents an algorithm that implements an Early Returning Agreement (ERA) in pseudo-synchronous ...
Keywords: MPI, agreement, fault-tolerance

6
September 2015 EuroMPI '15: Proceedings of the 22nd European MPI Users' Group Meeting
Publisher: ACM
Bibliometrics:
Citation Count: 0
Downloads (6 Weeks): 2,   Downloads (12 Months): 16,   Downloads (Overall): 143

Full text available: PDF
This paper considers the questions of how spare nodes should be allocated, how to substitute them for faulty nodes, and how much the communication performance is affected by such a substitution. The third question stems from the modification of the rank mapping by node substitutions, which can incur additional message ...
Keywords: communication performance, spare node, fault mitigation, fault tolerance

7
August 2015 OpenSHMEM 2015: Revised Selected Papers of the Second Workshop on OpenSHMEM and Related Technologies. Experiences, Implementations, and Technologies - Volume 9397
Publisher: Springer-Verlag New York, Inc.
Bibliometrics:
Citation Count: 0

This work details the opportunities and challenges of porting a Petascale, MPI-based application --LAMMPS-- to OpenSHMEM. We investigate the major programming challenges stemming from the differences in communication semantics, address space organization, and synchronization operations between the two programming models. This work provides several approaches to solve those challenges for ...

8
July 2015
Bibliometrics:
Citation Count: 9

This timely text presents a comprehensive overview of fault tolerance techniques for high-performance computing (HPC). The text opens with a detailed introduction to the concepts of checkpoint protocols and scheduling algorithms, prediction, replication, silent error detection and correction, together with some application-specific techniques such as ABFT. Emphasis is placed on ...

9
May 2015 IPDPS '15: Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium
Publisher: IEEE Computer Society
Bibliometrics:
Citation Count: 1

As the scale of modern computing systems grows, failures will happen more frequently. On the way to Exascale, a generic, low-overhead, resilient extension becomes a desired aptitude of any programming paradigm. In this paper we explore three additions to a dynamic task-based runtime to build a generic framework providing soft ...
Keywords: soft error resilience, runtime, fault tolerance

10
February 2015 ACM Transactions on Parallel Computing - Special Issue on PPOPP 2012: Volume 1 Issue 2, January 2015
Publisher: ACM
Bibliometrics:
Citation Count: 1
Downloads (6 Weeks): 3,   Downloads (12 Months): 31,   Downloads (Overall): 244

Full text available: PDF
Dense matrix factorizations, such as LU, Cholesky and QR, are widely used for scientific applications that require solving systems of linear equations, eigenvalues and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). ...
Keywords: ABFT, high performance computing, fault-tolerance, linear algebra

11
December 2014 Concurrency and Computation: Practice & Experience: Volume 26 Issue 17, December 2014
Publisher: John Wiley and Sons Ltd.
Bibliometrics:
Citation Count: 2

In this paper, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies with message logging. We identify a set of crucial parameters, instantiate them, and ...
Keywords: checkpoint/restart, hierarchical checkpoint with message logging, checkpointing waste optimization problem, coordinated checkpoint

12
November 2014 WOLFHPC '14: Proceedings of the 2014 Fourth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing
Publisher: IEEE Computer Society
Bibliometrics:
Citation Count: 0

Increased parallelism and use of heterogeneous computing resources is now an established trend in High Performance Computing (HPC), a trend that, looking forward to Exascale, seems bound to intensify. Despite the evolution of hardware over the past decade, the programming paradigm of choice was invariably derived from Coarse Grain Parallelism ...

13
November 2014 WOLFHPC '14: Proceedings of the Fourth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing
Publisher: IEEE Press
Bibliometrics:
Citation Count: 2
Downloads (6 Weeks): 1,   Downloads (12 Months): 12,   Downloads (Overall): 53

Full text available: PDF
Increased parallelism and use of heterogeneous computing resources is now an established trend in High Performance Computing (HPC), a trend that, looking forward to Exascale, seems bound to intensify. Despite the evolution of hardware over the past decade, the programming paradigm of choice was invariably derived from Coarse Grain Parallelism ...

14
October 2014 PGAS '14: Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models
Publisher: ACM
Bibliometrics:
Citation Count: 1
Downloads (6 Weeks): 1,   Downloads (12 Months): 1,   Downloads (Overall): 25

Full text available: PDF
OpenSHMEM scalability is strongly dependent on the capability of its communication layer to efficiently handle multiple threads. In this paper, we present an early evaluation of the thread safety specification in the Unified Common Communication Substrate (UCCS) employed in OpenSHMEM. Results demonstrate that thread safety can be provided at an ...

15
June 2014 ISPDC '14: Proceedings of the 2014 IEEE 13th International Symposium on Parallel and Distributed Computing
Publisher: IEEE Computer Society
Bibliometrics:
Citation Count: 1

The classical redistribution problem aims at optimally scheduling communications when moving from an initial data distribution to a target distribution where each processor will host a subset of data items. However, modern computing platforms are equipped with a powerful interconnection switch, and the cost of a given communication is (almost) ...

16
May 2014 IPDPSW '14: Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops
Publisher: IEEE Computer Society
Bibliometrics:
Citation Count: 2

Algorithm Based Fault Tolerant (ABFT) approaches promise unparalleled scalability and performance in failure-prone environments. With the advances in the theoretical and practical understanding of algorithmic traits enabling such approaches, a growing number of frequently used algorithms (including all widely used factorizations) have been proven ABFT-capable. In the context of larger ...
Keywords: fault-tolerance, resilience, high-performance computing, checkpoint, ABFT, model, performance evaluation

17
December 2013 PRDC '13: Proceedings of the 2013 IEEE 19th Pacific Rim International Symposium on Dependable Computing
Publisher: IEEE Computer Society
Bibliometrics:
Citation Count: 3

In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus on silent data corruption errors. Contrary to fail-stop failures, such latent errors cannot be detected immediately, and a mechanism to detect them must be provided. We consider two models: (i) errors are detected after some ...
Keywords: High-performance computing, checkpointing, silent data corruption, verification, error recovery

18
December 2013 Computing: Volume 95 Issue 12, December 2013
Publisher: Springer-Verlag New York, Inc.
Bibliometrics:
Citation Count: 12

As the scale of computing platforms becomes increasingly extreme, the requirements for application fault tolerance are increasing as well. Techniques to address this problem by improving the resilience of algorithms have been developed, but they currently receive no support from the programming model, and without such support, they are bound ...
Keywords: 68M15, 68M14, MPI, User-level fault mitigation, Fault tolerance

19
November 2013 Computing in Science and Engineering: Volume 15 Issue 6, November 2013
Publisher: IEEE Educational Activities Department
Bibliometrics:
Citation Count: 12

New high-performance computing system designs with steeply escalating processor and core counts, burgeoning heterogeneity and accelerators, and increasingly unpredictable memory access times call for one or more dramatically new programming paradigms. These new approaches must react and adapt quickly to unexpected contentions and delays, and they must provide the execution ...

20
August 2013 Euro-Par'13: Proceedings of the 19th international conference on Parallel Processing
Publisher: Springer-Verlag
Bibliometrics:
Citation Count: 1

Failures are increasingly threatening the efficiency of HPC systems, and current projections of Exascale platforms indicate that rollback recovery, the most convenient method for providing fault tolerance to general-purpose applications, reaches its own limits at such scales. One of the reasons explaining this unnerving situation comes from the focus that ...



The ACM Digital Library is published by the Association for Computing Machinery. Copyright © 2018 ACM, Inc.