 Yves L Robert

Bibliometrics: publication history
Average citations per article: 4.88
Citation Count: 902
Publication count: 185
Publication years: 1986-2017
Available for download: 18
Average downloads per article: 222.72
Downloads (cumulative): 4,009
Downloads (12 Months): 350
Downloads (6 Weeks): 60

190 results found

Result 1 – 20 of 190

1 published by ACM
August 2018 ICPP 2018: Proceedings of the 47th International Conference on Parallel Processing
Publisher: ACM
Bibliometrics:
Citation Count: 0
Downloads (6 Weeks): 13,   Downloads (12 Months): 13,   Downloads (Overall): 13

This work presents a realistic performance model to execute scientific workflows on high-bandwidth-memory architectures such as the Intel Knights Landing. We provide a detailed analysis of the execution time on such platforms, taking into account transfers from both fast and slow memory and their overlap with computations. We discuss several ...

2 published by ACM
August 2018 ICPP 2018: Proceedings of the 47th International Conference on Parallel Processing
Publisher: ACM
Bibliometrics:
Citation Count: 0
Downloads (6 Weeks): 6,   Downloads (12 Months): 6,   Downloads (Overall): 6

This work deals with scheduling and checkpointing strategies to execute scientific workflows on failure-prone large-scale platforms. To the best of our knowledge, this work is the first to target fail-stop errors for arbitrary workflows. Most previous work addresses soft errors, which corrupt the task being executed by a processor but ...

3
January 2018 International Journal of High Performance Computing Applications: Volume 32 Issue 1, January 2018
Publisher: Sage Publications, Inc.
Bibliometrics:
Citation Count: 0

Cache-partitioned architectures allow subsections of the shared last-level cache (LLC) to be exclusively reserved for some applications. This technique dramatically limits interactions between applications that are concurrently executing on a multicore machine. Consider n applications that execute concurrently, with the objective to minimize the makespan, defined as the maximum completion ...
Keywords: Co-scheduling, cache partitioning, complexity results

4
January 2018 International Journal of High Performance Computing Applications: Volume 32 Issue 1, January 2018
Publisher: Sage Publications, Inc.
Bibliometrics:
Citation Count: 0

Recently, the benefits of co-scheduling several applications have been demonstrated in a fault-free context, both in terms of performance and energy savings. However, large-scale computer systems are confronted by frequent failures, and resilience techniques must be employed for large applications to execute efficiently. Indeed, failures may create severe imbalance between ...
Keywords: Resilience, co-scheduling, complexity results, heuristics, redistribution, simulations

5
January 2018 International Journal of High Performance Computing Applications: Volume 32 Issue 1, January 2018
Publisher: Sage Publications, Inc.
Bibliometrics:
Citation Count: 0

Building an infrastructure for exascale applications requires, in addition to many other key components, a stable and efficient failure detector. This article describes the design and evaluation of a robust failure detector that can maintain and distribute the correct list of alive resources within proven and scalable bounds. The detection ...
Keywords: MPI, failure detection, fault tolerance

6 published by ACM
June 2017 FTXS '17: Proceedings of the 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale
Publisher: ACM
Bibliometrics:
Citation Count: 0
Downloads (6 Weeks): 4,   Downloads (12 Months): 32,   Downloads (Overall): 39

In this paper, we design and analyze strategies to replicate the execution of an application on two different platforms subject to failures, using checkpointing on a shared stable storage. We derive the optimal pattern size $W$ for a periodic checkpointing strategy where both platforms concurrently try and execute $W$ units of ...
Keywords: checkpoint, heterogeneous platforms, replication

7 published by ACM
June 2017 FTXS '17: Proceedings of the 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale
Publisher: ACM
Bibliometrics:
Citation Count: 0
Downloads (6 Weeks): 4,   Downloads (12 Months): 27,   Downloads (Overall): 37

This paper provides a model and an analytical study of replication as a technique to detect and correct silent errors. Although other detection techniques exist for HPC applications, based on algorithms (ABFT), invariant preservation or data analytics, replication remains the most transparent and least intrusive technique. We explore the right ...
Keywords: checkpoint, replication, silent errors

8
June 2017
Bibliometrics:
Citation Count: 0

Full of practical examples, Introduction to Scheduling presents the basic concepts and methods, fundamental results, and recent developments of scheduling theory. With contributions from highly respected experts, it provides self-contained, easy-to-follow, yet rigorous presentations of the material. The book first classifies scheduling problems and their complexity and then presents examples ...

9
March 2017 Supercomputing Frontiers and Innovations: an International Journal: Volume 4 Issue 1, March 2017
Publisher: South Ural State University
Bibliometrics:
Citation Count: 0

The objective of the PULSAR project was to design a programming model suitable for large-scale machines with complex memory hierarchies, and to deliver a prototype implementation of a runtime system supporting that model. PULSAR tackled the challenge by proposing a programming model based on systolic processing and virtualization. The PULSAR ...
Keywords: dataflow scheduling, hardware accelerators, multicore processors, runtime scheduling, systolic arrays, virtualization, distributed computing, massively parallel computing

10
January 2017 International Journal of High Performance Computing Applications: Volume 31 Issue 1, January 2017
Publisher: Sage Publications, Inc.
Bibliometrics:
Citation Count: 0

Errors have become a critical problem for high-performance computing. Checkpointing protocols are often used for error recovery after fail-stop failures. However, silent errors cannot be ignored, and their peculiarity is that such errors are identified only when the corrupted data is activated. To cope with silent errors, we need a ...
Keywords: High-performance computing, silent data corruption, silent error, verification, checkpointing, fault tolerance

11
January 2017 IEEE Transactions on Parallel and Distributed Systems: Volume 28 Issue 1, January 2017
Publisher: IEEE Press
Bibliometrics:
Citation Count: 0

The traditional single-level checkpointing method suffers from significant overhead on large-scale platforms. Hence, multilevel checkpointing protocols have been studied extensively in recent years. The multilevel checkpoint approach allows different levels of checkpoints to be set (each with different checkpoint overheads and recovery abilities), in order to further improve the fault ...

12
December 2016 Journal of Scheduling: Volume 19 Issue 6, December 2016
Publisher: Kluwer Academic Publishers
Bibliometrics:
Citation Count: 0

This paper investigates co-scheduling algorithms for processing a set of parallel applications. Instead of executing each application one by one, using a maximum degree of parallelism for each of them, we aim at scheduling several applications concurrently. We partition the original application set into a series of packs, which are ...

13
November 2016 SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: IEEE Press
Bibliometrics:
Citation Count: 0
Downloads (6 Weeks): 4,   Downloads (12 Months): 56,   Downloads (Overall): 187

Building an infrastructure for Exascale applications requires, in addition to many other key components, a stable and efficient failure detector. This paper describes the design and evaluation of a robust failure detector, able to maintain and distribute the correct list of alive resources within proven and scalable bounds. The detection ...
Keywords: fault-tolerance, MPI, failure detection

14 published by ACM
July 2016 ACM Transactions on Parallel Computing (TOPC): Volume 3 Issue 2, August 2016
Publisher: ACM
Bibliometrics:
Citation Count: 0
Downloads (6 Weeks): 3,   Downloads (12 Months): 46,   Downloads (Overall): 46

In this article, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to cope with both fail-stop and silent errors. The objective is to minimize makespan and/or energy consumption. For divisible load applications, we use first-order approximations to find the optimal checkpointing period to minimize execution time, ...
Keywords: resilience, silent data corruption, silent error, verification, checkpoint, failure, HPC, fail-stop error

15
February 2016 Parallel Computing: Volume 52 Issue C, February 2016
Publisher: Elsevier Science Publishers B. V.
Bibliometrics:
Citation Count: 0

Algorithms for finding the optimal distribution compatible with a given data partition. Analysis of the algorithms for different cost metrics. NP-completeness proof for the redistribution problem followed by a computational kernel. Experimental results for the 1D-stencil kernel and the QR factorization algorithm. The classical redistribution problem aims at optimally scheduling communications when reshuffling ...
Keywords: Parsec, Redistribution, Stencil, QR factorization, Linear algebra, Data partition

16
December 2015 HIPC '15: Proceedings of the 2015 IEEE 22nd International Conference on High Performance Computing (HiPC)
Publisher: IEEE Computer Society
Bibliometrics:
Citation Count: 0

Many methods are available to detect silent errors in high-performance computing (HPC) applications. Each comes with a given cost and recall (fraction of all errors that are actually detected). The main contribution of this paper is to characterize the optimal computational pattern for an application: which detector(s) to use, how ...

17
November 2015 PRDC '15: Proceedings of the 2015 IEEE 21st Pacific Rim International Symposium on Dependable Computing (PRDC)
Publisher: IEEE Computer Society
Bibliometrics:
Citation Count: 0

In this paper, we discuss several scheduling algorithms to execute independent tasks with voltage overscaling. Given a frequency to execute the tasks, operating at a voltage below threshold leads to significant energy savings but also induces timing errors. A verification mechanism must be enforced to detect these errors. Contrary to ...

18
November 2015 Journal of Parallel and Distributed Computing: Volume 85 Issue C, November 2015
Publisher: Academic Press, Inc.
Bibliometrics:
Citation Count: 0

This paper introduces hybrid LU-QR algorithms for solving dense linear systems of the form Ax = b. Throughout a matrix factorization, these algorithms dynamically alternate LU with local pivoting and QR elimination steps based upon some robustness criterion. LU elimination steps can be very efficiently parallelized, and are ...
Keywords: Performance, Stability, LU factorization, Numerical algorithms, QR factorization

19
October 2015 Future Generation Computer Systems: Volume 51 Issue C, October 2015
Publisher: Elsevier Science Publishers B. V.
Bibliometrics:
Citation Count: 1

Processor failures in post-petascale parallel computing platforms are common occurrences. The traditional fault-tolerance solution, checkpoint-rollback-recovery, severely limits parallel efficiency. One solution is to replicate application processes so that a processor failure does not necessarily imply an application failure. Process replication, combined with checkpoint-rollback-recovery, has been recently advocated. We first derive ...
Keywords: Process replication, Parallel computing, Rollback-recovery, Fault-tolerance, Checkpoint

20 published by ACM
June 2015 FTXS '15: Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale
Publisher: ACM
Bibliometrics:
Citation Count: 0
Downloads (6 Weeks): 2,   Downloads (12 Months): 6,   Downloads (Overall): 45

We propose a software-based approach using dynamic voltage overscaling to reduce the energy consumption of HPC applications. This technique aggressively lowers the supply voltage below nominal voltage, which introduces timing errors, and we use Algorithm-Based Fault-Tolerance (ABFT) to provide fault tolerance for matrix operations. We introduce a formal model, and ...
Keywords: voltage overscaling, abft, energy efficiency, timing errors



The ACM Digital Library is published by the Association for Computing Machinery. Copyright © 2018 ACM, Inc.