The increasing demands of science and engineering applications push the limits of current large-scale systems, and is expected to achieve exascale (10^18 FLOPS) performance early in the next decade. One of the lesser studied challenge at extreme scales is the reliability of the computing system itself, primarily due to the very large number of cores and components utilized and to the sharp decrease of the Mean Time Between Failures on such systems (in the order of tens of minutes). This project departs from the traditional single component fault management model, and explores how multiple software libraries (and application components) used in the context of a single parallel application can interact to provide the holistic fault management support necessary for parallel applications targeting capability computing. This exploration will not be limited to software developed using a single parallel programming paradigm, but will be extended to encompass the more challenging case where multiple programming paradigms can be combined to achieve a common goal, to simulate a set of large scale scientific applications in use today.The goal of this project is to depart from the current siloed resilience mechanisms, and propose cross-layer composition solutions that can fundamentally address these resilience challenges at extreme scales.This exploration will not be limited to software developed using a single parallel programming paradigm, but will be extended to encompass the more challenging case where multiple programming paradigms can be combined to achieve a common goal, to simulate a set of large scale scientific applications in use today. More specifically, this proposal will address the following research challenges: (1) development of a theoretical foundation for a deeper understanding of the challenges and opportunities arising from combining different resilience models and methodologies; (2) design of a flexible programming abstraction to allow different resilience models and mechanisms to be combined to cooperate and address resilience in a more holistic manner; and (3) development of basic, programming paradigm independent, constructs necessary to implement cross-layer and domain-specific approaches to support resilience and to understand related performance / quality trade-offs. The proposed approach will be validated by exposing these generic abstractions in two different programming paradigms (MPI and OpenSHMEM), by creating and developing specialized concepts for each of these paradigms. This will enable the assessment of the validity of the concepts and the corresponding overheads imposed by the different software layers, using few software frameworks and applications.
|Effective start/end date||8/15/17 → 7/31/20|
- National Science Foundation (NSF)
Large scale systems