Scalable crash consistency for staging-based in-situ scientific workflows

Shaohua Duan, Manish Parashar

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

As applications move towards extreme scales, data-related challenges are becoming significant concerns for scientific workflows, and in-situ/in-transit data processing has been proposed to address these challenges. However, increasing scales are expected to increase both the failure rate and the cost of resilience. Even worse, since coupled applications in workflows frequently interact and exchange large amounts of data, simply applying state-of-the-art fault tolerance techniques to individual application components cannot guarantee data consistency in workflows after failure recovery. Furthermore, naively applying fault tolerance techniques, such as checkpoint/restart, to an entire workflow prevents application components from using different resilience schemes and ultimately incurs significant latency, storage overhead, and performance degradation. This paper addresses the fault tolerance challenge for extreme-scale in-situ scientific workflows. We present a loosely coupled checkpoint/restart framework for in-situ workflows. The proposed approach provides a scalable and flexible fault tolerance scheme for in-situ workflows while still maintaining data consistency and a low resiliency cost. Specifically, we introduce a data logging mechanism in data staging, composed of a queue-based algorithm and a user interface, to keep data and events consistent during failure recovery. We have implemented our approach within DataSpaces, an open-source data staging middleware, and evaluated it using synthetic workflows on a Cray XC40 system (Cori) at different scales. We demonstrate that, in the presence of failures, uncoordinated checkpointing and hybrid checkpointing with the data logging scheme improved workflow execution time by up to 13.48% in comparison with a global coordinated checkpoint/restart approach.
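
The abstract does not spell out the queue-based logging algorithm, so the following is only an illustrative sketch of how a staging area might log writes in a queue to keep data consistent across uncoordinated restarts. It is not the DataSpaces API or the authors' implementation. The class `LoggedStaging` and its methods `put`, `get`, and `truncate` are hypothetical names, and the sketch assumes an in-memory store with integer version numbers that are written in non-decreasing order.

```python
from collections import deque


class LoggedStaging:
    """Toy in-memory staging area that logs writes in a FIFO queue.

    Hypothetical illustration only; not the DataSpaces interface.
    """

    def __init__(self):
        self.log = deque()   # (version, name, payload), oldest entry first
        self.latest = {}     # name -> highest version logged so far

    def put(self, name, version, payload):
        # A producer that rolls back to its own checkpoint re-executes and
        # re-sends versions it already wrote. Dropping those repeats keeps
        # consumers from seeing a duplicate or altered write.
        if version <= self.latest.get(name, -1):
            return False
        self.log.append((version, name, payload))
        self.latest[name] = version
        return True

    def get(self, name, version):
        # Reads are served from the log, so a consumer that restarts from
        # an older checkpoint can replay the versions it had already read.
        for v, n, payload in self.log:
            if v == version and n == name:
                return payload
        raise KeyError((name, version))

    def truncate(self, checkpointed_version):
        # Once every component has checkpointed past a version, its log
        # entries can no longer be requested and are discarded.
        # Assumes entries were appended in non-decreasing version order.
        while self.log and self.log[0][0] <= checkpointed_version:
            self.log.popleft()
```

In this sketch, a restarted producer can simply re-run its time steps while the staging side filters the repeated writes, and a restarted consumer re-reads from the log instead of forcing the producer to roll back. That is the intuition behind avoiding a global coordinated checkpoint, under the assumptions stated above.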

Original language: English (US)
Title of host publication: Proceedings - 2020 IEEE 34th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 340-348
Number of pages: 9
ISBN (Electronic): 9781728174457
State: Published - May 2020
Event: 34th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2020 - New Orleans, United States
Duration: May 18 2020 to May 22 2020

Publication series

Name: Proceedings - 2020 IEEE 34th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2020

Conference

Conference: 34th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2020
Country/Territory: United States
City: New Orleans
Period: 5/18/20 to 5/22/20

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Hardware and Architecture
  • Safety, Risk, Reliability and Quality
  • Control and Optimization

Keywords

  • Checkpointing
  • Crash consistency
  • Data staging
  • In-situ workflows
