HeteroCheckpoint: Efficient checkpointing for accelerator-based systems

Sudarsun Kannan, Naila Farooqui, Ada Gavrilovska, Karsten Schwan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

16 Scopus citations

Abstract

Moving toward exascale, the number of GPUs in HPC machines is bound to increase, and applications will spend increasing amounts of time running on those GPU devices. While GPU usage has already led to substantial speedup for HPC codes, GPU failure rates due to overheating are at least 10 times higher than those seen for the CPUs now commonly used on HPC machines. This makes it increasingly important for GPUs to have robust checkpoint/restart mechanisms. This paper introduces a unified CPU-GPU checkpoint mechanism, which can efficiently checkpoint the combined GPU-CPU memory state resident on machine nodes. Efficiency is gained in part by addressing the end-to-end data movements required for checkpointing, from GPU to storage, by introducing novel pre-copy and checksum methods. These methods reduce the checkpoint data movement cost seen by HPC applications, with initial measurements using different benchmark applications showing up to 60% reduced checkpoint overhead. Additional exploration of the use of next-generation storage, like NVM, shows further promise of reduced checkpointing overheads.
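The abstract names checksum methods as one way to cut checkpoint data movement but does not describe them. As a rough illustration of the general idea only, and not the paper's algorithm, the sketch below compares per-chunk checksums against the previous checkpoint and copies only the chunks that changed. The names `CHUNK_SIZE`, `chunk_digests`, and `incremental_checkpoint`, the SHA-1 digest, and the in-memory `store` dictionary standing in for storage are all assumptions made for this example; pre-copy and GPU transfers are not modeled.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical chunk granularity in bytes (an assumption)


def chunk_digests(buf, chunk_size=CHUNK_SIZE):
    """Return one SHA-1 digest per fixed-size chunk of a bytes-like buffer."""
    return [
        hashlib.sha1(buf[off:off + chunk_size]).digest()
        for off in range(0, len(buf), chunk_size)
    ]


def incremental_checkpoint(buf, prev_digests, store, chunk_size=CHUNK_SIZE):
    """Copy only chunks whose checksum differs from the previous checkpoint.

    `store` is a dict mapping chunk index -> bytes; it stands in for the
    checkpoint storage target. Returns (new_digests, chunks_copied).
    """
    new_digests = chunk_digests(buf, chunk_size)
    copied = 0
    for i, digest in enumerate(new_digests):
        # A chunk is copied if it is new or its digest has changed.
        if i >= len(prev_digests) or prev_digests[i] != digest:
            off = i * chunk_size
            store[i] = bytes(buf[off:off + chunk_size])
            copied += 1
    return new_digests, copied
```

In this sketch, a first call with an empty `prev_digests` list copies every chunk, and a later call that passes the digests returned earlier copies only the chunks modified in between. The trade-off is extra checksum computation in exchange for less data written to storage.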

Original language: English (US)
Title of host publication: Proceedings of the International Conference on Dependable Systems and Networks
Publisher: IEEE Computer Society
Pages: 738-743
Number of pages: 6
ISBN (Electronic): 9781479922338
State: Published - Sep 18 2014
Externally published: Yes
Event: 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2014 - Atlanta, United States
Duration: Jun 23 2014 → Jun 26 2014

Publication series

Name: Proceedings of the International Conference on Dependable Systems and Networks

Other

Other: 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2014
Country/Territory: United States
City: Atlanta
Period: 6/23/14 → 6/26/14

All Science Journal Classification (ASJC) codes

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications

Keywords

  • Checkpoint
  • GPUs
  • NVM
  • Pre-Copy
