Checkpointing orchestration: Toward a scalable HPC fault-tolerant environment

Hui Jin, Tao Ke, Yong Chen, Xian He Sun

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

15 Scopus citations

Abstract

Checkpointing is widely used in technical computing. However, its overhead has become a growing concern in recent years, especially on large-scale parallel computer systems, where checkpointing generates a huge number of concurrent I/O writes. This burst of writes, combined with the worsening I/O-wall problem, often leads to network and I/O congestion and severely degrades overall system performance. Recognizing contention as a dominant performance factor, in this paper we propose a systematic approach named checkpointing orchestration to reduce write contention, which combines, in coordination, the marshaling of concurrent checkpoint requests and the adoption of vertical data access. A prototype of the proposed checkpointing orchestration approach has been implemented at the system level under Open MPI over the PVFS2 file system. Extensive experiments based on the NPB benchmarks have been conducted to verify the design and implementation. Experimental results show that checkpointing orchestration reduced the checkpointing cost by more than 30%. Checkpointing cost was halved for 4 of the 5 class C NPB benchmarks.

Original language: English
Title of host publication: Proceedings - 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012
Pages: 276-283
Number of pages: 8
DOIs: https://doi.org/10.1109/CCGrid.2012.61
State: Published - 2012
Event: 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012 - Ottawa, ON, Canada
Duration: May 13 2012 to May 16 2012

Publication series

Name: Proceedings - 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012

Conference

Conference: 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012
Country: Canada
City: Ottawa, ON
Period: 05/13/12 to 05/16/12

Keywords

  • Checkpointing
  • Fault Tolerance
  • Parallel File System


Cite this

Jin, H., Ke, T., Chen, Y., & Sun, X. H. (2012). Checkpointing orchestration: Toward a scalable HPC fault-tolerant environment. In Proceedings - 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012 (pp. 276-283). [6217432] (Proceedings - 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012). https://doi.org/10.1109/CCGrid.2012.61