Optimizing HPC fault-tolerant environment: An analytical approach

Hui Jin, Yong Chen, Huaiyu Zhu, Xian He Sun

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

36 Scopus citations

Abstract

The increasingly large ensemble size of modern High-Performance Computing (HPC) systems has drastically increased the likelihood of failures. Performance under failures, and its optimization, have become timely and important issues for the HPC community. In this study, we propose an analytical model to predict application performance. The model characterizes the impact of coordinated checkpointing and system failures on application performance, accounting for workload, the number of nodes, failure arrival rate, recovery cost, and checkpointing interval and overhead. Based on the model, we examine three parameters (the number of compute nodes, the checkpointing interval, and the number of spare nodes) in a comprehensive study of performance optimization under failures. Performance scalability under failures is also studied to explore the room for performance improvement offered by each parameter. Experimental results from both synthetic and actual system failure logs confirm that the proposed model and optimization methodologies are effective and feasible.
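As a rough illustration of the trade-off the abstract describes between checkpointing overhead and failure rate, the sketch below uses Young's classic first-order approximation for the checkpoint interval. This is not the model proposed in the paper, which is not reproduced on this page. The function names are hypothetical, and the code assumes independent, exponentially distributed node failures, so that the system mean time between failures (MTBF) shrinks in proportion to the number of nodes.

```python
import math


def system_mtbf(node_mtbf_hours: float, num_nodes: int) -> float:
    """System MTBF, assuming independent exponential node failures.

    Assumption for illustration only: with N identical nodes, the
    aggregate failure rate is N times the per-node rate.
    """
    if num_nodes <= 0 or node_mtbf_hours <= 0:
        raise ValueError("node_mtbf_hours and num_nodes must be positive")
    return node_mtbf_hours / num_nodes


def young_interval(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's first-order approximation: sqrt(2 * C * MTBF).

    This is a textbook approximation, not the paper's analytical model.
    """
    if checkpoint_cost_hours <= 0 or mtbf_hours <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


if __name__ == "__main__":
    # Hypothetical inputs: 10,000-hour node MTBF, 30-minute checkpoints.
    for nodes in (100, 1_000, 10_000):
        mtbf = system_mtbf(10_000.0, nodes)
        interval = young_interval(0.5, mtbf)
        print(f"{nodes:>6} nodes: system MTBF {mtbf:8.2f} h, "
              f"suggested interval {interval:6.2f} h")
```

Under these assumptions, adding nodes shortens the system MTBF and therefore the suggested checkpoint interval, which is one reason the number of nodes and the checkpointing interval are worth optimizing together.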

Original language: English
Title of host publication: Proceedings - 39th International Conference on Parallel Processing, ICPP 2010
Pages: 525-534
Number of pages: 10
State: Published - 2010
Event: 39th International Conference on Parallel Processing, ICPP 2010 - San Diego, CA, United States
Duration: Sep 13 2010 to Sep 16 2010

Publication series

Name: Proceedings of the International Conference on Parallel Processing
ISSN (Print): 0190-3918

Conference

Conference: 39th International Conference on Parallel Processing, ICPP 2010
Country/Territory: United States
City: San Diego, CA
Period: 09/13/10 to 09/16/10

Keywords

  • Checkpointing
  • Fault tolerance
  • High-performance computing
  • Performance optimization
  • Scalability
