TY - GEN
T1 - Optimizing HPC fault-tolerant environment
T2 - 39th International Conference on Parallel Processing, ICPP 2010
AU - Jin, Hui
AU - Chen, Yong
AU - Zhu, Huaiyu
AU - Sun, Xian-He
PY - 2010
Y1 - 2010
N2 - The increasingly large ensemble size of modern High-Performance Computing (HPC) systems has drastically increased the possibility of failures. Performance under failures, and its optimization, have become timely and important issues facing the HPC community. In this study, we propose an analytical model to predict application performance. The model characterizes the impact of coordinated checkpointing and system failures on application performance, taking into account workload, the number of nodes, failure arrival rate, recovery cost, and checkpointing interval and overhead. Based on the model, we gauge three parameters (the number of compute nodes, the checkpointing interval, and the number of spare nodes) to conduct a comprehensive study of performance optimization under failures. Performance scalability under failures is also studied to explore the room for performance improvement offered by different parameters. Experimental results from both synthetic and actual system failure logs confirm that the proposed model and optimization methodologies are effective and feasible.
AB - The increasingly large ensemble size of modern High-Performance Computing (HPC) systems has drastically increased the possibility of failures. Performance under failures, and its optimization, have become timely and important issues facing the HPC community. In this study, we propose an analytical model to predict application performance. The model characterizes the impact of coordinated checkpointing and system failures on application performance, taking into account workload, the number of nodes, failure arrival rate, recovery cost, and checkpointing interval and overhead. Based on the model, we gauge three parameters (the number of compute nodes, the checkpointing interval, and the number of spare nodes) to conduct a comprehensive study of performance optimization under failures. Performance scalability under failures is also studied to explore the room for performance improvement offered by different parameters. Experimental results from both synthetic and actual system failure logs confirm that the proposed model and optimization methodologies are effective and feasible.
KW - Checkpointing
KW - Fault tolerance
KW - High-performance computing
KW - Performance optimization
KW - Scalability
UR - http://www.scopus.com/inward/record.url?scp=78649621898&partnerID=8YFLogxK
U2 - 10.1109/ICPP.2010.80
DO - 10.1109/ICPP.2010.80
M3 - Conference contribution
AN - SCOPUS:78649621898
SN - 9780769541563
T3 - Proceedings of the International Conference on Parallel Processing
SP - 525
EP - 534
BT - Proceedings - 39th International Conference on Parallel Processing, ICPP 2010
Y2 - 13 September 2010 through 16 September 2010
ER -