TY - GEN
T1 - REMEM
T2 - 2nd IEEE International Conference on Cloud Computing Technology and Science, CloudCom 2010
AU - Jin, Hui
AU - Sun, Xian He
AU - Chen, Yong
AU - Ke, Tao
PY - 2010
Y1 - 2010
N2 - Checkpointing is a widely used mechanism for supporting fault tolerance, but it is notorious for its costly disk access. Memory-based checkpointing has been studied extensively in research but has had little success in practice because of its complexity and potential reliability concerns. In this study, we present the design and implementation of REMEM, a REmote MEMory checkpointing system that extends checkpointing storage from disk to remote memory. A unique feature of REMEM is that it can be integrated seamlessly into existing disk-based checkpointing systems. A user can flexibly switch between REMEM and disk as checkpointing storage to balance efficiency and reliability. The implementation of REMEM on Open MPI is also introduced. The experimental results confirm that REMEM and the proposed adaptive checkpointing storage selection are promising in terms of performance, reliability, and scalability.
AB - Checkpointing is a widely used mechanism for supporting fault tolerance, but it is notorious for its costly disk access. Memory-based checkpointing has been studied extensively in research but has had little success in practice because of its complexity and potential reliability concerns. In this study, we present the design and implementation of REMEM, a REmote MEMory checkpointing system that extends checkpointing storage from disk to remote memory. A unique feature of REMEM is that it can be integrated seamlessly into existing disk-based checkpointing systems. A user can flexibly switch between REMEM and disk as checkpointing storage to balance efficiency and reliability. The implementation of REMEM on Open MPI is also introduced. The experimental results confirm that REMEM and the proposed adaptive checkpointing storage selection are promising in terms of performance, reliability, and scalability.
UR - http://www.scopus.com/inward/record.url?scp=79952372510&partnerID=8YFLogxK
U2 - 10.1109/CloudCom.2010.102
DO - 10.1109/CloudCom.2010.102
M3 - Conference contribution
AN - SCOPUS:79952372510
SN - 9780769543024
T3 - Proceedings - 2nd IEEE International Conference on Cloud Computing Technology and Science, CloudCom 2010
SP - 319
EP - 326
BT - Proceedings - 2nd IEEE International Conference on Cloud Computing Technology and Science, CloudCom 2010
Y2 - 30 November 2010 through 3 December 2010
ER -