REMEM: REmote MEMory as checkpointing storage

Hui Jin, Xian He Sun, Yong Chen, Tao Ke

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

8 Scopus citations

Abstract

Checkpointing is a widely used mechanism for supporting fault tolerance, but it is notorious for its high-cost disk access. Memory-based checkpointing has been studied extensively in research but has seen little success in practice because of its complexity and potential reliability concerns. In this study, we present the design and implementation of REMEM, a REmote MEMory checkpointing system that extends checkpointing storage from disk to remote memory. A unique feature of REMEM is that it can be integrated seamlessly into existing disk-based checkpointing systems. A user can flexibly switch between REMEM and disk as checkpointing storage to balance efficiency and reliability. The implementation of REMEM on Open MPI is also introduced. The experimental results confirm that REMEM and the proposed adaptive checkpointing storage selection are promising in performance, reliability, and scalability.
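
The idea of switching between remote memory and disk behind an existing checkpointing interface can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the class names `CheckpointStore`, `DiskStore`, and `RemoteMemoryStore`, the `choose_store` helper, and its boolean flag are hypothetical and are not taken from REMEM or Open MPI. The abstract does not describe REMEM's actual interface or its adaptive selection policy, so a plain in-process dictionary stands in for another node's memory and a single flag stands in for the selection logic.

```python
import os
import pickle
import tempfile
from abc import ABC, abstractmethod


class CheckpointStore(ABC):
    """Hypothetical common interface shared by all checkpoint backends."""

    @abstractmethod
    def save(self, key: str, state: object) -> None:
        ...

    @abstractmethod
    def load(self, key: str) -> object:
        ...


class DiskStore(CheckpointStore):
    """Writes each checkpoint to a file: slower, but survives process loss."""

    def __init__(self, directory: str) -> None:
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, key: str) -> str:
        return os.path.join(self.directory, f"{key}.ckpt")

    def save(self, key: str, state: object) -> None:
        with open(self._path(key), "wb") as f:
            pickle.dump(state, f)

    def load(self, key: str) -> object:
        with open(self._path(key), "rb") as f:
            return pickle.load(f)


class RemoteMemoryStore(CheckpointStore):
    """Stand-in for remote memory (assumption: a dict plays another node's RAM)."""

    def __init__(self) -> None:
        self._memory: dict[str, bytes] = {}

    def save(self, key: str, state: object) -> None:
        # Serialize so the stored copy is independent of the live object.
        self._memory[key] = pickle.dumps(state)

    def load(self, key: str) -> object:
        return pickle.loads(self._memory[key])


def choose_store(prefer_reliability: bool,
                 disk: CheckpointStore,
                 remote: CheckpointStore) -> CheckpointStore:
    """Toy selection rule (assumption): disk for reliability, memory for speed."""
    return disk if prefer_reliability else remote


if __name__ == "__main__":
    disk = DiskStore(tempfile.mkdtemp())
    remote = RemoteMemoryStore()
    store = choose_store(prefer_reliability=False, disk=disk, remote=remote)
    store.save("rank0", {"step": 3, "values": [1.0, 2.0]})
    print(store.load("rank0"))
```

Because both backends implement the same two methods, the calling code does not change when the storage target does, which is the kind of seamless substitution the abstract describes.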

Original language: English
Title of host publication: Proceedings - 2nd IEEE International Conference on Cloud Computing Technology and Science, CloudCom 2010
Pages: 319-326
Number of pages: 8
State: Published - 2010
Event: 2nd IEEE International Conference on Cloud Computing Technology and Science, CloudCom 2010 - Indianapolis, IN, United States
Duration: Nov 30 2010 → Dec 3 2010

Publication series

Name: Proceedings - 2nd IEEE International Conference on Cloud Computing Technology and Science, CloudCom 2010

Conference

Conference: 2nd IEEE International Conference on Cloud Computing Technology and Science, CloudCom 2010
Country/Territory: United States
City: Indianapolis, IN
Period: 11/30/10 → 12/3/10
