A Highly Reliable Metadata Service for Large-Scale Distributed File Systems

Jiang Zhou, Yong Chen, Weiping Wang, Shuibing He, Dan Meng

Research output: Contribution to journal › Article › peer-review

7 Scopus citations


Many massive data processing applications need long, continuous, and uninterrupted data access. Distributed file systems serve as the back-end storage, providing global namespace management and reliability guarantees. As hardware failures and software issues increase with growing system scale, metadata service reliability has become a critical issue because it directly affects file and directory operations. Existing metadata management mechanisms provide some fault tolerance but are inadequate: they often have limitations in system availability, state consistency, and performance overhead, and they lack an effective mechanism for metadata reliability. This paper introduces a novel highly reliable metadata service to address these issues in large-scale file systems. Unlike traditional strategies, the proposed service adopts a new active-standby architecture for fault tolerance and uses a holistic approach to improve file system availability. A new shared storage pool (SSP) is designed for transparent metadata synchronization and replication between active and standby servers. Based on the SSP, a new policy called multiple actives multiple standbys (MAMS) is presented to perform metadata service recovery in case of failures. A new global state recovery strategy and a smart client fault tolerance mechanism are introduced to maintain the continuity of the metadata service. We have implemented this highly reliable metadata service in a prototype file system, CFS (Clover file system), and conducted extensive tests to evaluate it. Experimental results confirm that it can significantly improve file system reliability with fast failover under different failure scenarios while having negligible influence on performance. Compared with typical reliability designs in the Hadoop Avatar, Hadoop HA, and Boom-FS file systems, the mean-time-to-recovery (MTTR) with the highly reliable metadata service was reduced by 80.23, 65.46, and 28.13 percent, respectively.
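
The active-standby idea with a shared storage pool can be illustrated with a toy model. The following is a minimal Python sketch and not the CFS implementation: the class names (`SharedStoragePool`, `MetadataServer`), the append-only journal, the replay-based catch-up, and the "promote the most caught-up live standby" rule are all assumptions made for illustration, since the abstract does not specify these details.

```python
from dataclasses import dataclass, field


@dataclass
class SharedStoragePool:
    """Hypothetical stand-in for a shared pool: an append-only metadata journal."""
    journal: list = field(default_factory=list)

    def append(self, op):
        self.journal.append(op)
        return len(self.journal)  # sequence number of the appended operation


@dataclass
class MetadataServer:
    name: str
    pool: SharedStoragePool
    namespace: dict = field(default_factory=dict)
    applied: int = 0   # number of journal entries already applied locally
    alive: bool = True

    def apply(self, op):
        kind, path = op
        if kind == "create":
            self.namespace[path] = {}
        elif kind == "delete":
            self.namespace.pop(path, None)

    def write(self, op):
        """Active path (assumed): log to the shared pool, then apply locally."""
        seq = self.pool.append(op)
        self.apply(op)
        self.applied = seq

    def catch_up(self):
        """Standby path (assumed): replay journal entries not yet applied."""
        for op in self.pool.journal[self.applied:]:
            self.apply(op)
        self.applied = len(self.pool.journal)


def fail_over(standbys):
    """Promote the live standby that has applied the most journal entries."""
    candidates = [s for s in standbys if s.alive]
    if not candidates:
        raise RuntimeError("no live standby available")
    new_active = max(candidates, key=lambda s: s.applied)
    new_active.catch_up()  # replay whatever the failed active logged last
    return new_active


pool = SharedStoragePool()
active = MetadataServer("mds-a", pool)
standbys = [MetadataServer("mds-b", pool), MetadataServer("mds-c", pool)]

active.write(("create", "/data"))
standbys[0].catch_up()          # mds-b has replayed one entry so far
active.write(("create", "/data/file1"))
active.write(("delete", "/data"))

active.alive = False            # simulate a crash of the active server
promoted = fail_over(standbys)
```

In this sketch the promoted standby ends up with the same namespace as the failed active server because every operation was logged to the shared journal before being applied. A real system would also need fencing of the old active server, durable storage, and client redirection, which are omitted here.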

Original language: English
Article number: 8812918
Pages (from-to): 374-392
Number of pages: 19
Journal: IEEE Transactions on Parallel and Distributed Systems
Issue number: 2
State: Published - Feb 1 2020


  • Distributed file systems
  • fault tolerance
  • metadata reliability
  • metadata service
  • shared metadata storage

