PFault: A general framework for analyzing the reliability of high-performance parallel file systems

Jinrui Cao, Om Rameshwar Gatla, Mai Zheng, Dong Dai, Vidya Eswarappa, Yan Mu, Yong Chen

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

8 Scopus citations

Abstract

High-performance parallel file systems (PFSes) are of prime importance today. Despite this importance, their reliability is much less studied than that of local storage systems, largely due to the lack of an effective analysis methodology. In this paper, we introduce PFault, a general framework for analyzing the failure handling of PFSes. PFault automatically emulates the failure state of each storage device in the target PFS based on a set of well-defined fault models, and enables systematic analysis of the recoverability of the PFS under faults. To demonstrate its practicality, we apply PFault to study Lustre, one of the most widely used PFSes. Our analysis reveals a number of cases where Lustre's checking and repairing utility, LFSCK, fails with unexpected symptoms (e.g., I/O error, hang, reboot). Moreover, with the help of PFault, we are able to identify a resource leak problem where a portion of Lustre's internal namespace and storage space becomes unusable even after running LFSCK. On the other hand, we also verify that the latest Lustre has made noticeable improvements in failure handling compared with a previous version. We hope our study and framework can help improve PFSes for reliable high-performance computing.
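The workflow the abstract describes (emulate a failure state on each storage device according to a fault model, run the checking utility, and record whether the system recovers) can be illustrated with a small toy sketch. This is not PFault's implementation and does not touch Lustre or LFSCK: the in-memory cluster, the two fault models (`wipe_device`, `corrupt_blocks`), the `toy_checker`, and `run_campaign` are all hypothetical names invented for illustration.

```python
from copy import deepcopy
from typing import Callable, Dict, List, Tuple

# Toy stand-in for a PFS: device name -> {block name: content}.
Cluster = Dict[str, Dict[str, str]]

# Hypothetical fault models (assumptions, not the paper's definitions).
def wipe_device(dev: Dict[str, str]) -> None:
    """Emulate a device that has lost all of its contents."""
    dev.clear()

def corrupt_blocks(dev: Dict[str, str]) -> None:
    """Emulate a device whose blocks hold unreadable data."""
    for key in dev:
        dev[key] = "GARBAGE"

FAULT_MODELS: Dict[str, Callable[[Dict[str, str]], None]] = {
    "wipe": wipe_device,
    "corrupt": corrupt_blocks,
}

def toy_checker(faulty: Cluster, reference: Cluster) -> None:
    """Hypothetical check/repair utility: it can rebuild an empty
    device from the reference state but aborts on corrupted blocks."""
    for name, dev in faulty.items():
        if any(value == "GARBAGE" for value in dev.values()):
            raise RuntimeError(f"unreadable block on {name}")
        if not dev:
            dev.update(reference[name])

def run_campaign(
    cluster: Cluster,
    fault_models: Dict[str, Callable[[Dict[str, str]], None]],
    checker: Callable[[Cluster, Cluster], None],
) -> List[Tuple[str, str, str]]:
    """Apply every fault model to every device, one at a time, on a
    fresh copy of the cluster, then run the checker and record the outcome."""
    results = []
    for dev_name in sorted(cluster):
        for fault_name, inject in fault_models.items():
            faulty = deepcopy(cluster)   # the original is never modified
            inject(faulty[dev_name])     # emulate the failure state
            try:
                checker(faulty, cluster)
                outcome = "recovered" if faulty == cluster else "not recovered"
            except Exception as exc:     # the checker itself failed
                outcome = f"checker failed: {type(exc).__name__}"
            results.append((dev_name, fault_name, outcome))
    return results

CLUSTER: Cluster = {
    "mds": {"inode1": "meta"},
    "oss0": {"obj1": "data"},
}

if __name__ == "__main__":
    for row in run_campaign(CLUSTER, FAULT_MODELS, toy_checker):
        print(row)
```

Enumerating every (device, fault model) pair on a fresh copy of the initial state is what makes such an analysis systematic: each outcome can be attributed to exactly one injected fault.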

Original language: English
Title of host publication: ICS 2018
Subtitle of host publication: International Conference on Supercomputing
Publisher: Association for Computing Machinery
Pages: 1-11
Number of pages: 11
ISBN (Electronic): 9781450357838
DOIs: https://doi.org/10.1145/3205289.3205302
State: Published - Jun 12 2018
Event: 32nd International Conference on Supercomputing, ICS 2018 - Beijing, China
Duration: Jun 12 2018 to Jun 15 2018

Publication series

Name: Proceedings of the International Conference on Supercomputing

Conference

Conference: 32nd International Conference on Supercomputing, ICS 2018
Country: China
City: Beijing
Period: 06/12/18 to 06/15/18

Keywords

  • High performance computing
  • Parallel file systems
  • Reliability

Cite this

Cao, J., Gatla, O. R., Zheng, M., Dai, D., Eswarappa, V., Mu, Y., & Chen, Y. (2018). PFault: A general framework for analyzing the reliability of high-performance parallel file systems. In ICS 2018: International Conference on Supercomputing (pp. 1-11). (Proceedings of the International Conference on Supercomputing). Association for Computing Machinery. https://doi.org/10.1145/3205289.3205302