TY - GEN
T1 - PFault
AU - Cao, Jinrui
AU - Gatla, Om Rameshwar
AU - Zheng, Mai
AU - Dai, Dong
AU - Eswarappa, Vidya
AU - Mu, Yan
AU - Chen, Yong
N1 - Funding Information:
We are thankful to the anonymous reviewers for their valuable feedback. This research is supported in part by the National Science Foundation under grants CNS-1338078, IIP-1362134, CCF-1409946, CNS-1566554, CCF-1717630, and CCF-1718336.
Publisher Copyright:
© 2018 Association for Computing Machinery.
PY - 2018/6/12
Y1 - 2018/6/12
N2 - High-performance parallel file systems (PFSes) are of prime importance today. However, despite this importance, their reliability is much less studied than that of local storage systems, largely due to the lack of an effective analysis methodology. In this paper, we introduce PFault, a general framework for analyzing the failure handling of PFSes. PFault automatically emulates the failure state of each storage device in the target PFS based on a set of well-defined fault models, and enables systematic analysis of the recoverability of the PFS under faults. To demonstrate its practicality, we apply PFault to study Lustre, one of the most widely used PFSes. Our analysis reveals a number of cases where Lustre's checking and repairing utility LFSCK fails with unexpected symptoms (e.g., I/O error, hang, reboot). Moreover, with the help of PFault, we are able to identify a resource leak problem where a portion of Lustre's internal namespace and storage space becomes unusable even after running LFSCK. On the other hand, we also verify that the latest Lustre has made noticeable improvements in failure handling compared with a previous version. We hope our study and framework can help improve PFSes for reliable high-performance computing.
KW - High performance computing
KW - Parallel file systems
KW - Reliability
UR - http://www.scopus.com/inward/record.url?scp=85055818808&partnerID=8YFLogxK
U2 - 10.1145/3205289.3205302
DO - 10.1145/3205289.3205302
M3 - Conference contribution
AN - SCOPUS:85055818808
T3 - Proceedings of the International Conference on Supercomputing
SP - 1
EP - 11
BT - ICS 2018
PB - Association for Computing Machinery
Y2 - 12 June 2018 through 15 June 2018
ER -