TY - GEN
T1 - A generic framework for testing parallel file systems
AU - Cao, Jinrui
AU - Wang, Simeng
AU - Dai, Dong
AU - Zheng, Mai
AU - Chen, Yong
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2017/1/30
Y1 - 2017/1/30
N2 - Large-scale parallel file systems are of prime importance today. However, despite this importance, their failure-recovery capability has been studied far less than that of local storage systems. Recent studies on local storage systems have exposed various vulnerabilities that could lead to data loss under failure events, which raises concerns for the parallel file systems built on top of them. This paper proposes a generic framework for testing the failure handling of large-scale parallel file systems. The framework captures all disk I/O commands on all storage nodes of the target system to emulate realistic failure states, and checks whether the target system can recover to a consistent state without incurring data loss. We have built a prototype for the Lustre file system. Our preliminary results show that the framework is able to uncover the internal I/O behavior of Lustre under different workloads and failure conditions, which provides a solid foundation for further analyzing the failure recovery of parallel file systems.
UR - http://www.scopus.com/inward/record.url?scp=85015323588&partnerID=8YFLogxK
U2 - 10.1109/PDSW-DISCS.2016.013
DO - 10.1109/PDSW-DISCS.2016.013
M3 - Conference contribution
AN - SCOPUS:85015323588
T3 - Proceedings of PDSW-DISCS 2016: 1st Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems - Held in conjunction with SC16: The International Conference for High Performance Computing, Networking, Storage and Analysis
SP - 49
EP - 54
BT - Proceedings of PDSW-DISCS 2016
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 1st Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems, PDSW-DISCS 2016
Y2 - 14 November 2016
ER -