TY - GEN
T1 - MAMS
T2 - 44th International Conference on Parallel Processing, ICPP 2015
AU - Zhou, Jiang
AU - Chen, Yong
AU - Wang, Weiping
AU - Meng, Dan
N1 - Publisher Copyright:
© 2015 IEEE.
PY - 2015/12/8
Y1 - 2015/12/8
N2 - Most mass data processing applications nowadays often need long, continuous, and uninterrupted data access. Parallel/distributed file systems often use multiple metadata servers to manage the global namespace and provide a reliability guarantee. With the rapid increase of data amount and system scale, the probability of hardware or software failures keeps increasing, which easily leads to multiple points of failures. Metadata service reliability has become a crucial issue as it affects file and directory operations in the event of failures. Existing reliable metadata management mechanisms can provide fault tolerance but have disadvantages in system availability, state consistence, and performance overhead. This paper introduces a new highly reliable policy called MAMS (multiple actives multiple standbys) to ensure multiple metadata service reliability in file systems. Different from traditional strategies, the MAMS divides metadata servers into different replica groups and maintains more than one standby node for failover in each group. Combining the global view with distributed protocols, the MAMS achieves an automatic state transition and service takeover. We have implemented the MAMS policy in a prototyping file system and conducted extensive tests to validate and evaluate it. The experimental results confirm that the MAMS policy can achieve a faster transparent fault tolerance in different error scenarios with less influence on metadata operations. Compared with typical designs in Hadoop Avatar, Hadoop HA, and Boom-FS file systems, the mean time to recovery (MTTR) with the MAMS was reduced by 80.23%, 65.46% and 28.13%, respectively.
AB - Most mass data processing applications nowadays often need long, continuous, and uninterrupted data access. Parallel/distributed file systems often use multiple metadata servers to manage the global namespace and provide a reliability guarantee. With the rapid increase of data amount and system scale, the probability of hardware or software failures keeps increasing, which easily leads to multiple points of failures. Metadata service reliability has become a crucial issue as it affects file and directory operations in the event of failures. Existing reliable metadata management mechanisms can provide fault tolerance but have disadvantages in system availability, state consistence, and performance overhead. This paper introduces a new highly reliable policy called MAMS (multiple actives multiple standbys) to ensure multiple metadata service reliability in file systems. Different from traditional strategies, the MAMS divides metadata servers into different replica groups and maintains more than one standby node for failover in each group. Combining the global view with distributed protocols, the MAMS achieves an automatic state transition and service takeover. We have implemented the MAMS policy in a prototyping file system and conducted extensive tests to validate and evaluate it. The experimental results confirm that the MAMS policy can achieve a faster transparent fault tolerance in different error scenarios with less influence on metadata operations. Compared with typical designs in Hadoop Avatar, Hadoop HA, and Boom-FS file systems, the mean time to recovery (MTTR) with the MAMS was reduced by 80.23%, 65.46% and 28.13%, respectively.
KW - Cluster file systems
KW - Fault tolerance
KW - Metadata management
KW - Multiple metadata service
KW - Parallel file systems
UR - http://www.scopus.com/inward/record.url?scp=84976519550&partnerID=8YFLogxK
U2 - 10.1109/ICPP.2015.82
DO - 10.1109/ICPP.2015.82
M3 - Conference contribution
AN - SCOPUS:84976519550
T3 - Proceedings of the International Conference on Parallel Processing
SP - 729
EP - 738
BT - Proceedings - 2015 44th International Annual Conference on Parallel Processing, ICPP 2015
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 1 September 2015 through 4 September 2015
ER -