MSFRD: Mutation Similarity based SSD Failure Rating and Diagnosis for Complex and Volatile Production Environments

Authors: 

Yuqi Zhang, Tianyi Zhang, Wenwen Hao, Shuyang Wang, Na Liu, and Xing He, Samsung R&D Institute China Xi'an, Samsung Electronics; Yang Zhang, Weixin Wang, Yongguang Cheng, Huan Wang, Jie Xu, Feng Wang, and Bo Jiang, ByteDance Inc.; Yongwong Gwon, Jongsung Na, Zoe Kim, and Geunrok Oh, Samsung Electronics

Abstract: 

SSD failures have an increasing impact on storage reliability and performance in data centers. Some manufacturers have customized fine-grained Telemetry attributes to analyze and identify SSD failures. Based on Telemetry data, this paper proposes the mutation similarity based failure rating and diagnosis (MSFRD) scheme to predict failures in the dynamic environments of data centers and to improve failure handling efficiency. MSFRD detects the internal mutations of SSDs dynamically in real time and measures their similarity to the mutations of historical failed SSDs and healthy SSDs for failure prediction and early rating. Based on the rating, unavailable SSDs with serious failures are handled immediately, while available SSDs with less serious failures are continuously tracked and diagnosed. MSFRD is evaluated on real Telemetry datasets collected from large-scale SSD deployments in data centers. Compared with existing schemes, MSFRD improves precision by 23.8% and recall by 38.9% on average for failure prediction. The results also show the effectiveness of MSFRD for failure rating and progressive diagnosis.
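
The abstract describes the core idea only at a high level: detect a mutation in an SSD's Telemetry attributes, compare it with mutations seen in historical failed and healthy SSDs, and rate the drive accordingly. The following is a minimal illustrative sketch of that idea, not the authors' implementation. The mutation definition (difference from a baseline), the use of cosine similarity, the function names, the rating labels, and the thresholds are all assumptions introduced here for illustration.

```python
import math


def mutation_vector(current, baseline):
    """Hypothetical mutation: per-attribute change from a baseline reading."""
    return [c - b for c, b in zip(current, baseline)]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rate(mutation, failed_refs, healthy_refs, track_gap=0.1, serious_gap=0.5):
    """Rate a drive by how much closer its mutation is to failed references
    than to healthy ones. Thresholds are assumed values, not from the paper."""
    sim_failed = max(cosine(mutation, ref) for ref in failed_refs)
    sim_healthy = max(cosine(mutation, ref) for ref in healthy_refs)
    gap = sim_failed - sim_healthy
    if gap >= serious_gap:
        return "serious"   # handle immediately
    if gap >= track_gap:
        return "track"     # keep tracking and diagnosing
    return "healthy"


# Toy reference mutations over two made-up Telemetry attributes.
failed_refs = [[1.0, 0.0]]
healthy_refs = [[0.0, 1.0]]

mutation = mutation_vector([5.0, 1.0], [1.0, 1.0])
print(rate(mutation, failed_refs, healthy_refs))
```

In this toy setup a mutation aligned with the failed reference is rated "serious", one that only leans toward it is rated "track", and one aligned with the healthy reference is rated "healthy", mirroring the immediate-handling versus continued-tracking split described in the abstract.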

USENIX ATC '24 Open Access Sponsored by
King Abdullah University of Science and Technology (KAUST)

BibTeX
@inproceedings {298593,
author = {Yuqi Zhang and Tianyi Zhang and Wenwen Hao and Shuyang Wang and Na Liu and Xing He and Yang Zhang and Weixin Wang and Yongguang Cheng and Huan Wang and Jie Xu and Feng Wang and Bo Jiang and Yongwong Gwon and Jongsung Na and Zoe Kim and Geunrok Oh},
title = {{MSFRD}: Mutation Similarity based {SSD} Failure Rating and Diagnosis for Complex and Volatile Production Environments},
booktitle = {2024 USENIX Annual Technical Conference (USENIX ATC 24)},
year = {2024},
isbn = {978-1-939133-41-0},
address = {Santa Clara, CA},
pages = {869--884},
url = {https://www.usenix.org/conference/atc24/presentation/zhang-yuqi},
publisher = {USENIX Association},
month = jul
}
