RL-Watchdog: A Fast and Predictable SSD Liveness Watchdog on Storage Systems

Authors: 

Jin Yong Ha, Seoul National University; Sangjin Lee, Chung-Ang University; Heon Young Yeom, Seoul National University; Yongseok Son, Chung-Ang University

Abstract: 

This paper proposes a reinforcement learning-based watchdog (RLW) that checks solid-state drive (SSD) liveness and detects failures caused by faults (e.g., controller faults, power faults, and high temperature) quickly, precisely, and online to minimize application data loss. To do this, we first provide a lightweight watchdog (LWW) that actively examines SSD liveness at low overhead by issuing a liveness-dedicated command to the SSD. Second, we introduce a reinforcement learning-based timeout predictor (RLTP) that predicts the timeout of the dedicated command, enabling the detection of a failure point regardless of the SSD model. Finally, we propose fast failure notification (FFN) to immediately notify applications of the failure and minimize their potential data loss. We implement RLW with these three techniques in Linux kernel 6.0.0 and evaluate it on a single SSD and on RAID using realistic power fault injection. The experimental results show that RLW reduces data loss by up to 96.7% compared with the existing scheme, and its accuracy in predicting failure points reaches up to 99.8%.
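
The three pieces named in the abstract (a liveness probe, a learned timeout, and a failure notification) can be illustrated with a small user-space sketch. This is not the paper's implementation: the abstract does not specify RLTP's states, actions, or rewards, so the sketch substitutes a plain epsilon-greedy bandit over a fixed set of candidate timeouts, and every name in it (`TimeoutBandit`, `reward_for`, `watchdog_step`, `learn_timeout`, the candidate values, the simulated latency range) is hypothetical. The probe is modeled as a Python callable rather than a kernel-issued liveness command.

```python
import random


class TimeoutBandit:
    """Epsilon-greedy bandit over candidate timeouts (a stand-in, not the paper's RLTP)."""

    def __init__(self, candidates_ms, epsilon=0.1, seed=0):
        self.candidates_ms = list(candidates_ms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # Optimistic start (1.0 exceeds every reachable reward), so each arm gets tried.
        self.value = [1.0] * len(self.candidates_ms)
        self.count = [0] * len(self.candidates_ms)

    def best_arm(self):
        return max(range(len(self.value)), key=self.value.__getitem__)

    def choose(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.candidates_ms))  # explore
        return self.best_arm()                                  # exploit

    def update(self, arm, reward):
        # Incremental running mean of the rewards observed for this arm.
        self.count[arm] += 1
        self.value[arm] += (reward - self.value[arm]) / self.count[arm]


def reward_for(timeout_ms, latency_ms):
    """Assumed reward: punish false alarms, otherwise prefer shorter timeouts."""
    if latency_ms > timeout_ms:        # a healthy device would be declared dead
        return -1.0
    return 1.0 - timeout_ms / 1000.0   # tighter timeout means faster detection


def watchdog_step(probe, timeout_ms, on_failure):
    """Issue one liveness probe; call on_failure if it misses the timeout."""
    latency_ms = probe()               # None models a device that never answers
    alive = latency_ms is not None and latency_ms <= timeout_ms
    if not alive:
        on_failure(timeout_ms)         # stands in for notifying applications
    return alive


def learn_timeout(steps=500, seed=0):
    """Train on a simulated healthy device whose probe latency is 6 to 9 ms."""
    rng = random.Random(seed)
    bandit = TimeoutBandit([1, 5, 10, 50, 100], seed=seed)
    for _ in range(steps):
        arm = bandit.choose()
        latency_ms = rng.uniform(6.0, 9.0)
        bandit.update(arm, reward_for(bandit.candidates_ms[arm], latency_ms))
    return bandit.candidates_ms[bandit.best_arm()]
```

Under these assumptions the learner settles on the smallest candidate that never misfires on the simulated healthy device. A call such as `watchdog_step(lambda: None, learn_timeout(), print)` then models an unresponsive device: the probe misses the learned timeout and the callback fires at once.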

USENIX ATC '24 Open Access Sponsored by
King Abdullah University of Science and Technology (KAUST)


BibTeX
@inproceedings {298619,
author = {Jin Yong Ha and Sangjin Lee and Heon Young Yeom and Yongseok Son},
title = {{RL-Watchdog}: A Fast and Predictable {SSD} Liveness Watchdog on Storage Systems},
booktitle = {2024 USENIX Annual Technical Conference (USENIX ATC 24)},
year = {2024},
isbn = {978-1-939133-41-0},
address = {Santa Clara, CA},
pages = {1083--1100},
url = {https://www.usenix.org/conference/atc24/presentation/ha},
publisher = {USENIX Association},
month = jul
}
