RL-Watchdog: A Fast and Predictable SSD Liveness Watchdog on Storage Systems

Authors: 

Jin Yong Ha, Seoul National University; Sangjin Lee, Chung-Ang University; Heon Young Yeom, Seoul National University; Yongseok Son, Chung-Ang University

Abstract: 

This paper proposes a reinforcement learning-based watchdog (RLW) that examines solid-state drive (SSD) liveness or failures by faults (e.g., controller/power faults and high temperature) quickly, precisely, and online to minimize application data loss. To do this, we first provide a lightweight watchdog (LWW) to actively and lightly examine SSD liveness by issuing a liveness-dedicated command to the SSD. Second, we introduce a reinforcement learning-based timeout predictor (RLTP) which predicts the timeout of the dedicated command, enabling the detection of a failure point regardless of the SSD model. Finally, we propose fast failure notification (FFN) to immediately notify the applications of the failure to minimize their potential data loss. We implement RLW with three techniques in a Linux kernel 6.0.0 and evaluate it in a single SSD and RAID using realistic power fault injection. The experimental results reveal that RLW reduces the data loss by up to 96.7% compared with the existing scheme, and its accuracy in predicting failure points reaches up to 99.8%.