Linhua Tang, Huawei Ireland Research Centre
Building large-scale distributed systems is hard because of their inherent complexity and because failures are inevitable at every level, from hardware malfunctions to software bugs. Embracing the 'design for failure' philosophy, this paper examines advanced isolation techniques that reduce the blast radius of a failure in both space and time, thereby improving system resilience. Spatial containment strategies, such as cell-based architecture, confine a failure to a small part of the system and prevent it from cascading. Temporal mitigation focuses on rapid recovery and self-healing mechanisms that restore system health promptly after a failure occurs. The paper also explores the use of formal methods to verify the robustness of these designs, providing a rigorous way to check the reliability and effectiveness of the implemented solutions. This research underscores the importance of proactive architectural planning and continuous verification in keeping complex distributed systems stable.
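As a rough illustration of spatial containment, the sketch below assigns each customer to one cell with a stable hash and measures what fraction of customers is affected when a cell fails. It is a minimal sketch and not the author's implementation: the names `cell_for` and `blast_radius`, the SHA-256 routing rule, and the customer IDs are all assumptions made for this example.

```python
import hashlib


def cell_for(customer_id: str, num_cells: int) -> int:
    """Map a customer to a cell with a stable hash (hypothetical routing rule)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


def blast_radius(customers: list[str], num_cells: int, failed_cells: set[int]) -> float:
    """Return the fraction of customers whose assigned cell has failed."""
    if not customers:
        return 0.0
    affected = sum(1 for c in customers if cell_for(c, num_cells) in failed_cells)
    return affected / len(customers)


# Hypothetical customer population used only for illustration.
customers = [f"customer-{i}" for i in range(10_000)]

# One shared deployment: a single failure reaches every customer.
one_cell = blast_radius(customers, num_cells=1, failed_cells={0})

# Ten independent cells: one failed cell reaches roughly a tenth of customers.
ten_cells = blast_radius(customers, num_cells=10, failed_cells={3})

print(f"single cell: {one_cell:.1%} of customers affected")
print(f"ten cells, one failed: {ten_cells:.1%} of customers affected")
```

With a roughly uniform hash, the share of customers affected by one failed cell is close to 1/N for N cells. That only holds if cells share no dependencies, which this sketch assumes and does not model.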
Linhua Tang (also known as James) is a software engineer and tech lead for global server load balancing and formal methods at Huawei Ireland Research Centre. Before that, he worked on various distributed systems at Microsoft and Amazon.
author = {Linhua Tang},
title = {Blast Radius Reduction for {Large-Scale} Distributed Systems},
year = {2024},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}