Science Reliability Engineering for High Performance Computing

Thursday, 31 October, 2024 - 14:50–15:30 GMT

Nicholas Jones, LANL

Abstract: 

High Performance Computing (HPC) as an industry has long relied on human-facing operational workflows. These workflows exist because HPC systems are generally purpose-built machines for small sets of code bases with very specific performance metrics. This purpose-built nature has produced bespoke, one-off systems, with processes and infrastructure that serve a small set of code bases well but aren't resilient to generational churn. To combat the difficulty of generational churn, we've adopted an SRE mindset for our new administrative stack, OpenCHAMI. This lets us keep our figures of merit (exact reproducibility, parallel bandwidth, and compute time to solution) aligned with what benefits our customer base the most.

Nicholas Jones, LANL

Nick is a scientist at Los Alamos National Lab, where he works on system security architecture, CI/CD infrastructure, and shared computing environments and strategies across the National Nuclear Security Administration Laboratories.

BibTeX
@conference {302217,
author = {Nicholas Jones},
title = {Science Reliability Engineering for High Performance Computing},
year = {2024},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}