Root Cause Analysis—Intermediate
Hoover Room
This version of the class is aimed at the senior sysadmin. You have a decade or more of experience in the industry, you are T-shaped (you specialize in one or two areas but have expertise across a range of technologies), and you have accumulated numerous technical skills; now you want to deepen your meta-expertise. We will create the fog of war, and then you'll practice applying a methodology to focus your attention, working with your team to divvy up tasks, escalate key insights to each other, integrate clues from a range of sources, and produce reports for business leadership. In this version of the class, we spend more time in small groups and more time practicing communication skills than we do in the beginner version. In addition to the technical contributors, each team will need a problem manager: perhaps an unusually broad engineer, perhaps a resource or project manager comfortable with coordinating teams of techs.
Troubleshooting is hard. In hindsight, the answer to a problem is often obvious, but in the chaos and confusion of the moment (too much data flowing in, time pressure, misleading clues), slicing through the distractions and focusing on the key elements is tough. This is a hands-on seminar: you will work through case studies taken from real-world situations. We divide into groups of 5 to 7, review a simplified version of Advance7's Rapid Problem Resolution (RPR) methodology, and then oscillate on a half-hour cycle between coming together as a class and splitting into groups. During class time, I will describe the scenario, explain the current RPR step, and offer to role-play key actors. During group time, I will walk around, coaching and answering questions.
The course material includes log extracts, packet traces, strace output, network diagrams, Cacti snapshots, and vendor tech support responses, all taken from actual RCA efforts. Preview the deck to get a feel for how your day will look. BYOL (Bring Your Own Laptop) for some hands-on, interactive, team-oriented, real-world puzzle solving.
Sysadmins and network engineers involved in troubleshooting multidisciplinary problems; problem managers and problem analysts wanting experience coordinating teams.
Practice in employing a structured approach to analyzing problems which span multiple technology spaces.
Case studies:
- HPC Cluster Woes: Intermittently, interactive performance on a high-performance computing cluster grinds to a halt, nodes hang, jobs vanish from the queue…
- Storage Stumbles: Most of the company relies on an 800TB wide-striped storage system, with a multi-protocol (SMB, NFS, iSCSI) front-end from one manufacturer plugged into a Fibre-Channel attached back-end from another manufacturer. Intermittently, the back-end fries a disk, IO latency spikes, clients crash…