Root Cause Analysis
Seabreeze
Troubleshooting is hard. I don't claim to be an expert at either doing it or teaching it. On the other hand, I have several decades of experience wielding packet analyzers, debuggers, and log parsers and have accumulated various strategies that I believe you'll find useful. This is a hands-on seminar: you will work through case studies taken from real-world situations. We divide into groups of 3–5, review a simplified version of Advance7's Rapid Problem Resolution (RPR) methodology, and then oscillate, on about a half-hour cycle, between coming together as a class and working in groups. During class time, I describe the scenario, explain the current RPR step, and offer to role-play key actors. During group time, I walk around, coaching and answering questions.
The course material includes log extracts, packet traces, strace output, network diagrams, Cacti snapshots, and vendor tech support responses, all taken from actual RCA efforts. I bring a dozen baseball caps emblazoned with Sys Admin or Storage Admin or End-User and will role-play those personas as needed.
An example: You ask the sysadmin to reboot the server. "Meh, OK, the server has rebooted, but after a couple of minutes, the CPU utilization is pegged at 100% again. What do you want to do next?"
BYOL (Bring Your Own Laptop): come with Wireshark and a viewer for PDF and PNG files installed, ready for some hands-on, interactive, team-oriented, real-world puzzle solving.
Draft deck visible at:
http://www.skendric.com/problem/rca/Root-Cause-Analysis-LISA-2012.pdf
Who should attend: System administrators and network engineers tasked with troubleshooting multidisciplinary problems.
Take back to work: Practice in employing a structured approach to analyzing problems that span multiple technology spaces.
Case studies, e.g.:
- Hourly Data Transfer Fails—Every hour, an application at the clinic wakes up, contacts its partner at a central hospital, and exchanges data, thus keeping the patient databases synchronized. Several times a day, this process fails, alerting the database administrator with the helpful message "A Network Error has occurred."
- Many Applications Crash—Outlook crashes, Word documents fail to save, Windows Explorer hangs: The office automation applications servicing ~1500 users intermittently report a range of error messages. Suspicion falls on the mass-storage device hosting home and shared directories.
- Slow Downloads—Intermittently, both internal and external users see slow downloads from the public Web site. Is it the load-balancer, or the firewall?