Non-Abstract Large System Design for Sysadmins
John Looney put the "Large" in "Large Installation System Administration Conference" on Sunday. In an all-day training session, attendees learned about "non-abstract large system design." The non-abstract aspect came into play over the course of the day as groups worked to architect the infrastructure for a hypothetical image-sharing service. Looney mixed lecture, discussion, and group sessions so that attendees could get the most out of the limited time.
Each group had a mentor from Google to provide expertise and guidance along the way. My own professional experience had taught me to think about clusters of 500-1000 machines, but the relative homogeneity of high-performance compute clusters was no preparation for analyzing the complex needs of a full-stack service. Being able to immediately apply what I learned was very helpful for keeping my focus.
This course, new to LISA '13, was inspired by Google's site reliability engineer (SRE) interviews. Only a small percentage of SRE candidates do well in the large system design portion of the interview. This is often due to a lack of exposure to systems at Google's scale. It's very clear that even the brightest sysadmin needs practice to do this sort of design well. Even at much smaller sites, the skills and methodology presented in this course can lead to better design.
Looney began with a discussion of requirements gathering. Asking the right questions and properly understanding the business needs are critical to a successful system design. Requirements include knowing resources and constraints in personnel, technology, finance, and usage. Ensuring that the business users assign a value to system downtime is important at this stage.
Next came a discussion of service-led design. Although the acronym was never explicitly mentioned, this portion of the course reminded me very much of ITIL training I have attended. The design begins with identifying service level indicators (SLIs): metrics that are unambiguous and tied to success. Each element of the design should have at least one SLI that measures user-impacting operations. SLIs form the basis of service level objectives (SLOs), which are goals that the system will try to achieve. Finally, service level agreements (SLAs) define what is promised from the system, including what happens when failures occur. SLAs are also used to validate the design to ensure the customer's desires are met and to serve as a starting point for cost/reliability tradeoffs.
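To make the distinction concrete, here is a minimal sketch (my own illustration, not from the course) of how an SLI relates to an SLO; the metric, the numbers, and the threshold are all hypothetical:

```python
# Hypothetical SLI/SLO illustration (not from the course).
# SLI: the measured, unambiguous metric.
# SLO: the target the system tries to achieve for that metric.

def availability_sli(successful_requests: int, total_requests: int) -> float:
    """SLI: the fraction of user requests that succeeded."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return successful_requests / total_requests

SLO_TARGET = 0.999  # "three nines": 99.9% of requests succeed

sli = availability_sli(successful_requests=999_412, total_requests=1_000_000)
print(f"SLI = {sli:.4%}; SLO met: {sli >= SLO_TARGET}")
```

An SLA would then wrap an SLO like this in a promise to the customer, spelling out what happens when the objective is missed.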
The course then turned to a variety of principles to keep in mind when designing large systems. Perhaps the hardest one for many sysadmins to come to terms with is the notion that hardware failure is a normal state of affairs and that one should never become too attached to any particular server. Looney also presented L. Peter Deutsch's fallacies of distributed computing (a small sketch of designing around the first fallacy follows the list):
- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology doesn't change
- There is one administrator
- The transport cost is zero
- The network is homogeneous
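As a hedged example of what designing around these fallacies looks like, here is a sketch of a client that assumes the network is not reliable and retries with exponential backoff; fetch() and TransientError are placeholders for whatever RPC layer is actually in use:

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a retryable network failure."""

def fetch(url: str) -> bytes:
    # Placeholder network call; a real implementation would go here.
    raise TransientError("network flaked out")

def fetch_with_retries(url: str, attempts: int = 5) -> bytes:
    """Retry a flaky call with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of retries; surface the failure to the caller
            # Backoff grows 1s, 2s, 4s, ...; jitter avoids retry storms.
            time.sleep((2 ** attempt) + random.random())
    raise AssertionError("unreachable")
```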
In the afternoon, the discussion turned to load and monitoring. Looney presented various load-balancing strategies, along with their advantages and disadvantages (two are sketched below). Load and overload naturally lead into monitoring and alerting. It's important that monitoring systems scale with the systems they monitor. Monitoring and reporting the right data is also important. Alerts should be actionable, and an alert should only fire when there's a darn good reason to wake the on-call admin in the middle of the night.
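For example, two of the simpler strategies can be contrasted in a few lines. This sketch is my own illustration, not Looney's material, and the backend names are made up:

```python
import itertools

BACKENDS = ["web-1", "web-2", "web-3"]  # hypothetical backend pool

# Strategy 1: round-robin -- simple and stateless per request, but it
# ignores how busy each backend actually is.
_rotation = itertools.cycle(BACKENDS)

def round_robin() -> str:
    return next(_rotation)

# Strategy 2: least-connections -- adapts to uneven load, but requires
# tracking in-flight requests for every backend.
_active = {backend: 0 for backend in BACKENDS}

def least_connections() -> str:
    backend = min(_active, key=_active.get)
    _active[backend] += 1  # caller must decrement when the request ends
    return backend
```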
The session closed with Looney and course attendees sharing war stories, including a graph that rose imperceptibly slowly and a datacenter that was literally on fire. My group didn't quite finish our design, but we were off to a good start, and we certainly had a greater appreciation for what goes into some of the large services we use every day.
Comments
Really, a very good course
I'll be taking this exercise and running my own ops team through it soon enough. We keep getting told from On High to think big, and this exercise is just that. Choices we make now will have to be refactored as we grow, and maybe this will let more of us make the right choices to reduce that future refactoring workload.