Scaling Production Repairs and QA in a Live Environment
(or How To Keep Up Without Breaking The World!)
By Shane Knapp, Google Inc.
Intro: Who is 'me'?
Vital Statistics:
- Been with Google since November 2003 as a member of the HWOps (Hardware Operations) team.
- Started out in the datacenters, moved up through the ranks to Systems Administrator and now Technical Project Management.
- Team lead for projects that developed and released the QA and repairs workflow managers.
- Currently working on improving operational software qualification processes, new platform introductions, and hardware diagnosis.
What this talk will and will not cover:
Will:
- Background on our repairs and QA infrastructure.
- Design decisions, good and bad.
- Tips, tricks, and what not to do (or get caught doing).
- How to approach system design at a large scale.
Will not:
- Numbers: machine count, machines in repairs, power usage, etc.
- Specific hardware information.
- Production machine management.
- How to game search algorithms :)
HWOps from 2003 to the present
Dark ages (late 1998 - mid 2003):
- Paper-and-pencil repairs and QA records.
- Manual machine diagnoses.
- Manual repairs and QA processes.
Renaissance (late 2003 - present):
- Growth: Machine count, clusters, platforms, machine configs, employees, campus cafes.
- Automation: AI1, AI2, qa_master, touchpad.
- Exponential growth of both datacenter technicians and systems administrators.
- Centralized data storage.
- Introduction of automated machine diagnoses.
- Next-gen repairs/QA machine manager is in testing.
The basic (current) repairs workflow
How we got to where we are today
Goals:
- Scale backend services, machine management, and diagnosis tools and processes to keep Google's fleet up and running during explosive growth.
How to keep up with the curve:
- Have a central, "standardized", and reliable data store.
- Everything was done in a live environment, so careful testing and releases were essential.
- Streamlining processes is key: This is done through improved tools and processes.
- Leaving certain areas open-ended enough for external collaboration.
How the tools and processes grew, part 1
- Very little math done initially: most improvements were done on areas with obvious problems, and were based on anecdotal evidence.
- Made choices to lock down the high-level processes, leaving floor-level interaction open for both regional and collaborative development.
- HWOps software development initially took place in a black box, isolated from the rest of Google, but that practice has been changing to embrace in-house technologies.
How the tools and processes grew, part 2
- Constant vigilance was required for all changes to production infrastructure and their impact on HWOps.
- The best tools from the field were discovered and folded into the Swiss Army Knife of HWOps tools.
- All projects have members from many regions, skill levels, and backgrounds. This diversity allows project hand-offs as needed.
- General employee growth means more people, ideas, and resources!
Moving forward: Lessons learned, part 1
- Maintainability: Code needs to be standardized in both language and style.
- Flexibility: Processes need to lock down the important bits (machine state, etc.) but still remain extensible (adding a new 'phase' to a workflow).
- Monolithic systems: Sometimes good, sometimes bad, always complex!
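The flexibility point above can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the talk does not show the real workflow managers, so the state names, the `Workflow` class, and the phase functions are assumptions made up for illustration. The idea is only that the set of machine states is fixed in one place, while new phases can be added without touching the core.

```python
from enum import Enum


class MachineState(Enum):
    # Hypothetical fixed set of states; the talk does not list the real ones.
    IN_REPAIR = "in_repair"
    IN_QA = "in_qa"
    READY = "ready"


class Workflow:
    """Ordered phases that all share one locked-down state model."""

    def __init__(self):
        self._phases = []  # (name, function) pairs, run in insertion order

    def add_phase(self, name, func):
        # Extension point: a new phase is appended without changing run().
        self._phases.append((name, func))

    def run(self, machine):
        for name, func in self._phases:
            new_state = func(machine)
            # Locked down: a phase may only hand back a known MachineState.
            if not isinstance(new_state, MachineState):
                raise ValueError(f"phase {name!r} returned an unknown state")
            machine["state"] = new_state
            machine["history"].append(name)
        return machine


# Hypothetical phases; each one returns the state the machine moves to.
def diagnose(machine):
    return MachineState.IN_REPAIR


def repair(machine):
    return MachineState.IN_QA


def qa(machine):
    return MachineState.READY


workflow = Workflow()
workflow.add_phase("diagnose", diagnose)
workflow.add_phase("repair", repair)
workflow.add_phase("qa", qa)

machine = workflow.run(
    {"name": "host1", "state": MachineState.IN_REPAIR, "history": []}
)
```

In this sketch a regional team could register an extra phase with `add_phase`, but a phase that tries to invent its own state is rejected by `run`.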
Moving forward: Lessons learned, part 2
- Plan, plan, plan. A quick solution is sometimes your downfall, and long-term projects with unknown dependencies can be your Achilles' heel.
- Accept that sometimes the solution at hand is not the 'best', but 'good enough'.
- Engineering in a live environment is akin to base jumping: dangerous, yet exhilarating... And you need to keep an eye out for those pointy rocks.
Moving forward: Lessons learned, part 3
- Choose technologies carefully (Python vs. Perl vs. Ruby), and then use them consistently.
- Databases: Sleeping with the enemy.
- Automate everything you can. Workflows, workflows, workflows!
- Statistical analysis is your friend.
- And the biggest lesson learned: Be Careful!
War stories and open floor