LISA tutorials survey the topic, then dive into the specifics of what to do and how to do it. Instructors are well-known experts in their fields, selected for their ability to teach complex subjects. Attend tutorials at LISA16 and take valuable skills back to your company or organization. New topics are woven in with old favorites to create the most comprehensive training program to date.
LISA16 mini tutorials take place Wednesday through Friday as part of the main Conference Program and offer 90-minute overviews of new and emerging technologies. These sessions are included in the registration fee for the Conference Program.
A variety of topics are being covered at LISA16. Use the icons listed below to focus on a key subject area:
Follow the icons throughout the training sessions below. You can combine days of the conference program or workshops with training sessions to build the conference that meets your needs. Pick and choose the sessions that best fit your interests—focus on just one topic or mix and match.
Continuing Education Units (CEUs)
USENIX provides Continuing Education Units for a small additional administrative fee. The CEU is a nationally recognized standard unit of measure for continuing education and training and is used by thousands of organizations.
Each full-day tutorial qualifies for 0.6 CEUs. You can request CEU credit by completing the CEU section on the registration form. USENIX provides a certificate for each attendee taking a tutorial for CEU credit. CEUs are not the same as college credits. Consult your employer or school to determine their applicability.
Training Materials
USB Drives
Training materials will be provided to you on an 8GB USB drive. If you'd like to access them during your class, please remember to bring a laptop. There will not be any formally printed materials, but print-on-demand stations will be available.
Full Day
The Linux operating system is commonly used both in the data center and for scientific computing applications; it is used in embedded systems as small as a wristwatch, as well as in large mainframes. As a result, the Linux system has many tuning knobs so that it can be optimized for a wide variety of workloads. Some tuning of the Linux operating system has been done "out of the box" by enterprise-optimized distributions, but there are still many opportunities for a system administrator to improve the performance of his or her workload on a Linux system.
This class will cover the tools that can be used to monitor and analyze a Linux system, and key tuning parameters to optimize Linux for specific server applications, covering the gamut from memory usage to filesystem and storage stacks, networking, and application tuning.
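As a tiny taste of that tuning surface, the sketch below (Python; the three knobs shown are common examples chosen for illustration, not the class's syllabus) reads a few kernel tunables from /proc/sys, which is the same interface the sysctl command uses:

```python
# Minimal sketch: inspect a few common kernel tunables via /proc/sys.
# Writing to the same files (as root) is how `sysctl -w` changes them.
knobs = [
    "vm/swappiness",        # how aggressively the kernel swaps
    "vm/dirty_ratio",       # writeback threshold for dirty page cache
    "net/core/somaxconn",   # cap on a socket's listen() backlog
]

for knob in knobs:
    with open("/proc/sys/" + knob) as f:
        print(knob.replace("/", "."), "=", f.read().strip())
```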
Intermediate and advanced Linux system administrators who want to understand their systems better and get the most out of them.
The ability to hone your Linux systems for the specific tasks they need to perform.
- Strategies for performance tuning
- Characterizing your workload's requirements
- Finding bottlenecks
- Tools for measuring system performance
- Memory usage tuning
- Filesystem and storage tuning
- Network tuning
- Latency vs. throughput
- Capacity planning
- Profiling
- Memory cache and TLB tuning
- Application tuning strategies
Theodore Ts'o, Google
Theodore Ts'o is the first North American Linux Kernel Developer, and started working with Linux in September 1991. He previously served as CTO for the Linux Foundation, and is currently employed at Google. Theodore is a Debian developer, and is the maintainer of the ext4 file system in the Linux kernel. He is the maintainer and original author of the e2fsprogs userspace utilities for the ext2, ext3, and ext4 file systems.
Overview
The Automation Tools Bootcamp is a tutorial for individuals looking for exposure to and usage of new IT automation tools. We will learn about and then use Vagrant, Chef, Packer, Docker, Terraform and Artifactory to deploy a small application in local VMs.
We will cover a progression of tasks, leveraging information from previous sections to deploy a small app that runs identically on your local development machine or on a shared server. Get rid of the “it works for me” mentality when you know your local VM is identical to your co-workers' and your shared environments.
Operations, QA, those who choose to call themselves DevOps, and even managers can come learn.
These automation tools are freely available to engineers, enabling them to safely break local environments until the change in configuration has been perfected. Basic exposure to these tools will allow attendees to return to work with new ways to tackle the problems they face daily.
Vagrant, Chef, Packer, Docker, Terraform, and Artifactory
Tyler Fitch, Chef
Tyler is an Architect in Chef’s Customer Success program, championing successful patterns and delightful experiences in automation to enterprise customers. Prior to working at Chef, he spent a decade as an engineer for Adobe, developing and automating commerce services for adobe.com using a variety of technologies. He lives in Vancouver, Washington, and when he’s not programming enjoys lacrosse and using his passport.
With this hands-on tutorial, you will develop an understanding for designing, building, and running reliable Internet services at a large scale.
This tutorial is suitable for executives who need to specify and evaluate systems, engineers who build systems, and IT professionals who want to run first-class services built with reliable systems.
You will take back an understanding of how to evaluate system designs, how to specify and build large systems, and how to operate these systems in the real world in a way that will scale as the system grows.
- Designing Reliable Systems
- Building Reliable Systems
- Running Reliable Systems
Salim Virji, Google
Salim Virji is a Site Reliability Engineer at Google. He has worked on infrastructure software, back-end systems, front-end applications, and delightful ways to connect them all. He lives and works in New York City.
Half Day Morning
Gardner Room
This introductory tutorial will start by examining some of the ethical responsibilities that come along with access to other users' data, accounts, and confidential information. We will look at several case studies involving both local and cloud usage. All attendees are strongly encouraged to participate in the discussion. Numerous viewpoints will be considered in order to give students a perspective from which to develop their own reasoned response to ethical challenges.
Anyone who is a system administrator or has access to personal/confidential information, or anyone who manages system administrators or makes policy decisions about computer systems and their users. There are no prerequisites for this class.
After completing this tutorial you will be better prepared and able to resolve ethically questionable situations and will have the means to support your decisions.
- Why it is important to set your ethical standards before an issue comes up
- Who is impacted by "expectations of ethical conduct"
- Why this isn't just an expectation of system administrators
- Implicit expectations of ethical behavior
- Ethics and The Cloud
- Coercion to violate ethics
- Well-intentioned violations of privacy
- Collection, retention, and protection of personal data
- Management directives vs. friendships
- Software piracy/copying in a company, group, or department
Lee Damon, University of Washington
Lee Damon has a B.S. in Speech Communication from Oregon State University. He has been a UNIX system administrator since 1985 and has been active in SAGE (US) & LOPSA since their inceptions. He assisted in developing a mixed AIX/SunOS environment at IBM Watson Research and has developed mixed environments for Gulfstream Aerospace and QUALCOMM. He is currently leading the development effort for the Nikola project at the University of Washington Electrical Engineering department. Among other professional activities, he is a charter member of LOPSA and SAGE and past chair of the SAGE Ethics and Policies working groups. He chaired LISA '04 and co-chaired CasITConf '11, '13, and '14.
Nicole Forsgren, DORA
Fairfax Room
This tutorial is a course in statistics with a specific focus on system administrators and the types of data they face. We assume little prior knowledge of statistics and cover the most common concepts in descriptive statistics, applying them to data taken from real-life examples. Our aim is to show which methods provide good interpretations of data, such as distributions and probability, and how to formulate basic statements about the properties of observed data.
The first part will cover descriptive statistics for single datasets, including mean, median, mode, range, and distributions. When discussing distributions, we will cover probabilities through percentiles (e.g., a normal distribution is very uncommon in ops data). This session will use a prepared dataset and spreadsheet (LibreOffice or OpenOffice, because they work on all platforms). We have data on the number of players of an online game over a six-month period. In this exercise, we will analyze the distribution and try to make statements like, “What is the likelihood that we will see more than 27,000 simultaneous players?” One of the lessons is that the top 5% of the distribution accounts for almost a doubling in players, which is interesting. We then extend the discussion to organizational implications: imagine that your job is to buy resources for a service like this, and you have to double your rig to cope with something that is only 5% likely to happen. How would you explain that in a meeting?
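To make that kind of percentile statement concrete, here is a minimal sketch in Python with NumPy. It is not the tutorial's own material: the players.csv file is a hypothetical stand-in for the game dataset, with one simultaneous-player count per line.

```python
import numpy as np

# Hypothetical stand-in for the tutorial's game dataset:
# one observation of simultaneous player counts per line.
players = np.loadtxt("players.csv")

# Descriptive statistics for a single dataset.
print("mean:  ", players.mean())
print("median:", np.median(players))
print("range: ", players.max() - players.min())

# A probability statement from the empirical distribution:
# what fraction of observations exceed 27,000 simultaneous players?
print("P(players > 27,000) ~", (players > 27000).mean())

# The capacity-planning angle: how much headroom does the top 5% demand?
print("95th percentile:", np.percentile(players, 95), "peak:", players.max())
```

If the peak is nearly double the 95th percentile, you have reproduced the lesson above: sizing for the rare top 5% can mean buying twice the capacity.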
The second part will discuss comparisons using two common methods that can be calculated in a spreadsheet: correlations and regressions. Correlation will be used as a tool to identify interesting relationships in data; ranked correlation may be considered for two datasets that have the same “flow” but on separate ranges (e.g., the correlation between web requests and database requests). Regression can also be used to identify relationships. For example, using a regression plot between two variables, one could identify bottlenecks by comparing the load of two tiers (db tier vs. web tier). In a scalable system, we would expect a nice 45-degree linear relationship between the two. However, if the database tier struggles before the web tier, we would see the observations bend upward away from the linear approximation (if the db load is on the y axis) as the load increases.
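As a rough illustration of that bottleneck check, the following sketch (again Python with NumPy; the two load series are made-up numbers, not course data) fits a straight line to db-tier load against web-tier load and inspects where the observations pull away from it:

```python
import numpy as np

# Hypothetical per-minute load samples for two tiers of one service.
web_load = np.array([10, 20, 30, 40, 50, 60, 70, 80], dtype=float)
db_load  = np.array([11, 19, 31, 42, 55, 70, 90, 115], dtype=float)

# Pearson correlation: is there a strong linear relationship at all?
r = np.corrcoef(web_load, db_load)[0, 1]
print("correlation:", round(r, 3))

# Least-squares line: db_load ~ slope * web_load + intercept.
slope, intercept = np.polyfit(web_load, db_load, 1)
print("slope:", round(slope, 2), "intercept:", round(intercept, 2))

# A residual that turns sharply positive at the top of the range means
# the db tier is rising faster than the line predicts: a bottleneck hint.
residuals = db_load - (slope * web_load + intercept)
print("residuals at the three highest loads:", residuals[-3:])
```

Here the correlation is still high, but the residual swinging sharply positive at the highest load is the upward bend described above.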
Throughout, we will focus on takeaways, coupling the different statistical methods with the types of answers they can provide, such as: “Can the average of a dataset explain the outer limits of my data?” It is easy to lose an audience with a topic like statistics. We are aware of this risk and will use active learning tools such as Socrative and Kahoot to engage the audience and get them participating.
Sysadmins who are faced with data overload and wish they had some knowledge of how statistics can be used to make more sense of it. We assume little prior knowledge of statistics, but a basic mathematical proficiency is recommended.
- A fundamental understanding of how descriptive statistics can provide additional insight into the data of the sysadmin world, and a foundation for further self-study in statistics
- A basic set of statistical approaches that can be used to identify fundamental properties of the data attendees see in their own environments, and to identify patterns in that data
- The ability to make accurate and clear statements about their metrics that are valuable to the organization
- Descriptive statistics for single datasets, including: mean, median, mode, range, and distributions
- Basic analysis of distributions and probabilities using percentiles typically seen in ops
- Interpretation of analyses to include team and business implications
- Regression analysis to suggest predictive relationships, with an emphasis on interpretation and implications
- Correlation analysis and broad pattern detection (if time allows)
Kyrre Begnum, Oslo and Akershus University College of Applied Sciences
Kyrre Begnum works as an Associate Professor at Oslo and Akershus University College of Applied Sciences where he teaches sysadmin courses at the MSc and BSc levels. Kyrre holds a PhD from the University of Oslo with a focus on understanding the behavior of large systems. He has experience with large scale virtual machine management, cloud architectures and developing sysadmin tools. His research focus is on practical and understandable approaches that bring advanced models to real life scenarios.
Nicole Forsgren, DORA
Dr. Nicole Forsgren is an IT impacts expert who shows leaders and practitioners how to unlock the potential of technology change in their organizations. Best known for her work with tech professionals and as the lead investigator on the State of DevOps Reports, she is CEO and Chief Scientist at DORA (DevOps Research and Assessment) and an Academic Partner at Clemson University. In a previous life, she was a professor, sysadmin, and hardware performance analyst.
Half Day Afternoon
Fairfax Room
People think of “on call” as responding to a pager that beeps because of an outage. In this class, you will learn how to run an on-call system that improves uptime and reduces how often you are paged. We will start with a monitoring philosophy that prevents outages. Then we will discuss how to construct an on-call schedule—possibly in more detail than you've cared about before—but, as a result, it will be more fair and less stressful. We'll discuss how to conduct “fire drills” and “game day exercises” that create antifragile systems. Lastly, we'll discuss how to conduct a postmortem exercise that promotes better communication and prevents future problems.
Managers or sysadmins with on-call responsibility
- Knowledge that makes being on call more fair and less stressful
- Strategies for using monitoring to improve uptime and reliability
- Team-training techniques such as "fire drills" and "game day exercises"
- How to conduct better postmortems/learning retrospectives
- Why your monitoring strategy is broken and how to fix it
- Building a more fair on-call schedule
- Monitoring to detect outages vs. monitoring to improve reliability
- Alert review strategies
- Conducting “fire drills” and “game day exercises”
- "Blameless postmortem documents"
Thomas Limoncelli, StackOverflow.com
Tom is an internationally recognized author, speaker, system administrator, and DevOps advocate. His latest book, the 3rd edition of The Practice of System and Network Administration, launched last month. He is also known for The Practice of Cloud System Administration and Time Management for System Administrators (O'Reilly). He works in New York City at StackOverflow.com and has previously worked at Google, Bell Labs/Lucent, AT&T, and others. He blogs at EverythingSysadmin.com and tweets @YesThatTom. He lives in New Jersey.
Gardner Room
Sysadmins freely acknowledge how important documentation is to their daily lives, and in the same sentence will loudly complain that they don’t have time to produce documentation. This class is about how to produce effective, useful and timely documentation as part of your normal sysadmin activities. Particular emphasis is placed on documentation as a time-saving tool rather than a workload imposition.
System administrators of all types and levels who need to produce documentation for the systems they manage, or who want to improve their documentation skills. Documentation can be the difference that turns you from a good sysadmin into a great sysadmin!
- The skills to improve personal and team documentation quality
- A solid understanding of how to establish and maintain effective documentation practices
- Why system administrators need to document
- Documentation as part of your daily workflow
- Targeting your audience
- Common mistakes made in documentation
- Tools to assist the documentation process (including effective use of wikis)
Half Day Morning
Lex Neva, Heroku
Fairfax Room
Your site’s back up, you’re back in business. Do you have a way to make sure that problem doesn’t happen again? And if you do, do you like how it works?
Heroku uses a blameless retrospective process to understand and learn from our operational incidents. We’ve recently released the templates and documentation we use in this process, but experience has taught us that facilitating a retrospective is a skill that’s best taught person to person.
This tutorial will take you through a retrospective based on the internal and external communications of a real Heroku operational incident. We’ve designed it to help you experience first-hand the relaxed, collaborative space that we achieve in our best retrospectives. We’ll practice tactics like active listening, redirecting blame, and reframing conversations. Along the way, we’ll discuss how we developed this process, what issues we were trying to solve, and how we’re still iterating on it.
Managers, tech leads, and anyone interested in retrospective culture and iterating on processes.
Attendees will have the materials and first-hand experience to advocate for (or to begin) an incident retrospective process at their workplace, or to improve a process they might already be using.
- Why run a retrospective
- Goal of a retrospective
- Blameless retrospectives
- Facilitating: redirecting blame, reframing, drawing people out
- How to structure a retrospective
- Preparing for a retrospective
- Five “why”s/infinite “how”s
- Human error
- Achieving follow-through on remediation items
Courtney Eckhardt, Heroku
Courtney comes from a background in customer support and internet anti-abuse policy. She combines this human-focused experience with the principle of Conway’s Law and the work of Kathy Sierra and Don Norman into a wide-reaching and humane concept of operational reliability.
Lex Neva, Heroku
Lex Neva is probably not a super-villain. He has six years of experience keeping large services running, including Linden Lab's Second Life and DeviantArt.com, and he is currently an SRE at Heroku. While originally trained in computer science, he's found that he most enjoys applying his software engineering skills to operations. A veteran of many large incidents, he has strong opinions on incident response, on-call sustainability, and reliable infrastructure design, and he currently runs SRE Weekly (sreweekly.com).
It's 2016, and at this point why would anyone care about an init system? Well, apparently, not only is process management essential to the operating system, but all the hype around things like containers and resource management is also making this topic sexy. This session will be a hands-on, interactive look at the architecture, capabilities, and administrative how-tos of systemd. Anyone who's new to systemd or looking to dig deeper into some of the advanced features should attend. Please bring a laptop with a virtual machine running a distribution of your choice that uses systemd.
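By way of a preview of the unit file anatomy covered below, here is a minimal sketch of a service unit. The service name and paths are hypothetical; the directives are standard systemd options.

```ini
# /etc/systemd/system/myapp.service (hypothetical example)
[Unit]
Description=Example application service
After=network.target

[Service]
ExecStart=/usr/local/bin/myapp --config /etc/myapp.conf
Restart=on-failure
# Resource management through systemd's cgroups interface:
MemoryLimit=512M

[Install]
WantedBy=multi-user.target
```

After dropping a file like this in place, `systemctl daemon-reload` makes systemd read it, `systemctl start myapp` launches it, and `systemctl status myapp` shows its state and recent journal lines.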
Linux system administrators, package maintainers and developers who are transitioning to systemd, or who are considering doing so.
Understanding of how systemd works, where to find the configuration files, and how to maintain them.
- The basic principles of systemd
- systemd's major components
- Anatomy of a systemd unit file
- Understanding and optimizing the boot sequence
- Improved system logging with the journal
- Resource management via systemd's cgroups interface
- Simple security management with systemd and the kernel's capabilities
- systemd, containers, and virtualization
Ben Breard, Red Hat
Ben Breard is the Technology Product Manager for Linux Containers at Red Hat, where he focuses on driving the container roadmap and RHEL Atomic Host, and evangelizes open source technology in his free time. Previously, he was a Solutions Architect and worked closely with key customers on cloud/systems management, virtualization, and all things RHEL. Ben joined Red Hat in 2010 and currently works out of Dallas, Texas.
Lee Damon, University of Washington
Gardner Room
Systems Administrators are expected to be intelligent, dedicated, and professional experts in our field. Yet when compared to other professions of similar education, we often do not receive credit for our efforts and receive less respect from our fellow workers.
This problem doesn’t just affect our personal well-being; businesses make poorer decisions when input from technical people is disregarded or overlooked. As professionals, we are all expected to step up and defend ourselves, our teams, and our projects. Being able to communicate meaningfully and accurately is critical to our success.
This tutorial will provide practical techniques for both in-person and written interpersonal challenges. Difficult conversations are a part of life as well as business, and we need to develop the tools for dealing with them. We will review materials from several sources, including our own experiences, and work through practical exercises to give attendees a strong starting point for their own difficult communication challenges.
IT Professionals and anyone who must deal with difficult people under stressful conditions.
- How to deal effectively with verbal and written conflict
- How to identify and stop verbal and written abuse
- How to maximize your chances to succeed in difficult conversations
- E.I.Q. and how to use it
- Satir Modes of Conversation
- Verbal Jujitsu
- Lifescripts
John H. Nyhuis
John H. Nyhuis is an Infrastructure Engineer, serving as IT Director at the Altius Institute for Biomedical Sciences. He brings 20 years of experience in infrastructure engineering and IT management across industry, academic, and medical environments, including extensive experience with scalable system architecture, implementation, optimization, and deployment:
- Leadership: Experienced at building consensus in highly diverse environments. Project Management (Scrum and LEAN), Risk Management, IT audits and remediation, HIPAA, FERPA
- Management: Expense Controls, Budgeting, Employee Management, Project Proposals, Process Improvement
- Vendor Relations: Contract Negotiation, Fundraising / Equipment Donations
- Architecture/Design: Cloud Computing, Virtualization, Automation, Scalability, Root Cause Analysis
- Deployments: Massively Parallel Implementations, Global Deployments, Code Management, Release Testing
In his free time, John serves as an Economic Development Commissioner for the City of Lake Forest Park, in the great state of Washington.
Lee Damon, University of Washington
Lee Damon has a B.S. in Speech Communication from Oregon State University. He has been a UNIX system administrator since 1985 and has been active in SAGE (US) & LOPSA since their inceptions. He assisted in developing a mixed AIX/SunOS environment at IBM Watson Research and has developed mixed environments for Gulfstream Aerospace and QUALCOMM. He is currently leading the development effort for the Nikola project at the University of Washington Electrical Engineering department. Among other professional activities, he is a charter member of LOPSA and SAGE and past chair of the SAGE Ethics and Policies working groups. He chaired LISA '04 and co-chaired CasITConf '11, '13, and '14.
Full Day
Constitution Ballroom B
Insufficient knowledge of operating system internals is my most common reason for passing on an interview candidate. Anyone can learn that you run tool X to fix problem Y. But what happens when there is no tool X, or when you can't even accurately pinpoint the root cause of why “it's sometimes slow”?
This will be a no-holds-barred, fury-road-paced review of all major parts of modern operating systems with specific emphasis on what's important for system administrators. It will provide just enough of an academic focus to bridge the "whys" so you can make better use of fiddling with the "whats" on a day-to-day basis.
You will learn about process management, scheduling, file system architecture and internals, interrupt management, the mysteries of the MMU and TLB, Bélády's anomaly, page replacement algorithms, and hopefully a bit of networking. In a nutshell, we'll cover 16 weeks of college-level material in a few hours.
Buckle up.
- All admins who did not take the Comp-Sci academic route and never had a course in OS internals
- Inexperienced admins whose coursework or training didn't cover OS internals in the depth it should have (modern OS courses have become a shadow of their former selves and commonly require writing no OS code)
- More experienced admins who haven't had to address these sorts of issues on a regular basis, who probably know a lot about some individual aspects but could benefit from having everything put into a broader context
Attendees will gain a deeper understanding of what goes on inside the kernel and the areas where things can go wrong. We'll explore how little the concept of "system load" captures about the true system state, and attendees will be prepared to improve both their operational response methodologies as well as their monitoring goals.
Morning:
- Scheduling and Process Management
- Memory Management and the MMU
- Virtualization and its impact on these
Afternoon:
- File System Architecture for sysadmins, covering ext2/3/4, NTFS, and ZFS
- Storage layer performance, disks, RAID, and SANs
- The impact of virtualization on these
Caskey L. Dickson, Microsoft Corporation
Caskey L. Dickson is a Site Reliability Engineer at Microsoft where he is part of the leadership team reinventing operations at Azure. Before that he was at Google where he worked as an SRE/SWE, writing and maintaining monitoring services that operate at "Google scale" as well as business intelligence pipelines. He has worked in online services since 1995 when he turned up his first web server and has been online ever since. Before working at Google, he was a senior developer at Symantec, wrote software for various Internet startups such as CitySearch and CarsDirect, ran a consulting company, and even taught undergraduate and graduate computer science at Loyola Marymount University. He has a B.S. in Computer Science, a Masters in Systems Engineering, and an M.B.A. from Loyola Marymount.
9:30 am–12:30 pm: Commonwealth Ballroom
1:30 pm–5:00 pm: Back Bay Ballroom D (LISA Lab)
The tutorial will cover topics in Software Defined Networking (SDN) in a presentation format oriented towards network and system administrators. SDN separates the network's control plane (the software that decides how traffic is forwarded) from its data plane (the routers and switches in the network that forward packets).
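To give a feel for what programming the control plane looks like, here is a minimal sketch of an app for the Ryu controller (one of the controllers covered below). The class name is hypothetical and the handler is only a stub; a real application would install flow rules in response to the event.

```python
from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import MAIN_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3


class MinimalApp(app_manager.RyuApp):
    """Skeleton control-plane app: reacts to packet-in events."""
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    @set_ev_cls(ofp_event.EventOFPPacketIn, MAIN_DISPATCHER)
    def packet_in_handler(self, ev):
        # A data-plane switch has punted a packet to the control plane;
        # this is where a real app would compute and install flow rules.
        self.logger.info("packet-in from switch %s", ev.msg.datapath.id)
```

Run with `ryu-manager minimal_app.py` against an OpenFlow 1.3 switch. The separation is visible in the code: the switch only forwards, while this process decides.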
This course will cover the aspects of Software Defined Networking that relate most closely to network operations. We will divide the course into four parts:
- Overview and motivation of SDN
- Commercial operational SDN controllers (Ryu, ODL) and switch capabilities
- Network virtualization technologies
- Network operations use cases (including SDN for the wide area, data centers, home networks, and wireless)
The after-lunch portion of this class will be held in the LISA Lab.
Beginner and Intermediate Virtual Infrastructure Administrators
Attendees will take back knowledge about SDN that will help them evaluate whether it is an appropriate technology to apply in their own networks.
Attendees will better understand what SDN is, the types of problems that it can (and cannot) solve, the capabilities of current software controller platforms, and the capabilities (and shortcomings) of existing hardware switches.
The course will also include "war stories" from successful (and stunted) SDN deployments that will help attendees better evaluate the suitability of SDN for solving their own network management problems.
- Overview and motivation of SDN
- Commercial operational SDN controllers (Ryu, ODL) and switch capabilities
- Ryu
- Open Daylight
- An overview of hardware switch capabilities
- Network virtualization technologies
- Nicira NSX
- FlowVisor
- Network operations use cases (including SDN for the wide area, data centers, home networks, and wireless)
- SDX: Software Defined Internet Exchange Points
- SD-WAN: SDN in Wide Area Networks
- SDN in home networks
- SDN in data centers
Nick Feamster, Princeton University
Nick Feamster is a professor in the Computer Science Department at Princeton University and the Acting Director of the Princeton University Center for Information Technology Policy (CITP). Before joining the faculty at Princeton, he was a professor in the School of Computer Science at Georgia Tech. He received his Ph.D. in Computer Science from MIT in 2005, and his S.B. and M.Eng. degrees in Electrical Engineering and Computer Science from MIT in 2000 and 2001, respectively. His research focuses on many aspects of computer networking and networked systems, with a focus on network operations, network security, and censorship-resistant communication systems. In December 2008, he received the Presidential Early Career Award for Scientists and Engineers (PECASE) for his contributions to cybersecurity, notably spam filtering. His honors include the Technology Review 35 "Top Young Innovators Under 35" award, the ACM SIGCOMM Rising Star Award, a Sloan Research Fellowship, the NSF CAREER award, the IBM Faculty Fellowship, the IRTF Applied Networking Research Prize, and award papers at the SIGCOMM Internet Measurement Conference (measuring Web performance bottlenecks), SIGCOMM (network-level behavior of spammers), the NSDI conference (fault detection in router configuration), USENIX Security (circumventing web censorship using Infranet), and USENIX Security (web cookie analysis).
Half Day Afternoon
Commonwealth Room
Data analysis is not just about discovery; it's about communication. Good communication tells stories. Savvy system administrators provide their management with the background needed to maintain operations, manage budgets, and support users, and provide their coworkers with the insights needed to keep their systems solid.
The R programming language and ecosystem constitute a rich tool set for performing system analyses, for communicating the results and importance of those analyses, and for automating the process with reproducible and repeatable results. This brief introduction to R and its ecosystem will provide a walk along the mainline—coming up to speed on R, accessing data, analyzing data, and getting the message out.
This tutorial is designed to:
- motivate you to pick up R
- demonstrate useful techniques using R
- illustrate ways to simplify your life by automating data analysis and reporting
In-class demonstrations will be augmented with hands-on opportunities during the workshop. Additional exercises and data sets that students can explore following the workshop will be provided. If you plan on working on the exercises, install R and (optionally) R Studio.
System administrators who are awash in operational data and want to do a more efficient job of understanding their data and communicating their findings. Facility with programming and knowledge of basic descriptive statistics is assumed. Prior knowledge of R is not required.
- Acquaintance with R, R packages, and R Studio
- Understanding where R fits into the system administrator’s tool set
- Familiarity with basic R data-manipulation techniques
- Motivation to learn or improve your R skills
- Next steps in learning and mastering R
- Introduction to the R ecosystem
- R as a language
- Basic programming in R
- The data analysis workflow
- Reading and writing data from files and pipes
- Data frames and data frame manipulations
- Exploratory analysis
- Using the ggplot2 package for graphing
- Other useful R packages
Examples will be based on situations encountered during routine system operations.
Robert A. Ballance, Ph.D.
Dr. Robert Ballance honed his R-programming skills while managing large-scale High-Performance Computing systems for Sandia National Laboratories. While at Sandia, he developed several R packages used internally for system analysis and reporting. Prior to joining Sandia in 2003, Dr. Ballance managed systems at the University of New Mexico High Performance Computing Center. He has consulted, taught, and developed software, including R packages, Perl applications, C and C++ compilers, programming tools, Internet software, and Unix device drivers. He is a member of USENIX, the ACM, the IEEE Computer Society, the Internet Society, and the Long Now Foundation. He was a co-founder of the Linux Clusters Institute and recently served as Secretary of the Cray Users Group. Bob received his Ph.D. in Computer Science from U.C. Berkeley in 1989. He is currently serving as a White House Presidential Innovation Fellow.
Fairfax Room
Whether you are a sysadmin, dev, or web ops, time management can be more difficult than any technology issue. This class is for new and junior system admins who have found themselves in over their heads, overloaded, and looking for a better way to survive the tech world.
This tutorial presents fundamental techniques for eliminating interruptions and distractions so you have more time for projects, prioritization techniques so the projects you do work on have the most impact, plus "The Cycle System," which is the easiest and most effective way to juggle all your tasks without dropping any.
Sysadmins, devs, operations, and their managers
By the end of this class, you will be able to schedule and prioritize your work (rather than be interruption-driven), have perfect follow-through (never forget a request), and limit your work-time to 40 hours a week (have a life).
- Why typical “time management” strategies don’t work for sysadmins
- What makes “to-do” lists fail, and how to make them work
- How to eliminate “I forgot” from your vocabulary
- How to manage interruptions: preventing them, managing the ones you get
- Delegating to coworkers without them knowing
- Achieving perfect follow-through
- The Cycle System for recording and processing to-do lists
- Prioritization techniques
- Task grouping: batching, sharding, and multitasking
- Handling situations like a big outage disrupting your perfectly planned day
Thomas Limoncelli, StackOverflow.com
Tom is an internationally recognized author, speaker, system administrator, and DevOps advocate. His latest book, the 3rd edition of The Practice of System and Network Administration, launched last month. He is also known for The Practice of Cloud System Administration and Time Management for System Administrators (O'Reilly). He works in New York City at StackOverflow.com and has previously worked at Google, Bell Labs/Lucent, AT&T, and others. He blogs at EverythingSysadmin.com and tweets @YesThatTom. He lives in New Jersey.
Cody Chapman, Heraflux Technologies
David Klee, Heraflux Technologies
Constitution Ballroom A
Not very long ago, the very idea of virtualizing production, mission-supporting enterprise applications was so career-threatening that only the brave dared entertain it for longer than a few seconds. Fast forward to now: virtualization is so pervasive and well-accepted that the inverse is true. "Virtualize First" is now a standard corporate mandate in large enterprises, and no modern commercial application is exempt.
Sadly, embracing virtualization has turned out not to be the panacea for everything that ails an enterprise. In fact, virtualization often contributes to sub-optimal performance, availability, recoverability, and agility for many applications in the enterprise—with lots of frustration, heartburn, reduced productivity, and, yes, interrupted personal lives. In a rush to be part of the "cool crowd," many enterprises fail to identify and account for the intricacies and requirements of the virtualization platform, relegating such considerations to the secondary or tertiary tiers of the "due diligence" scale.
If you have adopted virtualization as a platform for your mission-critical applications, or if you are in the process of doing so, please be sure to attend this tutorial. It will provide a comprehensive and detailed knowledge transfer that enables you to avoid the common pitfalls encountered in a VMware vSphere virtualization project. We will discuss and explain the considerations for successfully running your mission-critical applications on a vSphere-based infrastructure without loss of performance, availability, recoverability, or resilience. The tutorial will go beyond the standard slide-ware and present an actual demonstration of the effects of certain configuration optimization strategies on the overall condition of the virtualized applications and the virtual infrastructure as a whole.
- Infrastructure, Solution and Enterprise Architects
- Virtual Infrastructure and Applications Administrators
- Network Administrators
- IT Operators
The tutorial will be interactive, encouraging questions from participants—so please come in with your own unique and specific questions. The tutorial will provide you with tips and tricks drawn directly from the most current VMware guidance, recommendations, and knowledge-based references, as well as from real-life customer situations.
- Virtualization concepts
- Virtualization stack
- Hardware abstraction and the relationship and inter-dependencies between the physical and virtual components
- Pooling and sharing resources in a virtual environment
- Common assumptions that lead to performance degradation for virtualized applications
- Configuration optimization that enhances performance
- Availability and resilience within a VMware vSphere virtual infrastructure
Deji Akomolafe, Microsoft Applications Virtualization Lead, VMware
Deji Akomolafe (a CTO Ambassador and Staff Solutions Architect within VMware's Global Field and Partner Readiness Group) specializes in the virtualization of Microsoft Business Critical Applications on the VMware's vSphere platform. Deji is a regular speaker at many industry-leading technical conferences and workshops (including VMworld, SQL Saturday, EMCWorld, and Partners Exchange), presenting technical subject matters related to virtualization and providing technical guidance to help clients enhance their expertise and ability to optimally virtualize and operate their critical applications.
Cody Chapman, Heraflux Technologies
Cody Chapman is a Solutions Architect with Heraflux Technologies. His areas of expertise are virtualization, cloud, storage, performance, datacenter architecture, risk mitigation through high availability and disaster recovery, and performing technical exorcisms. He has worked on systems large and small in a wide variety of industries. He is actively working to automate every facet of datacenter and database management. You can read his blog at heraflux.com, and reach him on Twitter at @codyrchapman.
David Klee, Heraflux Technologies
David Klee is a Microsoft MVP and VMware vExpert with over seventeen years of IT experience. David is the Founder of Heraflux Technologies, a consultancy focused on data virtualization and performance tuning, datacenter architecture, and business process improvements. You can read his blog at davidklee.net and reach him on Twitter at @kleegeek.
Full Day
AJ Bowen, Convox
Constitution Ballroom A
Docker is an open platform to build, ship, and run any application, anywhere. In this hands-on tutorial, you will learn advanced Docker concepts, and see how to deploy and scale applications using Docker Swarm clustering abilities and other open source tools of the Docker ecosystem.
This tutorial is living material: it is delivered at least once a month in public sessions all around the U.S. and Europe. Since the Docker platform in general, and Docker Swarm in particular, evolve rapidly, this tutorial evolves as well, following closely the releases of the various components of the Docker ecosystem: Engine, Compose, Swarm, Machine.
Docker users who want production-grade container deployments.
You should be familiar with Docker and basic Docker commands (docker run, docker ps, and docker stop) as well as the Dockerfile syntax (at least RUN, CMD, and EXPOSE commands). Ideally, you should have experimented with Compose. If you have limited Docker knowledge but consider yourself a quick learner, don't hesitate to attend: there will be numerous examples and demos, and you will be able to test them out on your own Docker cluster!
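If you want to gauge that background, the level assumed is roughly that of this minimal Dockerfile sketch (the application and file names are hypothetical):

```dockerfile
# Hypothetical minimal image for a small Python web application.
FROM python:2.7
# RUN executes a build-time step inside the image.
RUN pip install flask
COPY app.py /srv/app.py
# EXPOSE documents the port the containerized app listens on.
EXPOSE 8000
# CMD sets the default process started by `docker run`.
CMD ["python", "/srv/app.py"]
```

Building it with `docker build -t myapp .` and starting it with `docker run -d -p 8000:8000 myapp` covers the docker run/ps/stop cycle mentioned above.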
After this tutorial, you will know how to deploy applications to production with Docker and containers. We will tackle lots of frequently asked questions in the Docker ecosystem: how to manage the lifecycle of container images, how to implement service discovery across Docker clusters, how to load balance traffic on scaled applications, how to perform security upgrades, and more.
Containers, Docker, Orchestration, Scheduling, and Service Discovery
Jérôme Petazzoni, Docker Inc.
Jérôme works at Docker, where he helps others to containerize all the things. He was part of the team that built, scaled, and operated the dotCloud PaaS before it became Docker. When annoyed, he threatens to replace things with a very small shell script.
Constitution Ballroom B
The course is a direct response to the many requests I have gotten for “more tools,” and I have written it with an eye to meeting that goal. The class will be taught through a “secure and defend” plan: we will spend the majority of class time learning offensive and defensive tools, then break into teams and work to secure, and set up monitoring for, provided on-site test environments. In the second phase of the class, students will come to the LISA Lab to use the attack tools and defend their environments from their peers. There will be scheduled times for the teams, independently or in groups, to deal with created “incidents.”
This will be a coordinated event that I will support both in my role as instructor and as a member of LISA Build and Labs, and the second phase will run throughout the conference. I will have some form of visual scorekeeping in the Lab, where people can walk in and see what's going on with the event. At the end, I will provide prizes and/or accolades for the best teams.
Participants should be beginning-to-advanced system administrators of any stripe with an interest in IT security and a desire to learn how to attack and defend against potential threats in their environments. Participants are required to have experience with the *nix command line, basic networking, and virtual environments.
Knowledge of how to evaluate an environment, find and mitigate vulnerabilities, improve security monitoring, and detect and defend against attacks. Students will learn how to use a working security toolkit that can be applied directly to their home environments.
- Basic security concepts and architectural design
- How to scope and scan an environment using readily available tools and general sysadmin knowledge
- How to identify, understand, and remediate vulnerabilities, and verify the solution
- How to monitor and react to incursions
Branson Matheson, Cisco Systems, Inc.
Branson is a 29-year veteran of system architecture, administration, and security. He started as a cryptologist for the US Navy and has since worked on NASA shuttle and aerospace projects, TSA security and monitoring systems, secure mobile communications, and Internet search engines. He has also run his own company while continuing to support many open source projects. Branson speaks to and trains sysadmins and security personnel worldwide, and he is currently a senior technical lead for Cisco Cloud Services. Branson holds several credentials and generally likes to spend time responding to the statement “I bet you can't....”
Back Bay Ballroom D (LISA Lab)
In this hands-on hardware workshop, we explore the boundaries of traditional systems and where they converge with networks of billions of embedded devices. Starting with the theory of the Internet of Things, related data transports, and common protocols, we create embedded systems using a set of loaned hardware. Focusing on 802.3, 802.11, and Bluetooth Smart transports, we implement our own IoT edge routers serving our own network of sensor and actuator embedded computers. We will implement a simple messaging application using MQTT or AMQP, and round out the training by integrating our piecemeal solutions into a full-fledged IoT system.
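As a flavor of the messaging exercise, here is a minimal MQTT sketch in Python. The paho-mqtt client library, the broker address, and the topic names are assumptions for illustration; the workshop supplies its own environment.

```python
import paho.mqtt.client as mqtt  # assumed client library

BROKER = "broker.example.org"  # hypothetical broker address

def on_message(client, userdata, msg):
    # Called once per message on any subscribed topic.
    print(msg.topic, msg.payload)

client = mqtt.Client()
client.on_message = on_message
client.connect(BROKER, 1883)  # 1883 is the standard MQTT port
client.subscribe("sensors/+/temperature")  # '+' matches one topic level
client.publish("sensors/lab1/temperature", "21.5")
client.loop_forever()  # process network traffic until interrupted
```

The same publish/subscribe shape applies whether the peers are sensor nodes, actuators, or the edge routers built in the workshop.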
Intermediate hardware or network engineers benefit most from this workshop.
Attendees will take back to their work a broad understanding of what will power the next generation of embedded devices and how those devices interface with traditional large Internet systems.
Device classes
- Whirlwind tour of hardware
- Vendor market trends
- Small manufacturing
Transports
- Copper 802.3
- Wireless 802.11
- Bluetooth 1-3
- Bluetooth Smart
- Zigbee and ANT+
- Z-Wave
- 6LoWPAN
- LoRa and SigFox
Protocols
- Legacy
- MQTT
- AMQP
- CoAP
- ZeroMQ
Michael Schloh, Europalab Networks
Michael Schloh von Bennewitz is a computer scientist specializing in network engineering, embedded design, and mobile platform development. Responsible for research, development, and maintenance of packages in several community software repositories, he actively contributes to the open source development community.
Michael speaks four languages fluently and presents at technical events every year. He teaches workshops exclusively on Internet of Things and Embedded Computing technology, traveling with a mobile laboratory of over 300 sensors, actuators, and computer devices.
Michael's IoT knowledge profits from years of work at telecoms and relationships with industry leaders. He is an Intel Innovator, Samsung partner, and Mozilla committer with a mandate to promote IoT technology.
Additional information is found at http://michael.schloh.com/
Half Day Morning
Speedy Change Control is not an oxymoron. This tutorial will provide practical, actionable steps to streamline and speed up change control at your organization without increasing risk. In The Visible Ops Handbook, authors Behr, Kim, and Spafford identify a culture of change management as common to high-performing IT groups: “change management does not slow things down in these organizations.” This tutorial will help anyone wishing to implement phase one of the Visible Ops Handbook: “Stabilize the Patient” and “Modify First Response.” While I draw heavily on IT Infrastructure Library (ITIL) guidance, much of this is common-sense good practice based on lessons learned from past successes and failures. No special ticketing system, tools, or ITIL knowledge is necessary. I am a certified ITIL Expert with over five years of experience designing, improving, and managing a successful change management process at an audited technology company delivering public registry and DNS services on complex technologies across international data centers.
Technical people and managers who participate in a change management process, or who would like to build one but are afraid that doing so will slow them down.
- Templates for change request types and procedures
- Templates for creating standard operating procedures
- ITIL-aligned talking points for making your case for these process improvements
- Better understanding of change management and process in general
- Different change types
- Assessing risks and potential impact
- Defining change authorities specific for each change type
- Metrics for measuring change process performance against goals
- Release and deployment management
- DevOps
- Continuous delivery
Jeanne Schock
Jeanne Schock has a background in Linux/FreeBSD/Windows system administration that includes working at a regional ISP, a large video hosting company and a Top Level Domain Registry services and DNS provider. About six years ago she transitioned to a role building, managing, and promoting processes in support of IT operations, disaster recovery, and continual improvement. She is a certified Expert in the IT Infrastructure Library (ITIL) process framework with in-the-trenches experience with such processes as Change, Incident, and Problem Management. Jeanne also has a pre-IT academic and teaching career and is an experienced trainer and public presenter, most recently speaking to D.C. chapters of the American Society for Quality, Software Special Interest Group and IEEE Computer Society.
Sasha Goldshtein, Sela Group
Commonwealth Room
This tutorial will give you experience with two powerful Linux performance analysis tools: perf and BPF. Learn how to profile CPU usage, create flame graphs, trace TCP connections, investigate file system latency, explore software internals, and more.
perf_events, aka "perf" after its front-end, is a Linux mainline tool for profiling and tracing. We will summarize some of its most useful one-liners, and discuss real world challenges and solutions for using it with JIT runtimes (Java, Node.js), and in cloud environments.
Enhanced BPF (Berkeley Packet Filter) is a new in-kernel programmable runtime with a variety of uses, including extending Linux static and dynamic tracing capabilities. We'll primarily focus on the BPF Compiler Collection (bcc) front-end for BPF, which provides a toolkit of many ready-to-run analysis tools, including DTrace classics like execsnoop, opensnoop, and biolatency, and new tools including memleak, trace, and argdist. bcc also provides Python and C interfaces for writing your own powerful dynamic tracing-based tools, and we'll show how that can be done.
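As a taste of that Python interface, here is a minimal sketch of a bcc-based tool, patterned on bcc's introductory examples (it assumes a BPF-capable kernel with bcc installed, and root privileges):

```python
from bcc import BPF

# BPF program, written in C: attach a kprobe to the clone syscall and
# write a line to the kernel trace pipe on every new process created.
prog = """
int kprobe__sys_clone(void *ctx) {
    bpf_trace_printk("clone() called\\n");
    return 0;
}
"""

b = BPF(text=prog)   # compile and load the program into the kernel
b.trace_print()      # stream the trace output until interrupted
```

Tools like execsnoop are this same pattern grown up: a small C program that gathers data in the kernel, plus Python that formats it for humans.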
We will spend more time exploring the new world of BPF and its features that were made available in the Linux 4.4 release. Enhanced BPF has become a recent hotspot for systems innovation, helping create other new technologies including bcc, the kernel connection multiplexer (KCM), and eXpress Data Path (XDP), and it is being developed by engineers from many companies, including Facebook, PLUMGrid, Netflix, Cisco, Huawei, GitHub, SELA, and Intel. Join this workshop to get up to speed with BPF for tracing, try some hands-on labs, and gain real experience with the technology from contributor and performance expert Brendan Gregg.
- perf
- Enhanced Berkeley Packet Filter (BPF)
- BPF Compiler Collection
- Python and C interfaces to BPF
Brendan Gregg, Netflix
Brendan Gregg is a senior performance architect at Netflix, where he does large scale computer performance design, evaluation, analysis, and tuning. He is the author of multiple technical books including Systems Performance published by Prentice Hall, and received the USENIX LISA Award for Outstanding Achievement in System Administration. He was previously a performance lead and kernel engineer at Sun Microsystems, where he developed the ZFS L2ARC and led performance investigations. He has also created numerous performance analysis tools, which have been included in multiple operating systems. His recent work includes developing methodologies and visualizations for performance analysis.
Sasha Goldshtein, Sela Group
Sasha Goldshtein is the CTO of Sela Group, a Microsoft C# MVP and Azure MRS, a Pluralsight author, and an international consultant and trainer. Sasha is a book author, a prolific blogger and open source contributor, and author of numerous training courses including .NET Debugging, .NET Performance, Android Application Development, and Modern C++. His consulting work revolves mainly around distributed architecture, production debugging and performance diagnostics, and mobile app development.
Half Day Afternoon
Commonwealth Room
Go is a relatively young language that was built with systems programming in mind. Its compact yet powerful grammar aids the swift development of efficient tools for everyday work. Despite its young age, it has already taken a prominent position for system tools. This hands-on tutorial focuses on reading and writing the Go programming language.
Anyone with a little bit of programming experience who wants to pick up Go
The ability to read and write Go
- Control Structures
- Types
- Functions
- Goroutines
- Channels
Chris McEniry, Sony Interactive Entertainment
Chris "Mac" McEniry is a practicing sysadmin and architect responsible for running a large E-commerce and gaming service. He's been working and developing in an operational capacity for 15+ years. In his free time, he builds tools and thinks about efficiency.
Fairfax Room
All too often, technical teams spend so much time firefighting that they can't stop to identify and eliminate the problems—the underlying causes—of incidents. Incident resolution is about taking care of the customer: restoring a service to normal levels of operation ASAP. Without a process in place to turn the problem into a known error, the root causes of the incident remain, resulting in recurrences of the incident.
The goals of the Problem Management process are to prevent recurrence of incidents, prevent problems and resulting incidents from happening, and minimize the impact of incidents and problems that cannot be prevented. Most technical people already have experience in root cause analysis and problem resolution. This tutorial will help them be measurably more consistent, mature, and effective in their practices. Using IT Infrastructure Library (ITIL) best practices, it will deliver step-by-step instructions on building and managing a problem process. I am a certified ITIL Expert, and I designed, implemented, and then managed a problem process for four years at a registry and DNS service provider with complex technologies across international datacenters.
Technical people and managers responsible for the support of live production services. This is an operational support process that can be put in place from the bottom up. The more teams involved in the process—DBAs, system administrators, developers, helpdesk—the greater the scope of problems that can be addressed.
- A step-by-step guide for building and implementing a problem process and the reasons behind each step
- A process template with examples that can be easily adapted to fit your organization’s current and future needs
- Instructions on setting up a Known Error Database and communicating workarounds to impacted support teams
- Guidance for getting buy-in from peers and managers
- Incident response vs. problem resolution
- Root cause analysis techniques
- Making decisions that are aligned with business objectives
- Getting buy-in from teammates, colleagues and managers
- Proactive problem management
- After-action reviews as a tool
- “Root cause” vs. multiple causes
Jeanne Schock
Jeanne Schock has a background in Linux/FreeBSD/Windows system administration that includes working at a regional ISP, a large video hosting company and a Top Level Domain Registry services and DNS provider. About six years ago she transitioned to a role building, managing, and promoting processes in support of IT operations, disaster recovery, and continual improvement. She is a certified Expert in the IT Infrastructure Library (ITIL) process framework with in-the-trenches experience with such processes as Change, Incident, and Problem Management. Jeanne also has a pre-IT academic and teaching career and is an experienced trainer and public presenter, most recently speaking to D.C. chapters of the American Society for Quality, Software Special Interest Group and IEEE Computer Society.