LISA19 Program Grid
Monday, October 28
7:45 am–8:45 am
Continental Breakfast
Ballroom Foyer
8:45 am–9:00 am
Opening Remarks
Salon EF
Program Co-Chairs: Patrick Cable, Threat Stack, Inc., and Mike Rembetsy, Bloomberg
9:00 am–10:30 am
Keynote Addresses
Salon EF
The Container Operator's Manual
Alice Goldfuss
Containers have been the future for six years now, featured on the stage of every major distributed systems conference in the world. But beyond the hype and the swag is a real technical solution, with real technical challenges, used for real problems at scale. And for the companies and engineers looking to adopt this solution, there’s little content on what awaits them.
In this talk, we’ll discuss some of the advantages and disadvantages of running containers, in production, at scale. We’ll address why to use containers, why not to, and the tradeoffs required at both the technical and human levels for implementing them. You will walk away with a better understanding of how containers could fit into your own architecture and what you’ll need to do to make that rollout a reality. Containers can be a great infrastructure solution, but no one should drive them without a manual.
Alice Goldfuss
Alice Goldfuss is a systems punk with years of experience working on cutting-edge container platforms. She’s an international speaker who enjoys building modern infrastructure at-scale and streaming Capture the Flag challenges on the weekends. Alice has written articles, consulted on publications, built communities, and sipped many cups of tea. She hasn’t written a book, but you’ve probably read her tweets.
In Search of Security Shangri-la
Rich Smith, Duo Security
Security has never been a hotter topic in the mainstream than it is now, from data breaches impacting entire populations, through state sponsored adversaries destabilizing geopolitical norms, to Mr. Robot - the global appetite for "the cyber" seems insatiable. In a world run by software where we can no longer ignore the importance of security and privacy, why are organisations still struggling to integrate security effectively into their wider technology processes? Ask developers and ops engineers and you will quickly hear how painful security teams are to work with, as well as how security’s requirements and approaches are often slow to adapt to the new ways technology is used to drive business value.
As someone who has worked in the security industry for almost 20 years, I agree with them.
In this talk I’ll share my journey from hacker to practitioner, how I denounced the Church of No, and some of the lessons I’ve learnt, in the hope that they will help us all take a small step towards the devsecops utopia that, we have long been promised, is just around the corner.
The search for security Shangri-La is ongoing but the more of us who are looking the better progress we’ll make.
Rich Smith, Duo Security
Rich Smith is the Head of Duo Labs (now part of Cisco), supporting the advanced security research agenda for Duo Security. Prior to joining Duo, Rich was Director of Security at Etsy, co-founder of Icelandic red team startup, Syndis, and has held roles on security teams at Immunity, Kyrus, Morgan Stanley, and HP Labs.
Rich has worked professionally in the security space since the late 90’s in roles ranging across security research, building security organizations, security consulting, penetration testing, red teaming, exploit development, and attack tooling development. More recently, Rich co-authored a new book for O’Reilly titled Agile Application Security: Enabling Security in a Continuous Delivery Pipeline.
He has worked in both the public and private sectors in the U.S., Europe, and Scandinavia, and currently spends most of his time bouncing between Detroit, Reykjavik and New York City.
10:30 am–11:00 am
Break with Refreshments
Ballroom Foyer
11:00 am–12:30 pm
Track I
Salon F
Fuzzy Lines: Aligning Teams to Monitor Your Application Ecosystem
Kim Schlesinger and Sarah Zelechoski, Fairwinds
DevOps is the dream, but when you can’t make cross-functional agile teams a reality, you will need to foster collaboration between several different teams, and potentially two different companies. From miscommunication between teams to differing priorities to broken SLAs, the struggle is real.
To overcome these difficulties, you must focus on the relationship between your ops and dev teams. This alliance is what matters most, and it works better when all teams have a set of shared values, responsibilities, and recurring processes and tools. After attending this talk, audience members will have a list of concrete processes and tools to foster cross-team cohesion, including creating shared expectations and responsibilities, setting up regular meetings, and standardizing alerting across teams through monitoring as code.
This talk will benefit anyone who works with multiple ops or dev teams to deploy and monitor infrastructure, services and applications.
Kim Schlesinger, Fairwinds
Kim Schlesinger is a Site Reliability Engineer at Fairwinds. Prior to being an SRE, Kim was an Instructor, Web Developer, and Curriculum Designer for the Full-Stack Immersive Program at Galvanize, a codeschool in Denver, Colorado. Kim loves working at the intersection of tech and adult education.
Sarah Zelechoski, Fairwinds
Sarah Zelechoski began her career as an astrophysicist and has pivoted often between systems and development work, but really hit her stride in operations. She has run production operations and infrastructure management for more than 30 customers of all shapes and sizes. Sarah is currently the Vice President of Engineering at Fairwinds where her team focuses on providing expert-level guidance and a curated framework around using Kubernetes and other CNCF projects to solve challenging and interesting problems. Sarah's greatest passion is in helping others, which encompasses advocating for engineers and rekindling interest in the lost art of Service in the tech space.
How Math, Science, and Star Trek Help Us Understand the Value of Team Diversity
Fredric Mitchell, Bright Plum, Inc.
The greatest asset of open source software is the ability to fork and improve. When it comes to our community’s culture, could we be better?
This session explores the mathematical algorithms and scientific studies describing the advantage of diverse teams. The goal is to apply our systems thinking in solving technical problems to understanding one's role in solving social problems.
Fredric Mitchell, Bright Plum, Inc.
Fredric Mitchell is the principal of Bright Plum, Inc., a neighborhood tech consulting shop, and advisor to The Whether, a startup aimed at helping companies recruit diverse entry-level talent. He has a wide variety of experience working in the government, NGO, and education sectors. Fredric has presented at numerous events throughout the United States and Costa Rica and is a regular contributor to open source communities. He has a B.S. in Electrical Engineering from Washington University in St. Louis.
Track II
Salon E
fs123: An Open-Source, Network Filesystem with Pervasive Caching
Michael Fenn, D. E. Shaw Research, LLC
fs123 is a read-only, scalable, high-performance network filesystem running in production, delivering petabytes of simulation data and tens of thousands of software packages to thousands of geographically distributed clients.
The fs123 protocol is layered over HTTP and leverages that ecosystem (load balancers, proxies, redirects, etc.). The protocol provides mechanisms for loosely coupled servers to assert to the client that two files are the same, which allows horizontal scaling of fs123 servers.
fs123 is WAN-friendly: its design minimizes the number of round trips the client must make for each operation and provides tunable caching features.
Since 2016, fs123 has run in production on ~5000 machines across several sites. The client uses the FUSE low-level API and can work through network outages (or even offline) once the on-disk cache is primed. The libevent-based server easily delivers data at 40 Gbps.
Michael Fenn, D. E. Shaw Research, LLC
Michael Fenn is a Research Engineer at D. E. Shaw Research, LLC and holds a B.S. and M.S. in Computer Science from Clemson University. He first became interested in fault-tolerant filesystems after suffering through one too many NFS outages. When he's not at work, he enjoys automotive track events and doing burnouts.
Deep Dive into Kubernetes Internals for Builders and Operators
Jérôme Petazzoni, Tiny Shell Script LLC
Note: this talk is also available as a hands-on tutorial. If you prefer learning by doing, check it out!
If you operate (or plan to operate) Kubernetes, it's helpful to understand its internals: what are the components of the control plane? What are their respective roles? How do they communicate?
To get the most out of this talk, you should be familiar with basic Kubernetes concepts like deployments, pods, and services.
We'll start by explaining exactly what happens between the execution of commands like "kubectl run" and "kubectl expose" and the moment when the containers are actually running and available on the cluster.
Then we'll build a simplified cluster, one component at a time, until it can execute that "kubectl run" command, and we'll see that it's not as complicated as it sounds.
We will show how kube-proxy provides connectivity to services, and how CNI plugins provide connectivity to pods themselves.
Finally, we'll highlight some of the differences between that experiment and a production-grade cluster.
Jérôme Petazzoni, Tiny Shell Script LLC
Jérôme was part of the team that built, scaled, and operated the dotCloud PAAS, before that company became Docker. He worked seven years at the container startup, where he wore countless hats and ran containers in production before it was cool. He loves to share what he knows, which led him to give hundreds of talks and demos on containers, Docker, and Kubernetes. He trained thousands of people to deploy their apps in confidence on these platforms and continues to do so as an independent consultant. He values diversity and strives to be a good ally, or at least a decent social justice sidekick. He also collects musical instruments and can arguably play the theme of Zelda on a dozen of them.
Workshops
Salon ABCD
Linux Productivity Tools
Ketan Maheshwari, Oak Ridge National Laboratory
Who should attend: Anyone who works regularly in a Linux command-line environment or wants to learn more about getting things done from the Linux command line.
Take back to work: Tools and techniques to improve efficiency in a Linux command-line environment. A desktop reference to accomplish several day-to-day tasks quickly.
Topics include:
- pipes and redirection
- ssh-tunnels, ssh-config
- awk, grep, regex
- xargs, gnu-parallel
- tmux, cron
Prerequisites:
- Basic exposure to Linux commands, e.g., ls, cd, mkdir, etc.
Ketan Maheshwari, Oak Ridge National Laboratory
Ketan is a Linux Systems Engineer at the Oak Ridge National Laboratory. He is a Linux enthusiast and enjoys learning new tools as well as new applications of old tools.
12:30 pm–2:00 pm
Lunch at the Expo
Exhibit Hall
2:00 pm–3:30 pm
Track I
Salon F
Multi-GPU Accelerated Processing of Time-Series Data of Huge Academic Backbone Network in ELK Stack
Ruo Ando, Center for Cybersecurity Research and Development, National Institute of Informatics
We report our operational experience in deploying a multi-GPU accelerated monitoring system for a huge academic backbone network on the ELK stack. The Science Information Network (SINET) is a Japanese academic backbone network serving more than 800 research institutions and universities. Since 2016, our SOC team has been running the monitoring system at SINET's gateway to handle the hundreds of millions of session data generated per day by PaloAlto-7080. To provide SOC operators with deep insights, a multi-GPU server (DGX-1) runs in the workflow between Elastic Stack and Splunk. We qualitatively describe the bottlenecks we encountered from 2016 to 2018 in coping with the PA-7080’s traffic stream stored in the ELK stack. We illustrate some of the techniques we used, such as multi-process invocation of the scroll API, parallel invocation of the CUDA Thrust API, and massively parallel access to a highly concurrent container. We also report performance measurements from processing 729 GB of randomly generated session data in about 910 minutes.
Ruo Ando, Center for Cybersecurity Research and Development, National Institute of Informatics
Ruo Ando is an associate professor of NII (National Institute of Informatics) by special appointment in Japan. He has a Ph.D. in computer science. Before joining NII, he was engaged in a research project supported by US AFOSR in 2003 (Grant Number AOARD 03-4049). He has presented his research at PacSec2011 (BitTorrent crawler) and DEFCON 26 (packet dump analyzer). He was co-author at LISA 2006 (hypervisor security). His current research interest is massively parallel computing.
Wide Event Analytics
Igor Wiedler
Software is becoming increasingly complex and difficult to debug in production. Yet most of the monitoring systems of today are not equipped to handle high cardinality data needed to effectively operate large-scale services. It doesn't have to be this way! If we treat monitoring as an analytics problem, we can gain the ability to query our events with a lot more flexibility, get answers to questions previously unthinkable, and do so at interactive query latencies.
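As a rough illustration of treating monitoring as an analytics problem, the sketch below keeps one wide event (a record with many fields) per request and answers an ad hoc question by grouping on any field, including a high-cardinality one such as a user ID. The event fields, the sample data, and the helper `error_rate_by` are hypothetical and are not taken from the talk.

```python
from collections import defaultdict

# Hypothetical wide events: one record per request, each carrying many fields.
EVENTS = [
    {"user_id": "u1", "endpoint": "/cart", "status": 200, "duration_ms": 12},
    {"user_id": "u2", "endpoint": "/cart", "status": 500, "duration_ms": 950},
    {"user_id": "u2", "endpoint": "/pay", "status": 500, "duration_ms": 870},
    {"user_id": "u3", "endpoint": "/pay", "status": 200, "duration_ms": 30},
]


def error_rate_by(events, field):
    """Group events by an arbitrary field and return the 5xx fraction per group."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        key = event[field]
        totals[key] += 1
        if event["status"] >= 500:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# The same raw events answer two different questions without pre-aggregation.
by_user = error_rate_by(EVENTS, "user_id")
by_endpoint = error_rate_by(EVENTS, "endpoint")
```

Grouping by endpoint shows both endpoints failing half the time, while grouping by user shows that every failure belongs to a single user. A per-endpoint metric alone would not show that.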
Igor Wiedler
Igor is an SRE.
Track II
Salon E
Earthquakes, Forest Fires, and Your Next Production Incident
Alex Hidalgo, Squarespace
The Incident Command System is a decades-old tool used for responding to real-world incidents and emergencies, and some form of it has been adopted by many operational teams. However, most don’t know the origins of the system, how it grew to what it is today or why it’s as useful for computer systems as it is for hurricane response. Come learn about the history of the ICS, its successes and failures, and how you can adopt the best aspects of it for your emergencies, today!
Alex Hidalgo, Squarespace
Alex Hidalgo has been a Site Reliability Engineer since 2011. During that time he has developed a deep love for sustainable operations, metrics, and monitoring, and using error budgets to drive almost every decision. Alex's previous jobs have included IT support, network security, restaurant work, t-shirt design, and hosting game shows at bars. When not sharing his passion for technology with others, you can find him scuba diving or watching college basketball. He lives in Brooklyn with his partner Jen and a rescue dog named Taco. Alex has a BA in philosophy from Virginia Commonwealth University.
Let Your Software Supply Chain Ride with Kubernetes CI/CD
Ricardo Aravena, Rakuten
In the last two years, we have seen Kubernetes GitOps become more universal in many teams helping them enhance their software pipelines. Yet, there are still some gaps when it comes to enhancing security and gluing all the pieces together.
We will survey some of the more popular GitOps open-source tools such as Draft and Flux along with a security review for real-world production environments. Which one could be more vulnerable and how would you harden them? What about building and verifying container images with open-source projects like Kaniko, and in-toto? How can you fully put the pieces together with Spinnaker or Tekton?
By the end of the session, the audience will have a good understanding of the current state of the GitOps ecosystem in the open-source world and how to leverage several tools to enhance, secure, increase agility and create their container software factory in production.
Ricardo Aravena, Rakuten
Ricardo currently works at Rakuten as an Infrastructure Manager, automating everything in containers using open source and lately contributing to the Kata Containers project. He has been working in tech for more than 20 years and comes from a diverse professional background, having been in different roles at large companies such as Cisco and VMware as well as startups such as Coupa, Hytrust, Exablox, and SnapLogic. Most recently he was at Branch Metrics where he spent 2 years working on automating their cloud infrastructure to handle millions of requests and petabytes of data on a daily basis.
Workshops
Salon ABCD
Linux Systems Troubleshooting
Thomas Uphill, uphillian
In this fast-paced tutorial we will start with an overview of how Linux works. We'll review some of the interesting history of UNIX and why some things exist. We'll cover all the basics before moving on to how to approach troubleshooting in a systematic fashion. Armed with a method, we'll look at some examples and the tools that are useful for finding the sources of problems.
Thomas Uphill, uphillian
Thomas started working with UNIX while at university in Vancouver. He switched from IRIX/Solaris to Linux in the early 90s and has been a fan since then. He's written a few books on Puppet as well as Linux. He lives and works in Seattle and, in Pacific Northwest tradition, is an avid mountain biker and hiker.
3:30 pm–4:00 pm
Break with Refreshments at the Expo
Exhibit Hall
4:00 pm–5:30 pm
Track I
Salon F
Token Up: Keeping Hands out of the Cookie Jar
Erin Browning, Latacora
Even in these modern times, we still trade credentials for authentication or session tokens. In typical applications, session tokens received on the client side are stored in either the browser's local storage or as cookies. As an attacker, I want to steal a user's auth token, hijack their session and then take over their account. The browser and a naive user are good attack vectors. We’ll run through how to architect your website to take advantage of various browser-based protections that reduce the impact of common attacks, such as cross-site scripting and privilege escalation.
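The abstract does not name the specific browser protections it covers. As one plausible example, the sketch below sets the standard `HttpOnly`, `Secure`, and `SameSite` attributes on a session cookie using Python's `http.cookies` module (the `samesite` attribute requires Python 3.8 or later). The cookie name `session`, the token value, and the helper `build_session_cookie` are assumptions made for illustration.

```python
from http.cookies import SimpleCookie


def build_session_cookie(token):
    """Return a Set-Cookie header value for a session token with restrictive flags."""
    cookie = SimpleCookie()
    cookie["session"] = token          # hypothetical cookie name
    morsel = cookie["session"]
    morsel["httponly"] = True          # not readable from page JavaScript
    morsel["secure"] = True            # only sent over HTTPS
    morsel["samesite"] = "Strict"      # not sent on cross-site requests
    morsel["path"] = "/"
    return morsel.OutputString()


header_value = build_session_cookie("abc123")
```

A token stored in a cookie marked `HttpOnly` cannot be read by injected script, which is the main difference from keeping the token in the browser's local storage.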
Erin Browning, Latacora
Erin Browning is a senior security engineer at Latacora. She focuses on application and Android security and has an interest in cryptography. She loves cats and puns. You can find her on Twitter @efrowning.
Multi-Architecture Container Images: Why Bother, and How To
Lisa Seelye, Red Hat
All of us, from hobbyists to enterprise solutions architects, are faced with downloading software from the Internet and making it work. The first hurdle is getting the new software to run on our computer, and that's where we run into so much trouble.
Too often, containers are produced with just a single CPU architecture in mind. As non-traditional architectures become more common, this "make it work" problem gets harder. But there's a way to continue to make it work with our beloved containers.
Sit in and learn more about how container registries know which container to give you, and what a container even is, how to build container images which support multiple CPU architectures, and why it all even matters.
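One way to picture the selection step, assuming the OCI image index layout (a `manifests` list whose entries carry a `platform` with `os` and `architecture`), is the sketch below. The digests are shortened placeholders and `select_manifest` is a hypothetical helper; real clients also consider fields such as `variant`, which this sketch ignores.

```python
# A trimmed image index in the shape of the OCI image index format.
# The digests are placeholders, not real image digests.
IMAGE_INDEX = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.index.v1+json",
    "manifests": [
        {"digest": "sha256:aaaa", "platform": {"os": "linux", "architecture": "amd64"}},
        {"digest": "sha256:bbbb", "platform": {"os": "linux", "architecture": "arm64"}},
    ],
}


def select_manifest(index, os_name, architecture):
    """Return the digest of the first manifest matching the platform, or None."""
    for entry in index["manifests"]:
        platform = entry.get("platform", {})
        if platform.get("os") == os_name and platform.get("architecture") == architecture:
            return entry["digest"]
    return None


arm_digest = select_manifest(IMAGE_INDEX, "linux", "arm64")
missing = select_manifest(IMAGE_INDEX, "linux", "riscv64")
```

An image published for only one architecture has no matching entry for other platforms, which corresponds to the `None` result here.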
Lisa Seelye, Red Hat
Lisa is a senior site reliability engineer with Red Hat working on the OpenShift Dedicated product. In her spare time Lisa plays Magic: the Gathering, tinkers with a hobbyist 64-bit ARM Kubernetes cluster, and ensures her cat Clyde has plenty of food and sunbeams in which to sleep.
Track II
Salon E
Storytelling for Engineers
Bradley Shively, Uber ATG
Every engineer has experienced the pain of a misfired e-mail or poorly delivered presentation. Eventually, it becomes apparent that communication is an essential engineering skill. Most of all, the ability to effectively communicate makes it easier to work across teams, solve (or avoid) problems, and build the right systems and software. But how do we connect with our audience, deliver pertinent information, and get people to take action? We may think that just providing data is sufficient, but this is rarely true. Instead, we have to think about communication, even in a technical setting, in a new way. We have to learn to tell stories.
This talk will explore how to apply a storytelling framework in engineering so that we can communicate more effectively about the systems we build and operate.
Bradley Shively, Uber ATG
Brad Shively is a reformed management consultant as well as former operations manager at Google. These days, he manages engineering teams at Uber's self-driving car project, where he applies his years of experience in client-facing roles to the challenge of building great internal tools and workflows for developers.
Off the Beaten Path: Moving Observability Focus from Your Service to Your Customer
Mohit Suley, Microsoft
Observability systems are usually designed to answer two broad questions: "Is my service doing OK?" and "Is my business doing OK?" There is a third perspective that often doesn't get enough attention (unless it's clearly linked to the first two): "Is my Customer Experience OK?"
Customer Experience is a part of Site Reliability for us. While SLOs are a great way of measuring service, we realized they didn't paint a representative picture of how our customers felt while engaging with our product. In other words, we wanted to humanize the numbers we see.
This talk explains our motivations for stepping out of our metrics-centered "comfort zone" and the journey that ensued: developing a habit of engaging face-to-face with some of our customers, figuring out ways to experience what they did, open-sourcing a high-scale tool to capture this data, setting up broader direct-to-team channels of communication from customers, and re-thinking performance metrics.
Mohit Suley, Microsoft
Mohit is an Engineer on Bing's Live Site Engineering team. Designing systems to proactively improve availability and route around problems is a core mission of the team. In his spare time he loves long walks, tinkers with hardware, and chases his goal of reading more books than Bill Gates.
Workshops
Salon ABCD
Mastering zsh
Justin Garrison, Disney Streaming Services
An engineer's $SHELL is likely one of their most used tools. However, it is rarely mastered because it's invisible in nature. Knowing how to use aliases, variables, and functions is an important start, but there are many more features that can be used to boost your productivity and help you be better.
In this training, Justin will cover some of the more advanced topics of zsh and how you can use them. This will help you save time at a prompt and be a more proficient shell user for fun and profit.
if [[ "$SHELL" =~ "zsh" ]]; then
  attendees=$((attendees + 1))
else
  chsh -s /usr/bin/zsh
  exec /usr/bin/zsh
  attendees=$((attendees + 1))
fi
Justin Garrison, Disney Streaming Services
Justin is a co-author of Cloud Native Infrastructure and loves open source software and participating in healthy communities. Currently he makes developers more productive at Disney Streaming Services.
6:00 pm–7:00 pm
Monday Happy Hour
Exhibit Hall
Join us at the LISA19 Expo for refreshments, and take the opportunity to learn about the latest products and technologies. Don’t forget to get your Expo passport stamped!
7:00 pm–11:00 pm
Birds-of-a-Feather Sessions (BoFs)
View the full schedule of BoFs on the LISA19 BoFs page.
Tuesday, October 29
8:00 am–9:00 am
Continental Breakfast
Ballroom Foyer
9:00 am–10:30 am
Track I
Salon F
Pulling the Puppet Strings with Ansible
Brian J. Atkisson, Red Hat, Inc.
Red Hat IT (RHIT) has invested heavily in configuration management for nearly 20 years. Over this time, the configuration management landscape and strategies have substantially evolved. After standardizing on open source Puppet in 2008, we found ourselves with 500,000 lines of manifest code and tens of thousands of template files a decade later. We also were tied to a platform that no longer solved today's challenges.
Ansible has proven to be a simple and robust tool for managing all aspects of modern environments. Application and systems teams were originally attracted to Ansible's natural and prescriptive language. We found it also easily handled hybrid cloud and application orchestration, in addition to traditional configuration management functions. This drove RHIT to begin adopting Ansible across the board.
In this session, we shall discuss our Ansible migration experience and strategy going forward, including the use of tools like AWX (Tower) and Galaxy.
Brian J. Atkisson, Red Hat, Inc.
Brian J. Atkisson has 20 years of production systems engineering and operations experience, focusing primarily on identity management and hybrid infrastructure solutions. He has worked in these roles for the University of California, Jet Propulsion Laboratory, and Red Hat, Inc. He is a Red Hat Certified Architect and Engineer, in addition to holding many other certifications and a B.S. in Microbiology. He currently is the Infrastructure Architect for Red Hat IT.
Enabling Invisible Infrastructure Upgrades with Automated Canary Analysis
Adam McKenna, Pinterest
In the DevOps world, we spend inordinate amounts of time testing our software builds and deployments. But often, the underlying infrastructure is taken for granted. When it becomes time to upgrade, we scramble and toil to migrate everything before our infrastructure goes end-of-life and becomes unsupported.
Upgrading operating systems, language runtimes, and other infrastructure can introduce subtle issues that can go unseen until the new components are deployed at scale.
This talk will introduce the concept of using automated canary analysis for infrastructure upgrades. Canary analysis is an ideal strategy to minimize risk, retain organizational knowledge, and give service owners peace of mind.
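A minimal sketch of the idea, assuming a single latency metric and a fixed tolerance, is shown below. Production canary analysis tools generally apply statistical tests across many metrics; this simple mean comparison is a stand-in, and the function name, threshold, and sample values are all hypothetical.

```python
from statistics import mean


def canary_passes(baseline_samples, canary_samples, max_relative_increase=0.10):
    """Pass when the canary's mean is at most 10% (by default) above the baseline's."""
    baseline = mean(baseline_samples)
    canary = mean(canary_samples)
    if baseline == 0:
        # Avoid dividing by zero: only an equally clean canary passes.
        return canary == 0
    return (canary - baseline) / baseline <= max_relative_increase


# Hypothetical request latencies (ms) from hosts on the old and new OS image.
baseline_latency = [100, 102, 98]
healthy_canary = [101, 103, 99]
regressed_canary = [130, 128, 135]

healthy_ok = canary_passes(baseline_latency, healthy_canary)
regressed_ok = canary_passes(baseline_latency, regressed_canary)
```

Running the same check on each batch of upgraded hosts gives a repeatable gate before the upgrade continues to the rest of the fleet.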
Adam McKenna, Pinterest
Adam works on the Core SRE team at Pinterest, primarily collecting tech debt and developing automation tools with a focus on developer experience. He has over 25 years of experience with Linux and other Unix-like operating systems, including several years as a member of the Debian project. As a hybrid systems engineer/administrator and coder, he was naturally attracted to DevOps work. Outside of work, he is a father of 3 boys and enjoys gardening, gaming, and photography.
Track II
Salon E
Pardon the Interposition—Modifying and Improving Software Behavior with Interposers
Danny Chen, Bloomberg LP
Sometimes we are delivered software that doesn't do exactly what we want and we have to find ways to modify its behavior. Many times we wind up "wrapping" the software. But modern language runtimes provide mechanisms for overriding the behavior of software after it has been built and delivered/deployed (e.g., inversion of control and dynamic wiring in Java, monkey patching in Python). Shared library-based software on UNIX systems has long enjoyed similar capabilities via interposers, functionality that is relatively under-utilized by testers, SAs, and SREs.
In this talk, we present how we used a C library interposer to convert the attachment store of JIRA from being filesystem-based into a two-level store—with the filesystem being merely a cache for a cloud-based store. In a sense, we changed the behavior of the system underneath JIRA to change JIRA's behavior. This implementation gives us local filesystem performance (mostly) for attachment stores with near-infinite capacity for attachments as well as added resiliency (from the use of the cloud-based store).
We also present some other examples of how we have used interposers to support testing and monitoring.
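The talk's interposer is a C library. As a loose analogy in Python, using the monkey patching the abstract mentions, the sketch below changes a read path after the fact so that a local store becomes a cache in front of a remote one. `AttachmentStore`, `REMOTE`, and `read_with_remote_fallback` are hypothetical names, and this is not the implementation presented in the talk.

```python
class AttachmentStore:
    """Stand-in for delivered software that only knows about local storage."""

    def __init__(self):
        self.local = {}

    def read(self, name):
        return self.local[name]


# Hypothetical remote (cloud) store holding the authoritative copies.
REMOTE = {"report.pdf": b"remote copy"}

_original_read = AttachmentStore.read


def read_with_remote_fallback(self, name):
    """On a local miss, fetch from the remote store, cache it, then delegate."""
    if name not in self.local:
        self.local[name] = REMOTE[name]
    return _original_read(self, name)


# Replace the method without modifying the class's source.
AttachmentStore.read = read_with_remote_fallback

store = AttachmentStore()
data = store.read("report.pdf")
```

Callers of `read` are unchanged. Only the layer underneath behaves differently, which is the property the abstract describes for the C library interposer.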
Danny Chen, Bloomberg LP
Danny Chen started his career almost 40 years ago as a UNIX performance engineer at Bell Laboratories, where he was a co-developer of one of the first general-purpose UNIX kernel tracing facilities (USENIX/1988: CASPER the Friendly Daemon). He also contributed performance improvements to the SVR4 virtual memory implementation (USENIX/1990: "Insuring Improved VM Performance - Some No-Fault Policies"). He has worked on low latency market data systems, messaging systems, distributed transaction management, capacity planning, and enterprise systems monitoring.
Network Fault Finding System: Packet Loss Triangulation
Jose Leitao and Daniel Rodriguez, Facebook
Most network monitoring relies upon the devices providing the signal used to calculate health via syslog messages, SNMP, telemetry, or custom APIs.
In large scale networks, we can’t trust the devices to accurately report health in all the possible failure cases that may exist. At FB, in addition to standard monitoring tools, we also actively probe our networks with test traffic to ensure the platforms are behaving as we expect. With active monitoring, we can find misbehaving network elements even when they exist several layers deep inside the network.
During the presentation, we will show how we built a sample system that achieves similar results using open source tools and perform a live demo with a lab network from start to finish, introducing packet loss and showing how the system can identify where the loss is occurring in real time.
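The triangulation idea can be sketched as follows, under a strong simplifying assumption: a probe that arrives intact clears every link on its path, so the suspects are the links that appear only on lossy paths. The link names, probe data, and `suspect_links` helper are hypothetical, and this is not the system described in the talk. With partial or intermittent loss, a single clean probe would not be enough to clear a link.

```python
from collections import Counter


def suspect_links(probe_results):
    """Rank links by how many lossy paths they appear on, skipping cleared links.

    probe_results is a list of (path, lost) pairs, where path is a list of
    link names and lost is True when the probe saw packet loss.
    """
    healthy = set()
    for path, lost in probe_results:
        if not lost:
            healthy.update(path)

    counts = Counter()
    for path, lost in probe_results:
        if lost:
            counts.update(link for link in path if link not in healthy)
    return counts.most_common()


# Hypothetical lab network: hosts a, b, c attached to switches s1 and s2.
PROBES = [
    (["a-s1", "b-s1"], True),   # a to b via s1: loss
    (["c-s1", "b-s1"], True),   # c to b via s1: loss
    (["a-s1", "c-s1"], False),  # a to c via s1: clean
    (["a-s2", "b-s2"], False),  # a to b via s2: clean
]

suspects = suspect_links(PROBES)
```

Both lossy paths share the link `b-s1`, and every other link they use is cleared by a clean probe, so `b-s1` is the only suspect.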
Jose Leitao, Facebook
Jose Leitao is a production network engineer in the Network org at Facebook. His team's responsibilities include maintaining, monitoring, and improving the global production network infrastructure.
Daniel Rodriguez, Facebook
Daniel Rodriguez is a production network engineer in the Network org at Facebook. His team's responsibilities include maintaining, monitoring, and improving the global production network infrastructure.
Workshops
Salon ABCD
Resource Management and Service Sandboxing with systemd
Michal Sekletar, Red Hat
Attend this talk to learn more about systemd's high-level APIs for service resource management and sandboxing. We will start with a quick review of systemd to get everyone up to speed, and then we will dive into topics like Linux cgroup and NUMA workload placement and how to manage them with systemd. Next, we will look at sandboxing features implemented by systemd.
Make your services more performant and secure!
Michal Sekletar, Red Hat
Michal Sekletar joined Red Hat in 2011 and currently works as Principal Software Engineer in the "Plumbers" team. He spends his days working on and supporting init systems and other low-level user-space components. He holds a Masters degree from Brno University of Technology. His other professional interests include programming languages, algorithms, and UNIX-like operating systems.
10:30 am–11:00 am
Break with Refreshments at the Expo
Exhibit Hall
11:00 am–12:30 pm
Track I
Salon F
GitOps, an Elegant Tool for Hybrid Cloud Kubernetes
Ryan Cook, Red Hat, Inc.
Application management, migration, and portability are difficult. Through the use of GitOps, we can ensure that a consistent experience is delivered whether the application and objects are deployed on a single Kubernetes cluster or multiple clusters all over the world.
In this presentation, you'll learn through a series of live demonstrations how to deploy, manage, and migrate applications between three clusters spread across the United States. After the session, you will have a better understanding of managing resources and objects that will be under revision control, resilient, and automatically deployed.
Ryan Cook, Red Hat, Inc.
Ryan has been a team lead within the Red Hat OpenShift team for a number of years. He loves automation and generally anything that makes his job easier.
Ops on the Edge of Democracy
Chris Alfano, Code for Philly and Jarvus Innovations
It is often considered irresponsible for community organizations, nonprofits, local governments, and individual users to build or even redeploy their own technology. But is that a core constraint of information technology, or merely an accidental symptom of over-optimizing for the end of the market with easy returns for venture capital?
Chris Alfano, Jarvus Innovations, Code for Philly
I’m a software developer forged by FOSS and a founding captain of one of the longest-running local affiliates of Code for America (called a brigade). My civic hacking journey began while working in a Philly public school and for nearly a decade now I’ve gotten to build open-source software with some of the most innovative public schools in the world.
Track II
Salon E
Linux Systems Performance
Brendan Gregg, Netflix
Systems performance is an effective discipline for performance analysis and tuning, and can help you find performance wins for your applications and the kernel. However, most of us are not performance or kernel engineers, and have limited time to study this topic. This talk summarizes the topic for everyone, touring six important areas of Linux systems performance: observability tools, methodologies, benchmarking, profiling, tracing, and tuning. Included are recipes for Linux performance analysis and tuning (using vmstat, mpstat, iostat, etc), overviews of complex areas including profiling (perf_events) and tracing (Ftrace, bcc/BPF, and bpftrace/BPF), and much advice about what is and isn't important to learn. This talk is aimed at everyone: developers, operations, sysadmins, etc, and in any environment running Linux, bare metal or the cloud.
Brendan Gregg, Netflix
Brendan Gregg is an industry expert in computing performance and cloud computing. He is a senior performance architect at Netflix, where he does performance design, evaluation, analysis, and tuning. He is the author of BPF Performance Tools (Addison-Wesley) and Systems Performance (Prentice Hall), and received the USENIX LISA Award for Outstanding Achievement in System Administration. Brendan has created numerous performance analysis tools, visualizations, and methodologies, including flame graphs.
Our Journey of Implementing TLS at Scale for Services on Kubernetes
Tilottama Gaat and Akshay Chitneni, VMware
TLS is the industry standard for encrypting communication between endpoints; however, there are unique challenges to implementing TLS in a microservice environment. For example, when you have hundreds of microservices running in multiple environments, how do you provision and distribute TLS certificates in a scalable way while causing the least disruption to uptime? How do you easily manage day 2 operations for those TLS certificates, such as certificate renewal or revocation? In this talk, we present Diploma, our Vault-based certificate generation system, and Chancellor, a Kubernetes controller that distributes certificates to workloads using the Kubernetes API. Together they provide certificates for 40+ microservices in production and serve 2000+ certificates a day in development and test environments.
Tilottama Gaat, VMware
Tilottama Gaat has been a software development engineer for the past 11 years, working on different SaaS products. At VMware, she is working on building infrastructure that supports 40+ services in production.
Akshay Chitneni, VMware
Akshay Chitneni is a software engineer on the cloud services infrastructure team. He focuses on developing tools and services that help run the core services in a more reliable and secure way.
Workshops
Salon ABCD
Extending Kubernetes with the Operator Pattern
Ryan Jarvinen, Red Hat
Learn how to extend Kubernetes to include your own custom operational tactics and container-management best practices, using the Kubernetes "Operator" pattern!
Operators feel like native features to Kubernetes end-users because they use Custom Resource Definitions, API Aggregation, and Custom Controllers to extend the basic platform APIs, allowing complicated solutions to be managed using simple declarative resource specifications.
This session provides architectural overviews, implementation patterns, and a look at a few popular solutions from this space.
Bring a laptop to follow along as we learn to extend Kubernetes through a series of hands-on, interactive training scenarios!
Ryan Jarvinen, Red Hat
Ryan Jarvinen is a Developer Advocate (Red Hat, previously CoreOS) who focuses on developer experience and usability in the Cloud Native ecosystem. He is a frequent conference speaker and workshop leader who enjoys helping teams develop strategies for maximizing their productivity while using Kubernetes. Past achievements include contributing to the CNCF's Certified Kubernetes Administrator exam curriculum, and speaking at LISA 17 in SF! <3 If you don't get a chance to meet him in the Red Hat booth, you can find him online as "RyanJ" via twitter, github, or IRC.
12:30 pm–2:00 pm
Lunch at the Expo
Exhibit Hall
2:00 pm–3:30 pm
Track I
Salon F
Creating a Distributed Round Robin Scheduler with Etcd
Eric Chlebek, Sensu, Inc.
Join me on a journey through creating a distributed round-robin scheduler, using etcd as the persistence and consensus mechanism! As a software developer, I found this to be one of the more challenging projects I've worked on. I'm excited to share the trials and tribulations of this work with you, as well as what I learned along the way.
Eric Chlebek, Sensu, Inc.
Eric is a software developer from Vancouver, British Columbia. He's done work in scientific computing, ad-tech, and now, monitoring and observability.
The Challenges of Managing Open Source Infrastructure at Bloomberg
Andrew Terng, Bloomberg
Bloomberg's Engineering team has deployed a Telemetry as a Service infrastructure that collects and processes over 6M metrics per second from tens of thousands of machines and applications, generating 120 TB of logs each day. In this talk, we will discuss when and how our engineers decide which open source software solutions to use. We will also look at some of the recent challenges we faced in using some of this software at scale and discuss how these hurdles were overcome.
Andrew Terng, Bloomberg
Andrew is an engineer at Bloomberg LP, where he works on the Telemetry as a Service platform. He is currently responsible for the build and automation of the Kubernetes infrastructure, along with other OSS that processes 6M metrics per second and over 120 TB of logs daily within the organization.
Prior to Bloomberg he worked at Digital Ocean, Tumblr, Yahoo! Search, and more.
Track II
Salon E
Kubernetes the Very Hard Way
Laurent Bernaille, Datadog
Running large Kubernetes clusters is challenging. At large scales, practitioners need to adapt and tune both their architectures and component configurations in specialized ways.
Our organisation has been running large scale Kubernetes clusters (up to 2000 nodes, and growing) for more than a year, and we have learned several lessons the hard way. This talk will dive into complex runtime and networking issues that occur when running Kubernetes in production at scale. We will provide examples of how to improve the architecture of clusters to increase scalability and performance, both on the control plane and the data plane. Further, tools from the greater ecosystem will be examined, as they are rarely tested within the context of very large clusters.
Finally, the talk will also discuss the mutually beneficial relationship we built with the larger Kubernetes community by providing feedback on the tools and contributing both fixes and improvements upstream.
Laurent Bernaille, Datadog
Laurent is a Staff Engineer at Datadog and works on the Compute team, which is responsible for setting up and scaling Kubernetes platforms. He has given several talks on application deployment and containers at conferences such as DockerCon, Open Source Summit, EuroBSDcon, and KubeCon.
Anatomy of a DDoS
Janna Hilferty, TriNet, Inc
DDoS, or Distributed Denial of Service, attacks can be difficult to understand, mitigate, and protect against. In this talk, learn the structure of DDoS attacks, how attackers build their networks, how DDoS attacks have evolved in recent years, and how law enforcement is taking action against bad actors.
Janna Hilferty, TriNet, Inc
Janna Hilferty is forever curious. Recognized by her coworkers as "Captain Marvel" and "Automation Champion," she is currently a DevOps engineer at HR company TriNet, and blogger at TechGirlKB.guru. In the not-so-distant past, she was nicknamed "War Janna" for her fierce dedication to problem-solving in web hosting and Linux server admin work, where she specialized in performance, cacheability, and technical documentation.
Workshops
Salon ABCD
Creating Your First Serverless Application on AWS
Fernando Medina Corey
This workshop will show you how to build your first fullstack serverless application using Amazon Web Services. You will look at all the tools and services required to create a web app using AWS services that don't require you to manage server infrastructure.
If you've never heard of "serverless" before, I'll explain the benefits and drawbacks of moving the responsibility for the configuration and maintenance of servers to a provider like AWS. This workshop will have you working with several AWS services, including Amazon S3, Amazon DynamoDB, AWS Lambda, and Amazon API Gateway, to create your first fullstack serverless application. Finally, I'll show how to leverage the Serverless Framework to accelerate your development process, manage your infrastructure, and help you integrate monitoring and security in your application.
Fernando Medina Corey, Serverless
Fernando Medina Corey is a Solutions Architect at Serverless Inc. where he helps folks learn about and build with Serverless technologies. He has also published courses on topics ranging from cloud cost optimization to the Internet of Things. Fernando regularly publishes technical articles and tutorials on his blog (fernandomc.com) where you can learn more about cloud services, design patterns, and serverless development.
3:30 pm–4:00 pm
Break with Refreshments
Ballroom Foyer
4:00 pm–5:30 pm
Track I
Salon F
Expect the Unexpected! A Method for Handling Unplanned Work
Dan O'Boyle and Brian Artschwager, Stack Overflow
Does your team suffer long, unstructured meetings?
Is your team mired in unplanned work?
Do you struggle to provide a timeline or status on deliverables?
Are you interested in practical street magic?
We’re not magicians, but we can help with the first 3 problems!
Ops teams tend to have a higher volume of unplanned work than any other similarly sized team.
This talk will attempt to explain the details of a practical method of managing unplanned work, through the engaging story of how our team used this method to systematically process our previously unending backlog.
Have you EVER been this excited about process and procedure!?
Unfortunately, it's all downhill from here.
But we promise to keep things as lively as possible, without fireworks, or magic (Stop asking).
Brian Artschwager, Stack Overflow
Brian is an Internal Support Engineer @ Stack Overflow working on software, servers, and networking to make life easier for our employees so they can make Stack Overflow an even better experience for our users.
Dan O'Boyle, Stack Overflow
Dan works as an Internal Support Engineer @ Stack Overflow. He started his career as a high school teacher and transitioned into a System Administrator. He enjoys creative collaboration to solve solvable things, and using automation for everything else.
Expand Contract Pattern for Continuous Delivery of Databases
Leena S N, Good Karma, Bangalore
Modifying the database schema is scary. It is hard to roll back the changes if something goes wrong. But it is difficult to avoid necessary refactoring of the database.
This talk covers the Expand/Contract pattern, a way to make significant changes to the application in a safer and reversible manner.
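The abstract does not spell out the steps of the pattern. As commonly described, it splits a schema change into an expand step (add the new structure next to the old one), a migration step (backfill and dual-write), and a contract step (remove the old structure once nothing depends on it). The following is only a minimal sketch of that idea, not the speaker's implementation; it uses Python's built-in sqlite3 module, and the `users` table and the `name` and `full_name` columns are hypothetical names chosen for illustration.

```python
import sqlite3


def expand(conn):
    # Expand: add the new column next to the old one.
    # Code that only knows about `name` keeps working.
    conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")


def backfill(conn):
    # Migrate: copy existing values into the new column.
    conn.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")


def contract(conn):
    # Contract: once nothing reads `name`, rebuild the table without it.
    # A table rebuild is used here so the sketch does not depend on
    # DROP COLUMN support in the installed SQLite version.
    conn.executescript(
        """
        CREATE TABLE users_new (id INTEGER PRIMARY KEY, full_name TEXT);
        INSERT INTO users_new (id, full_name) SELECT id, full_name FROM users;
        DROP TABLE users;
        ALTER TABLE users_new RENAME TO users;
        """
    )


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")

    expand(conn)
    # During the transition the application writes both columns.
    conn.execute(
        "INSERT INTO users (name, full_name) VALUES (?, ?)",
        ("Grace Hopper", "Grace Hopper"),
    )
    backfill(conn)
    contract(conn)

    columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
    names = [row[0] for row in conn.execute("SELECT full_name FROM users ORDER BY id")]
    conn.close()
    return columns, names
```

Each step can be deployed on its own, and the change remains reversible until the contract step: before that point the old column is still present and still populated.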
Leena S N, Co-founder/CTO/Programmer @ Good Karma, Bangalore
A pragmatic & passionate programmer, lean thinker, eXtreme Programming evangelist, hooked into Continuous Delivery. A mother of two lovely angels.
Track II
Salon E
Sub-Region Failure: How to Handle the Partial Loss of a Data Center
Joe Gasperetti and Yang Xia, Facebook
Large internet companies like Facebook operate out of multiple geo-distributed data centers (DCs) connected via global backbone networks. At this scale, it is common to experience large scale failures, like submarine network cable disconnection, as well as localized physical failures including flipped power breakers, water intrusion, electrical fires, cooling failure and more. Previous research in Disaster Recovery (DR) focuses on minimizing the impact of losing entire data centers by quickly moving traffic and data away from an affected DC.
But what if we could endure physical failures while keeping the DC online? Facebook’s Sub-Region DR initiative aims to handle the partial loss of a data center without expanding the failure scope to the entire DC. Our approach is to work with software teams to make systems durable to partial failure. We will describe how we built an “auditor” which understands stateless, stateful and storage systems and can simulate the effects of power outages without pulling the plug. We will also share testing stories about disconnecting machines on purpose, and war stories about power plugs pulled by accident.
Joe Gasperetti, Facebook
Joe Gasperetti is a Production Engineer at Facebook. He currently works on the Web Foundation team, which is responsible for the uptime and reliability of Facebook's infrastructure. Before Web Foundation, he spent five years working on media storage.
Yang Xia, Facebook
Yang Xia is a Software Engineer at Facebook. He currently works on the Disaster Recovery team. He pulls the plug on data centers on purpose to test their resiliency. Before Disaster Recovery, he spent a year running Red Teams to test the physical security of Facebook data centers.
Distributed Sys Teams
Sri Ray, Fastly
Single Point of Failure is a term we all dread in the SRE world. We go through the pain of making sure services are resilient and distributed and yet, more often than not, we fail to give the same treatment to the most critical part of any system—the Humans.
This talk will focus on the importance of hiring remote and hiring across the world. We will also touch on little changes you can make to foster such an environment.
Distributed teams not only add value to the core systems but also help bring us closer to one another.
Sri Ray, Fastly
Sri Ray works at the intersection of ops, security, and doing the right thing. He searches for solutions that respect and complement the Human element of systems. When not architecting systems or dreaming of the next improvement to make, he spends most of his free time on planes traveling around the only place we have all called home—Earth. He uses this opportunity to understand cultures and, more importantly, relish local food.
Workshops
Salon ABCD
Running Excellent Retrospectives: Talking with People
Courtney Eckhardt
How many awful meetings have you been to in your life, where people are talking forever and saying nothing, or where people are talking at cross purposes and not listening, or where they're saying things that make everyone feel bad? Have you been in retrospectives like that? (Did it make you never want to attend a retrospective again?)
Let's do better! Come learn practical techniques for facilitating pleasant, productive, welcoming retrospectives (which will improve any meeting you attend). We will talk about the structure of welcoming language and discuss when it's necessary to interrupt someone. We'll examine what it means for language to include blame and how to reframe blaming conversations. We'll practice the mental work of understanding things that seem contrafactual but are actually just confusing. When you leave, you'll be ready to make any meeting or retrospective you're in more comfortable and effective.
Courtney Eckhardt
Courtney comes from a background in customer support and internet anti-abuse policy. She combines this human-focused experience with the principle of Conway’s Law and the work of Kathy Sierra and Don Norman into a wide-reaching and humane concept of operational reliability. You can find her knitting in the audience of conference talks, and she's always interested in cat pictures.
6:00 pm–8:00 pm
Conference Reception
Exhibit Hall
Take a break from your laptops, and join us at our annual LISA Reception for dinner, drinks, and the opportunity to connect with other attendees, speakers, and conference organizers.
8:00 pm–9:00 pm
Lightning Talks
Salon E
Lightning Talks
- 20 Years of Linux on the Mainframe
  Elizabeth K. Joseph, IBM
- 30 Interviews Later…
  Paige Bernier, LightStep
- How to SRE without an SRE on Your Team
  Amiya Adwitiya, Squadcast Inc
- Hospital—Automated Runbook for Failures in System
  Jainam Shah, Student at IIT Ropar
- SRE for Cats
  Denise Yu, Pivotal
- An Overview of Network Shells
  Michael Smith, Puppet, Inc.
- Dr. Patternson: Automated Debugging Runbooks
  Vanish Talwar, Facebook
- A Computational Storage Parlour Trick (Demo)
  Dan Pollock, Data Storage Science
- March Madness!
  Alex Hidalgo, Squarespace
8:00 pm–11:00 pm
Birds-of-a-Feather Sessions (BoFs)
View the full schedule of BoFs on the LISA19 BoFs page.
Wednesday, October 30
8:00 am–9:00 am
Continental Breakfast
Ballroom Foyer
9:00 am–10:30 am
Track I
Salon F
Depression Memes for Devops Teens: Self-Care for Server Janitors and Other Humans
Anirudh Ra
Content warning: depression, anxiety, suicidal ideation, domestic violence, abuse.
Come listen to Anirudh Ra get real and talk about his journey through mental health and Production Engineering / Site Reliability Engineering. He'll share his failures and the lessons he learned, the mistakes he made and the tools he collected in order to work and live better, the times when his toes were stepped on and the times when he stepped on toes.
Anirudh will also talk about living with neurodivergence in and out of the workplace. There will be a bunch of advice for noobs as well as advice for people who are managing, leading, or mentoring noobs. This will be focused on self-care, mental health, time management, life-work balance, emotional burnout... You get the drift. Remember the salt: if you have met one neurodivergent person, you have met ONE neurodivergent person. Treat everyone on a case-by-case basis. YMMV.
Anirudh Ra
This man is a human being who uses he/him pronouns. He likes bread, cats, cute little used bookstores, single-origin chocolate, gong-fu cha, fountain pens and good paper, third-wave coffee shops, and discovering music he has never heard the likes of before in cute little record stores.
How to Have an Operational Incident (A Crash Course)
Courtney Eckhardt
What happens at your company when a service goes down? Hopefully an alarm fires somewhere and someone gets paged, but then what? Does the person who got paged fix it all themselves (and do they feel as isolated as that sounds)? What if they don’t know how? Is there a procedure for them to get help? Do you have a protocol for deciding when the incident is over?
More and more, most of us work at companies that provide a service. Even if you’re a game dev or you work at a retailer, the way you interface with your customers is a web service, and services have outages. Let’s talk about the basics of incident response: what it is, how it helps, and how to learn more. I can't fix all your problems in a 40-minute talk, but I can help get you going in the right direction!
Courtney Eckhardt
Courtney comes from a background in customer support and internet anti-abuse policy. She combines this human-focused experience with the principle of Conway’s Law and the work of Kathy Sierra and Don Norman into a wide-reaching and humane concept of operational reliability. You can find her knitting in the audience of conference talks, and she's always interested in cat pictures.
Track II
Salon E
Containerizing Your Monolith
Jano González, SoundCloud
Going from a monolithic architecture to microservices is never a "big bang" migration: you need to keep delivering features to your users, so services are extracted as the need arises.
During the transition, teams deal with both the monolith and microservices. The differences in technology and delivery process between the two, combined with the decreased attention the monolith receives, can decrease the confidence level, slowing down delivery and even planned microservice extractions.
How much should you invest in the original monolith? Too little can lead to teams avoiding it; too much means wasting resources on a deprecated component.
This talk summarizes the experience of containerizing our monolith to run it on the same infrastructure as our microservices, using the same delivery process. We will cover how this improved confidence in delivery, how it simplified operations, and how it enables current initiatives like our multi-datacenter architecture, as well as the problems we encountered along the way.
Jano González, SoundCloud
Jano spent the last 5 years as a backend engineer at SoundCloud, extracting services from its original monolith and becoming one of the monolith's main maintainers. He currently works for the same company in an SRE role. He eats his own dog food by publishing tracks on https://soundcloud.com/janogonzalez and https://soundcloud.com/velvetsystem82
Jupyter Notebooks for Ops
Derek Arnold
Jupyter Notebooks have been used by data professionals of many types to provide a new and exciting way to glean insight from the massive amounts of data that have become ubiquitous in our professional lives.
Why should operations-minded folks miss out on this fun?
Derek Arnold
Derek Arnold is a 20-year veteran of various technical roles (system administrator, web developer, user support, instructor). Derek has held positions in the telecommunications, education, healthcare, IT services, government, and manufacturing sectors.
Workshops
Salon ABCD
Defenders' Guide to Container Infrastructure Security
Madhu Akula
Organizations using microservices or any other distributed architecture rely heavily on containers and orchestration engines like Kubernetes, so the security of that infrastructure is paramount to their business operations. In this training, we will use open source tools, techniques, and procedures (TTPs) to build secure container infrastructure, applying security at many layers, such as infrastructure security, supply chain security, and run-time security, with real-world scenarios. Attendees can directly apply the outcome of this workshop in their organizations and daily operations, using practical, open source security skills.
Madhu Akula
Madhu Akula is a security ninja, published author, and cloud native security researcher with extensive experience. He is also an active member of the international security, devops, and cloud native communities (null, DevSecOps, AllDayDevOps, etc.). He holds industry certifications such as OSCP (Offensive Security Certified Professional) and CKA (Certified Kubernetes Administrator).
Madhu frequently speaks and runs training sessions at security events and conferences around the world, including DEFCON (24, 26 & 27), BlackHat USA (2018 & 19), USENIX LISA 2018, O’Reilly Velocity EU 2019, Appsec EU 2018, All Day DevOps (2016, 17, 18 & 19), DevSecCon (London, Singapore, Boston), DevOpsDays India, c0c0n (2017, 18), Nullcon (2018, 19), SACON 2019, Serverless Summit, null, and multiple others.
His research has identified vulnerabilities in 200+ companies and organisations, including Google, Microsoft, LinkedIn, eBay, AT&T, WordPress, NTOP, and Adobe, and he has been credited with multiple CVEs, acknowledgements, and rewards. He is co-author of Security Automation with Ansible 2 (ISBN-13: 978-1788394512), which is listed as a technical resource by Red Hat Ansible. He also won 1st prize for building an Infrastructure Security Monitoring solution at the InMobi flagship hackathon among 100+ engineering teams.
10:30 am–11:00 am
Break with Refreshments
Ballroom Foyer
11:00 am–12:30 pm
Track I
Salon F
Sysadmins' Introduction to Vulnerability Scanning
Tabitha Sable
Effective vulnerability scanning requires broad knowledge: software vulnerabilities and exploitation, networking, the chosen vulnerability scanning tool, and the servers and applications being inspected. Ops folks are often left out of the vulnerability scanning process, but our participation can make everyone's life better. This presentation will review some of the basics needed to start getting involved.
Tabitha Sable, Unaffiliated
Tabitha has been a hacker and cross-platform sysadmin since the turn of the century. She can often be found teaching network offense and defense to sysadmins, system administration to security folks, and asking questions that start with "I wonder what happens if we..."
Speculative and Traditional Execution Side Channel and Software Protection Mechanisms
Neelima Krishnan, Intel
This presentation focuses on the common characteristics of speculative execution side-channel methods and how they compare to traditional side-channel methods. Neelima will introduce some of the architectural concepts that designers have created over the years to enhance performance, and then discuss how researchers have used those same concepts to describe how malicious actors could potentially infer secret data. She will focus on some common software environments and where they may be exposed to speculative execution side-channel methods. Then, based on different approaches that Intel is taking to mitigate the issues in collaboration with the open source community, she will provide techniques that developers can implement to better safeguard their software and secrets.
Neelima Krishnan, Intel
Neelima is a Software Engineer at Intel. She works as part of a team that creates and validates side-channel mitigations for the Linux kernel.
Track II
Salon E
Testing for the Terrified: How to Write Tests, Conquer Guilt, and Level Up
Frances Hocutt, Rackspace
Have you ever felt like you “should” be writing tests for your code, but not known where to start? Have you been swamped by subtly different test-driven development tutorials? Do you feel vague guilt about not following “best practices,” but still can't figure out how to get started?
This talk will take a harm-reduction approach to learning automated testing. You’ll find that writing tests simplifies your work so that you can improve your code, reduce debugging time and duplicated work, and eliminate that nagging guilt.
This talk will include:
- Tests written in the wild, before your very eyes!
- How to get at the fear of getting started and making mistakes.
- Ways to start small and work on incremental progress - testing ten percent of your code is infinitely better than testing none.
- Suggestions for ways to continue growing as a writer of tests - without sorting through that list of tutorials.
Frances Hocutt, Rackspace
Frances Hocutt has taken part in the science-to-tech branch of the great STEM reshuffling. In the process, he’s written, spoken, mentored, and co-founded Seattle’s first feminist hackerspace/makerspace. Frances prefers elegance in science and effectiveness in art and is happiest when drawing on as many disciplines as possible. Frances jumped into F/OSS development with work on standards for the MediaWiki web API ecosystem and expanded into work on MediaWiki and associated Wikimedia-ecosystem contributor tools. He currently installs software on other people’s computers for Rackspace Managed Security’s defensive infrastructure team and enjoys encouraging new programmers. Frances currently lives in an unfortunately catless apartment in Oakland, CA.
What Connections Can Teach Us about Postmortems
Chastity Blackwell, Truss
Too often, postmortems go into "write-only memory": they are put away in an archive to satisfy some requirement but rarely used to actually drive improvement. They can be so dense that it's hard for anyone to derive real insight from them, or so surface-level that they don't convey any of the nuances that actually surround most incidents. How can we create a postmortem document that is both an interesting story and something that provides some hint of the complexity underlying an incident?
In 1978, James Burke made Connections—a TV series that attempted to describe history in a new way, one that avoided the conventional "straight-line" view of history, great people, and golden ages, instead focusing on the surprising relationships between people and events. This talk will describe how you can use Burke's techniques to make your postmortems compelling reading that also teaches valuable lessons.
Chastity Blackwell, Truss
Chastity Blackwell took her first job as a system administrator in 1999 just to pay the bills until she could get a writing job. After spending more than a decade with the University of Illinois' central IT organization, she moved out to the Bay Area where she's worked for companies big and small. In her heart, she longs to return to somewhere with real weather.
Workshops
Salon ABCD
Deep Dive into Kubernetes Internals for Builders and Operators (Tutorial)
Jérôme Petazzoni, Tiny Shell Script LLC
Note: this tutorial is also available as a 40-minute talk with the same content but without the hands-on labs.
If you operate (or plan to operate) Kubernetes, it's helpful to understand its internals: what are the components of the control plane? What are their respective roles? How do they communicate?
We'll start by explaining exactly what happens between the execution of commands like "kubectl run" and "kubectl expose" and the moment when the containers are actually running and available on the cluster.
Then we'll build a simplified cluster, one component at a time, until it can execute that "kubectl run" command. We will add networking with kube-proxy to provide connectivity to services, and CNI plugins to provide connectivity to pods themselves.
To get the most out of this tutorial, you should be familiar with basic Kubernetes concepts like deployments, pods, and services.
We will provide remote cloud VMs to each attendee for the duration of the tutorial, so you don't need to download or install anything prior to the tutorial. All you need is an SSH client.
Jérôme Petazzoni, Tiny Shell Script LLC
Jérôme was part of the team that built, scaled, and operated the dotCloud PaaS, before that company became Docker. He worked seven years at the container startup, where he wore countless hats and ran containers in production before it was cool. He loves to share what he knows, which led him to give hundreds of talks and demos on containers, Docker, and Kubernetes. He has trained thousands of people to deploy their apps with confidence on these platforms and continues to do so as an independent consultant. He values diversity and strives to be a good ally, or at least a decent social justice sidekick. He also collects musical instruments and can arguably play the theme of Zelda on a dozen of them.
12:30 pm–2:00 pm
Conference Luncheon
Exhibit Hall
2:00 pm–3:30 pm
Track I
Salon F
Level Up Your Career with Soft Skills
Yoz Grahame, LaunchDarkly
Many of us want to see our careers progress from junior to senior to lead or manager. To do this, we improve our technical skills, contribute to open source projects, and work on side projects. That's great, but technical skills will only get you so far. To truly succeed, you need to master the soft skills–also known as people skills or core skills.
These skills are sometimes called "non-technical." That's not an accurate characterization. These skills are technical. And like all technical skills, they require practice and refinement. Don't approach soft skills with a fixed mindset; employ a growth mindset and learn how you can continuously refine and improve these critical skills. During this talk, you will learn essential soft skills, tips on how to improve them, and how biases influence our ability to learn. Avoid misunderstandings and miscommunication by improving your soft skills.
Yoz Grahame, LaunchDarkly
Yoz Grahame is a Developer Advocate for LaunchDarkly because he wants software engineering to be far less painful than it is now. Previous involvements include: the US Government (in the 18F group), Compaas, Linden Lab, British e-democracy projects WriteToThem and TheyWorkForYou, and Douglas Adams' startup The Digital Village.
Blameless Incidents: Learning from Failure at Scale
Chip Turner, Facebook, Inc.
How a company handles outages is a conscious decision, and being intentional about the mindset you cultivate is critical to long-term reliability and operability. Building a culture that embraces crises as learning opportunities rather than failures is a crucial component of healthy Incident Management.
Facebook’s blameless, reflective approach tries to make the most of every outage, large and small. Our scalable Incident Management program is designed to be used for incidents of all sizes, from full site issues to minor, localized problems affecting small, non-critical services. This talk will discuss the cultural and technical challenges of maintaining an open culture that focuses on moving fast while keeping a high bar for operational excellence and reliability. We will explore the principles, tools, and processes we use to accomplish these goals, how we scale communication during incidents, and how our open-door review culture reinforces our blameless approach while still maintaining high standards.
Chip Turner, Facebook, Inc.
Chip Turner is a Director of Engineering at Facebook, where he focuses on site reliability on the Web Foundation team. As a first responder for many years for incidents large and small, Chip has been involved with all phases of Incident Management. Chip has functioned in both SWE and SRE roles, working primarily in databases, storage, and caching systems in massively distributed environments.
Track II
Salon E
Fast, Safe, and Reliable: The Future of Configuration
Qui Nguyen, Yelp
Who wants to have to deploy an application just to change the values of some constants? At Yelp, we didn't, so we built a system that allows our web services to dynamically reload constants without being re-deployed or restarted, significantly increasing developer velocity. This system distributes configuration values using simple files to the thousands of servers we run and is used by hundreds of different services.
On the other hand, constants can make a big difference in the performance or correctness of your code. As we improved our code deployment processes, we found that more and more outages were being caused by configuration changes, because we weren't giving our configuration the same attention as our code.
This talk will cover how our configuration system works and how we were able to make it safer, without reducing its reliability or slowing developers down.
Qui Nguyen, Yelp
Qui Nguyen works as a software engineer on the Compute Infrastructure team at Yelp, building systems to deploy and run Yelp's services quickly and efficiently. Before infrastructure, she worked on data processing on the ads team, studied Computer Science at MIT, and briefly wanted to be a mathematician.
POST No AWS Bills: Cloud Cost Optimization without APIs
Corey Quinn, The Duckbill Group
AWS bills (and, to a lesser extent, those of other cloud vendors) are vast and deep, out of necessity. It's normal to feel overwhelmed when staring at them. What should you care about? What shouldn't you care about? There's gotta be a better way to control, optimize, and manage your spend past "buy some reserved instances!" This talk covers advanced concepts about AWS bills, how finance's understanding of the bill can misalign with engineering's, and what can be done to influence spend without disrupting your engineers for two years with demands to rewrite everything.
Corey Quinn, The Duckbill Group
Corey is the Cloud Economist at The Duckbill Group. Corey specializes in helping companies improve their AWS bills by making them smaller and less horrifying; hosts the Screaming in the Cloud and AWS Morning Brief podcasts; and curates Last Week in AWS, a weekly newsletter summarizing the latest in AWS news, blogs, and tools, sprinkled with snark.
Workshops
Salon ABCD
BPF Performance Tools
Brendan Gregg, Netflix
BPF (eBPF) tracing is the superpower that can analyze everything, helping you find performance wins, troubleshoot software, and more. This tutorial shows you how to use the open-source BCC and bpftrace tools to find performance wins across a variety of application and system targets, and how to create your own Linux observability tools with BPF/bpftrace. We will also discuss challenges and fixes for real-world analysis, including lessons learned from its production use at Netflix, so you can hit the ground running when you return to work.
Brendan Gregg, Netflix
Brendan Gregg is an industry expert in computing performance and cloud computing. He is a senior performance architect at Netflix, where he does performance design, evaluation, analysis, and tuning. He is the author of BPF Performance Tools (Addison-Wesley) and Systems Performance (Prentice Hall), and received the USENIX LISA Award for Outstanding Achievement in System Administration. Brendan has created numerous performance analysis tools, visualizations, and methodologies, including flame graphs.
3:30 pm–4:00 pm
Break with Refreshments
Ballroom Foyer
4:00 pm–5:30 pm
Keynote Addresses
Salon EF
When /bin/sh Attacks: Revisiting "Automate All the Things"
J. Paul Reed, Netflix
The HBO hit series Westworld tells us of a place where we can "Live without limits!" This promise might remind us of the "magic" with which automation is often spoken about. To be sure, automation is a cornerstone of DevOps, SRE, and modern operations practices, the A in DevOps' venerable CAMS, and the subject of one of its oldest, most famous memes: "Automate ALL the things."
But are there processes we shouldn't automate? What if HOW we automate actively causes harm to us and the systems we're responsible for? We'll take a look at what human factors have to do with automation, at some of the impacts and challenges pervasive automation has presented for systems administrators and SREs, at some important considerations when automating our complex, living socio-technical systems, and at some strategies to cope when the shell scripts strike back!
J. Paul Reed, Netflix
J. Paul Reed has over twenty years of experience in the trenches as a build/release and operations engineer, working with such companies as VMware, Mozilla, Postbox, Symantec, and Salesforce.
He's worked across a number of industries, from financial services to cloud-based infrastructure to health care, with teams ranging from 2 to 2,500, on everything from tooling and operational analysis and improvement to cultural transformation and business value optimization. He's currently a member of Netflix's CORE SRE team, focusing on resilience engineering and human factors in distributed socio-technical systems.
Why Are Distributed Systems So Hard?
Denise Yu, Pivotal
Distributed systems are notoriously difficult to wrangle. But why? This talk will cover a brief history of distributed computing, clear up some common myths about the CAP theorem, dig into why network partitions are inevitable, and close out by highlighting how a few popular consensus algorithms mitigate the risks of operating in a distributed fashion. We'll also take a look at how to design systems for greater adaptability by human factors, which can help reduce the impact of programmatic uncertainty.
Denise Yu, Pivotal
Denise is a Senior Software Engineer at Pivotal who occasionally wears a product management hat. Denise has previously delivered conference talks on topics ranging from continuous delivery to functional programming to scaling company culture. She enjoys learning about distributed systems, release engineering, and low-level Linux kernel programming, and when she's not coding, she is often doodling sketch notes that break down technical concepts into digestible pieces at deniseyu.io/art.
5:30 pm–5:45 pm
Closing Remarks
Salon EF
Program Co-Chairs: Patrick Cable, Threat Stack, Inc., and Mike Rembetsy, Bloomberg