A variety of topics are being covered at LISA '13. Use the icons listed below to focus on a key subject area:

  • Cloud System Administration
  • Coding
  • DevOps
  • Linux
  • Soft Skills
  • WiAC

Follow the icons throughout the technical sessions below. You can combine days of training or workshops with days of technical sessions content to build the conference that meets your needs. Pick and choose the sessions that best fit your interest—focus on just one topic or mix and match.

Proceedings Front Matter: 
Cover Page | Title Page and List of Organizers | Table of Contents | Message from the Program Co-Chairs

Full Proceedings PDFs
 LISA '13 Full Proceedings (PDF)
 LISA '13 Proceedings Interior (PDF, best for mobile devices)
 LISA '13 Erratum (PDF)

Full Proceedings ePub (for iPad and most eReaders)
 LISA '13 Full Proceedings (EPUB)

Full Proceedings Mobi (for Kindle)
 LISA '13 Full Proceedings (MOBI)

Download Proceedings Archive (Conference Attendees Only)

Attendee Files 
Downloadable Proceedings Archive for Registered Attendees

 

Wednesday, November 6, 2013

8:15 a.m.–9:00 a.m. Wednesday

Continental Breakfast

Thurgood Marshall Ballroom Foyer

8:45 a.m.–9:00 a.m. Wednesday

Opening Remarks and Awards

Program Co-Chairs: Narayan Desai, Argonne National Laboratory; Kent Skaar, VMware, Inc.

Thurgood Marshall Ballroom

9:00 a.m.–10:30 a.m. Wednesday

Keynote Address

Thurgood Marshall Ballroom

Modern Infrastructure: The Convergence of Network, Compute, and Data

Jason Hoffman, Founder, Joyent

 

The three pillars of our industry are network, compute, and data. All trends come down to the convergence of these. The convergence of network and compute gave us "the network is the computer"; the convergence of network and data spawned the entire networked storage industry; and now we believe we're in the technology push that converges compute and data. At Joyent, we took a fresh look at building a datacenter as if it were a single appliance: we adopted a storage-centric, "software-defined everything, but always offload to hardware when you can" approach, and we intend to do everything in the open. In this talk, we'll cover the philosophical basis, the overall architecture, and the deep details of a holistic datacenter implementation.

Available Media
10:30 a.m.–11:00 a.m. Wednesday

Break with Refreshments

Thurgood Marshall Ballroom Foyer

11:00 a.m.–12:30 p.m. Wednesday

Invited Talks 1

Thurgood Marshall North/East Ballroom

Session Chair: Nicole Forsgren Velasquez

SysAdmins Unleashed! Building Autonomous Systems Teams at Walt Disney Animation Studios

Jonathan Geibel and Ronald Johnson, Walt Disney Animation Studios

How do you instill the agility and effectiveness of a startup company within the walls of one of the most storied animation studios in the world? This question guided our design of a new systems organization at Walt Disney Animation Studios. Our goals were simple: break down traditional top-down management silos, empower staff to make autonomous decisions, and remove bureaucracy. We used scientific method experimentation with different structures and ideas, discussing the impact of each change with our staff along the way. We’ll discuss the methods we used to empower sysadmins, and how we’ve evolved into an organization that’s designed for and by technical staff.

Jonathan Geibel is the Systems Technology Director at Walt Disney Animation Studios, where he leads a team of 50 technologists who push the boundaries of high-performance computing to advance the art of animation. A 20-year technology veteran, Jon focuses on designing adaptable organizations that enable engineering teams, high-performance computing, and technology futures.

Ronald Johnson is a 20-year SysAdmin veteran with a passion for communication technologies and organizational structures. He is currently a Systems Manager at Walt Disney Animation Studios.

Available Media

Becoming a Gamemaster: Designing IT Emergency Operations and Drills

Adele Shakal, Director, Project & Knowledge Management, Metacloud, Inc.; Formerly Technical Project Manager at USC ITS, ITS Great Shakeout 2011, IT Emergency Operations, and Drill Designer

Adele Shakal heads up project and knowledge management at Metacloud, Inc., a cloud solutions company providing Private Cloud as a Service based on OpenStack. She has nearly two decades of experience with IT project management, business process analysis and design, knowledge management, emergency operations and drill planning, business continuity, service management, system administration, and Web technologies. She has been a presenter, roundtable facilitator, and panelist on IT emergency preparedness, Google Apps for Education, project management and technical documentation, and advancing women in computing at CENIC, EDUCAUSE, APRU, USENIX LISA, and CascadiaIT conferences.

Bring emergency response and operations, business continuity, disaster recovery, and IT architecture together into practical drill design… and prepare your organization for whatever zombie apocalypse it may face.

Learn key concepts in emergency operations center and incident headquarters design, methods of introducing such concepts to your organization, and a sequence of basic-to-advanced drill designs.

Keeping IT folks engaged in a drill simulation can be very challenging. Become a gamemaster worthy of designing and executing drills on likely emergency scenarios and realistic function failures for your organization.

Hard-hats and D10s not included.

Adele Shakal has nearly two decades of experience in IT project management, business process analysis and design, knowledge management, emergency operations and drill planning, business continuity, service management, UNIX system administration, and web technologies. Her B.S. is in GeoChemistry from California Institute of Technology, and she has presented at CENIC, APRU, and CascadiaIT conferences. She now heads up Project & Knowledge Management at Metacloud, Inc., which provides on-premise, OpenStack-based private clouds as a service.

Available Media

Invited Talks 2

Thurgood Marshall South/West Ballroom

Session Chair: Mike Ciavarella


A Working Theory-of-Monitoring

Caskey L. Dickson, Site Reliability Engineer, Google Inc.

At Google we have discovered many common pitfalls and false simplifications that cause frustration and blind spots in monitoring systems. Internally we have our own home-grown monitoring systems, but to move beyond a hit-and-miss approach to monitoring, we have developed a formal model for such systems. This model serves as a framework for developing, evaluating, and evolving monitoring systems at Google that are suitable for operating at scale.

We will present our model, show how existing open source solutions fit (and don't fit!) into that model, and invite attendees to contrast it with their experiences. The goal is to encourage a larger discussion into the theory of monitoring and how current solutions can be evolved into more effective tools for operators of large systems.

Caskey Dickson is a Site Reliability Engineer/Software Engineer at Google, where he writes and maintains monitoring services that operate at "Google scale." In online service development since 1995, before coming to Google he was a senior developer at Symantec, wrote software for various Internet startups such as CitySearch and CarsDirect, ran a consulting company, and even taught undergraduate and graduate computer science at Loyola Marymount University. He has an undergraduate degree in Computer Science, a Master's in Systems Engineering, and an M.B.A. from Loyola Marymount.

Available Media

Effective Configuration Management

N.J. Thomas, Amplify Education

The state of configuration management is arguably still in its infancy, and detailed information on how to effectively integrate these tools into an existing site is scarce. Our aim is to describe current best practices for installing a configuration management system. We will cover code review tools for the version control backends and continuous integration systems in front. While we will cover some real-world examples, we are agnostic as to the choice of particular tools and CM systems, so all are welcome. The lessons learned are useful when building or maintaining an effective infrastructure that is orchestrated by configuration management.

N.J. Thomas is a Unix systems administrator for Amplify Education. He is currently focusing on evangelizing the benefits of installing configuration management everywhere. In his spare time he likes to play with BSD machines and configuration management systems. His research interests include creating effective command-line and curses-based tools for sysadmins.

Available Media

Invited Talks 3

Wilson ABC Room

Session Chair: John Looney


Our Jobs Are Evolving: Can We Keep Up?

Mandi Walls, Senior Consultant, Opscode Inc.

This talk will discuss the current shortage of skilled system administration professionals, the evolving skill set demanded by the changing global economy, and how we, as practitioners, can move our industry forward. We will look at baseline skill sets, professional development opportunities, attracting the next generation of system administrators, and providing strategic value to the organizations who rely on sysadmins, even when they don't realize it.

Mandi Walls is a technical consultant for Seattle-based Opscode, the makers of Chef. Prior to joining Opscode, Mandi served as a system administrator at Admeld, NHGRI, and AOL, where she ran sites including moviefone.com, games.com, and www.aol.com. She holds a Master's degree in Computer Science from the George Washington University, Washington, D.C., and an M.B.A. from UNC Kenan-Flagler Business School, Chapel Hill, North Carolina, through the global OneMBA program. She is a published author with O'Reilly and speaks at numerous conferences and events.

Available Media

The Guru Is In

Harding Room

Interviewing and Job Hunting

Adam Moskowitz

Adam Moskowitz, in his roles as IT manager and senior system administrator, and on behalf of several of his consulting clients, has interviewed more candidates for system administration positions than he can remember. By virtue of having worked for a lot of companies and clients, he has been a candidate for quite a few system administration positions. Over the years he's been asked good questions, bad questions, and horrible questions, and has seen candidates become flummoxed when asked what seemed like rather simple questions. All this plus his almost 30 years of experience in the field (not to mention a darned good ratio of interviews to job offers) have given Adam considerable field experience to draw on for this session.

12:30 p.m.–2:00 p.m. Wednesday

Lunch (on your own; head over to the concession cart at the Vendor Exhibition)

Exhibit Hall C

2:00 p.m.–3:30 p.m. Wednesday

Invited Talks 1

Thurgood Marshall North/East Ballroom

Session Chair: John Looney


User Space

Noah Zoschke, Sr. Platform Engineer, Heroku

When running your app "in the cloud," there is a dizzying stack of software layers controlling your trivially deployed code. In practice, it is your code → language VM → LXC container → Linux OS → Xen hypervisor → Linux OS → CPU.

Here we can look at each layer as a "user space," an expressive place that you are empowered to use, and a "kernel," the black box system that imposes strict constraints through an API.

By studying each layer in this way, we see self-similar properties, which offer insight into how best to participate in the ecosystem. We also come to understand why the cloud is built this way, and the huge benefits in power and efficiency this design offers to application developers.

Noah Zoschke is a lead engineer at Heroku, a cloud Platform-as-a-Service. He spends his time managing a team of infrastructure and systems engineers on the Heroku Runtime, a distributed code compilation, process management and process execution system responsible for running and scaling millions of applications.

Available Media

Observing and Understanding Behavior in Complex Systems

Theo Schlossnagle, CEO, Circonus

Complex systems have difficult-to-understand emergent behaviors. When distributed, they lack a canonical source of truth. We don't have the technology today to truly understand these systems, but that isn't an excuse for not improving their observability and the clarity of what is there. In this talk I will discuss techniques to instrument and observe complex systems.

Theo Schlossnagle is an expert in scalable systems design and telemetry collection and analysis. He has an academic background in distributed systems and founded several companies that build highly scalable Internet-facing software.

Invited Talks 2

Thurgood Marshall South/West Ballroom

Session Chair: Cory Lueninghoener


Building a Networked Appliance

John Sellens, Syonex

This talk tells the tale of designing and building a small networked computing appliance into a product, and the decisions, trial(s) and error(s), and false starts that it entailed. It will primarily cover the technical challenges, and the infrastructure and support tools that were required. The device is intended to be deployed unattended and in remote locations, which meant that the device and the supporting infrastructure had to be built such that it could be remotely managed and would be unlikely to fail. What could possibly go wrong?

John Sellens is a long-time sysadmin and LISA attendee who has taught tutorials, authored several LISA papers, and is a USENIX Short Topics author. He holds an M.Math in Computer Science from the University of Waterloo and is a reformed accountant. He is currently the proprietor of SYONEX, a systems and networks consultancy, and a member of the ops team at FreshBooks.

Available Media

How Netflix Embraces Failure to Improve Resilience and Maximize Availability

Ariel Tseitlin, Director, Cloud Solutions, Netflix

Netflix created a suite of tools, collectively called the Simian Army, to improve resiliency and maintain the cloud environment. Failure modes are typically corner cases, which are poorly tested, if tested at all. It is only by failing often that we can ensure that we are resilient to failure. We look for ways to induce failure in our production environment to better prepare us for the inevitable failures that will occur. This presentation will cover the motivation for inducing failure in production and the mechanics of how Netflix achieves it.
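The core idea of inducing failure can be sketched in a few lines. This is a hypothetical illustration, not Netflix's actual Chaos Monkey: `terminate` stands in for a real cloud API call, and the instance names are invented.

```python
import random

def terminate(instance_id):
    # Stand-in for a real cloud API call (e.g. a TerminateInstances
    # request); here we only record the action.
    print(f"terminating {instance_id}")

def chaos_round(instances, probability=0.1):
    """Pick a random subset of instances to kill, mimicking the routine
    failures a resilient service must already be able to survive."""
    victims = [i for i in instances if random.random() < probability]
    for victim in victims:
        terminate(victim)
    return victims

fleet = ["i-0a1", "i-0b2", "i-0c3", "i-0d4"]
killed = chaos_round(fleet, probability=0.5)
```

Running rounds like this on a schedule, during business hours when engineers are watching, turns rare failure modes into routine, observable events.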

Ariel Tseitlin manages the Netflix Cloud and is interested in all things cloudy. At Netflix, he is Director of Cloud Solutions, helping Netflix be successful in the cloud, including cloud tooling, monitoring, performance and scalability, and cloud operations and reliability engineering. Ariel's team builds Asgard and the Simian Army, including the Chaos Monkey. Prior to Netflix, Ariel was VP of Technology and Products at Sungevity and before that was the Founder and CEO of CTOWorks.

Available Media

Papers and Reports

Wilson ABC Room

Session Chair: Paul Krizak


Building Software Environments for Research Computing Clusters

Mark Howison, Aaron Shen, and Andrew Loomis, Brown University

Over the past two years, we have built a diverse software environment of over 200 scientific applications for our research computing platform at Brown University. In this report, we share the policies and best practices we have developed to simplify the configuration and installation of this software environment and to improve its usability and performance. In addition, we present a reference implementation of an environment modules system, called PyModules, that incorporates many of these ideas.

Available Media

Fixing On-call, or How to Sleep Through the Night

Matt Provost, Weta Digital

Matt Provost is the Systems Manager at Weta Digital. Weta Digital is a five-time Academy Award–winning visual effects facility in Wellington, New Zealand. The Systems team at Weta is responsible for all of the company's servers, storage, and networking. They run a 49,000 core renderwall. Matt has been a system and network administrator for over 15 years. He has a B.A. from Indiana University, Bloomington.

Monitoring systems are some of the most critical pieces of infrastructure for a systems administration team. They can also be a major cause of sleepless nights and lost weekends for the on-call sysadmin. This paper looks at a mature Nagios system that has been in continuous use for seven years with the same team of sysadmins. By 2012 it had grown into something that was causing significant disruption for the team and there was a major push to reform it into something more reasonable. We look at how a reduction in after hour alerts was achieved, together with an increase in overall reliability, and what lessons were learned from this effort.

Available Media

The Guru Is In

Harding Room

IPv6

Owen DeLong, Hurricane Electric

Owen DeLong is an IPv6 Evangelist at Hurricane Electric and a member of the ARIN Advisory Council. Owen brings more than 25 years of industry experience. He is an active member of the system administration, operations, and IP policy communities. In the past, Owen has worked at Tellme Networks (Senior Network Engineer); Exodus Communications (Senior Backbone Engineer), where he was part of the team that took Exodus from a pre-IPO startup with two data centers to a major global provider of hosting services; Netcom Online (Network Engineer), where he worked on a team that moved the Internet from an expensive R&E tool to a widely available public access system accessible to anyone with a computer; Sun Microsystems (Senior Systems Administrator); and more.

3:30 p.m.–4:00 p.m. Wednesday

Break with Refreshments

Thurgood Marshall Ballroom Foyer

4:00 p.m.–5:30 p.m. Wednesday

Invited Talks 1

Thurgood Marshall North/East Ballroom

Session Chair: Cory Lueninghoener


Storage Performance Testing in the Cloud

Jeff Darcy, Red Hat

Based on experience testing distributed storage systems in several public clouds, this talk will consist of two parts. The first part will cover approaches for characterizing and measuring storage workloads generally. The second part will cover the additional challenges posed by testing in public clouds. Contrary to popular belief, no two cloud servers are ever alike. Even the same server can exhibit wild and unpredictable performance swings over time, so new ways of analyzing performance are critical in these environments.
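One concrete consequence of that unpredictability: mean throughput hides the swings, so measurements should report latency percentiles. A minimal sketch of the idea (not the speaker's tooling; the write count and block size are arbitrary):

```python
import os
import tempfile
import time

def write_latencies(path, n=50, size=4096):
    """Time n small synchronous writes; return per-write latency in ms."""
    data = os.urandom(size)
    latencies = []
    with open(path, "wb") as f:
        for _ in range(n):
            start = time.perf_counter()
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # force the write through to the device
            latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies

with tempfile.TemporaryDirectory() as d:
    lats = sorted(write_latencies(os.path.join(d, "probe.dat")))

p50 = lats[len(lats) // 2]
p99 = lats[min(len(lats) - 1, int(len(lats) * 0.99))]
```

On a cloud server with noisy neighbors, the p99 can sit far above the median; that gap is exactly what a single average conceals.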

Jeff Darcy has worked on network and distributed storage systems for 20 years, including an instrumental role in developing MPFS (a precursor of modern pNFS) while at EMC. He is currently a member of the GlusterFS architecture team at Red Hat and frequently gives talks and tutorials about topics related to cloud storage.

Available Media

Surveillance, the NSA, and Everything

Bruce Schneier, Fellow, Berkman Center for Internet and Society

Bruce Schneier is an internationally renowned security technologist, called a "security guru" by The Economist. He is the author of 12 books—including Liars and Outliers: Enabling the Trust Society Needs to Survive—as well as hundreds of articles, essays, and academic papers. His influential newsletter "Crypto-Gram" and his blog "Schneier on Security" are read by over 250,000 people. He has testified before Congress, is a frequent guest on television and radio, has served on several government committees, and is regularly quoted in the press. Schneier is a fellow at the Berkman Center for Internet and Society at Harvard Law School, a program fellow at the New America Foundation's Open Technology Institute, a board member of the Electronic Frontier Foundation, an Advisory Board Member of the Electronic Privacy Information Center, and the Security Futurologist for BT—formerly British Telecom.

Abstract in progress, since there's new news every week.

Available Media

Invited Talks 2

Thurgood Marshall South/West Ballroom

Session Chair: Mike Ciavarella


Leveraging In-Memory Key Value Stores for Large-Scale Operations

Mike Svoboda, Staff Systems and Automation Engineer, LinkedIn; Diego Zamboni, Senior Security Advisor, CFEngine

Memcache, Redis, and most other in-memory key-value systems have traditionally been used to offload (scale) queries against backend databases. Facebook made this architecture famous, showing that it is possible to have thousands of Web server requests satisfied in sub-second time by standing up in-memory caches in front of databases. At LinkedIn, we have taken usage of in-memory caches in a completely opposite direction, leveraging them to answer operational questions:

  • Where does the httpd process run?
  • Which versions of the openssh package are installed in datacenter X?
  • Who has a network connection to machine Y?
  • What machines have experienced hardware failure?

By standing up Redis caches on each of our CFEngine policy servers, every client populates the caches on every execution of CFEngine. We have built a Python library at LinkedIn that leverages our "Range" lookup system to perform distributed queries against Redis on 60x policy servers in parallel. This approach allows us to answer any question about our infrastructure and have results delivered in under five seconds from tens or hundreds of thousands of machines. It allows our security team to find machines that could have been exploited, allows our SRE team to understand where services have been deployed, and helps SysOps build our inventory database system and modify our CMDB in real time.
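The fan-out pattern the abstract describes can be sketched in a few lines of Python. This is a minimal illustration, not LinkedIn's library: the key names and cache contents are invented, and plain dicts stand in for per-policy-server `redis.Redis` clients, but the parallel query-and-collect shape is the same.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for the Redis cache on each CFEngine policy server; in production
# each entry would be a redis.Redis(host=...) client (keys here are invented).
CACHES = [
    {"pkg:openssh": "6.2p2", "proc:httpd": "running"},
    {"pkg:openssh": "5.9p1"},
    {"proc:httpd": "stopped"},
]

def query_cache(cache, key):
    # A real client would call cache.get(key); dict.get mimics that here.
    return cache.get(key)

def fan_out(caches, key):
    """Query every policy-server cache in parallel; collect non-empty answers."""
    with ThreadPoolExecutor(max_workers=len(caches)) as pool:
        results = pool.map(lambda c: query_cache(c, key), caches)
    return [r for r in results if r is not None]

print(fan_out(CACHES, "pkg:openssh"))  # -> ['6.2p2', '5.9p1']
```

Because each lookup is an independent network round trip, the wall-clock time for the whole fleet is roughly that of the slowest single cache, which is how sub-five-second answers across many thousands of machines become plausible.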

Mike Svoboda currently works in System Operations at LinkedIn and is charged with administering all production automation. LinkedIn relies on CFEngine to tie major parts of infrastructure together, which has allowed LinkedIn the flexibility to scale horizontally indefinitely.

Diego Zamboni is a computer scientist, consultant, author, programmer, sysadmin, and overall geek who works as a senior security advisor at CFEngine. He has more than 20 years of experience in system administration and security, and has worked in both the applied and theoretical sides of the computer science field. Zamboni is the author of the book Learning CFEngine 3, published by O’Reilly Media.

Available Media

What We Learned at Spotify, Navigating the Clouds

Noa Resare and Ramon van Alteren, Spotify

We would like to share some lessons we have learned building a hybrid cloud system at Spotify. The Spotify backend, being very service-oriented, presents some interesting challenges when it comes to setting up development and test environments. Our solution has been to provide a virtualized self-service environment that lets you spawn machines. This environment has evolved over time into a hybrid solution using both Apache Cloudstack and Amazon’s public cloud. There are many pieces to this puzzle touching on topics such as authentication, configuration management, and service discovery.

Noa Resare is a senior engineer at Spotify, currently working with various cloud-related challenges. He is a committer with the Apache Cloudstack project. Noa’s background is in both operations and as a developer, and he has been giving technical presentations at DevOpsdays Göteborg, at Cassandra Europe in London and at Apache Cloudstack meetups in Ghent and in NYC.

Ramon van Alteren is a product owner at Spotify, currently responsible for the testing platform and a large part of the capacity provisioning tools. Ramon’s background is in both development and operations. He has given technical and not so technical presentations at, for example, Devopsdays and Velocity Europe conferences. His current focus is on agile leadership/product ownership for engineering teams.

Available Media

Lightning Talks

Wilson ABC Room

Session Chair: Lee Damon, University of Washington

Lightning talks are fast-paced and high-energy. These are back-to-back 5-minute presentations on just about anything. Talk about a recent success, energize people about a pressing issue, ask a question, start a conversation!

Lightning talks are an opportunity to get up and talk about what’s on your mind. You can give several lightning talks if you have more than one topic.

Registration is open now. Go to the Lightning Talks Sign Up form to register.

The Guru Is In

Harding Room

Hadoop

Charles Wimmer, Altiscale

Charles Wimmer is a Site Reliability Engineer at VertiCloud, where he is helping to build a Big Data Platform-as-a-Service based on Apache Hadoop. Charles has 17 years of experience in system administration, including operating Web proxy services at LinkedIn and operating 45,000 nodes in multi-tenant clusters at Yahoo!

6:30 p.m.–11:30 p.m. Wednesday

Evening Activities

Take a look at what's happening this evening at LISA '13.

 

Thursday, November 7, 2013

8:30 a.m.–9:00 a.m. Thursday

Continental Breakfast

Thurgood Marshall Ballroom Foyer

9:00 a.m.–10:30 a.m. Thursday

Plenary Session

Thurgood Marshall Ballroom

Blazing Performance with Flame Graphs

Brendan Gregg, Joyent

"How did we ever analyze performance before Flame Graphs?" This new visualization invented by Brendan can help you quickly understand application and kernel performance, especially CPU usage, where stacks (call graphs) can be sampled and then visualized as an interactive flame graph. Flame Graphs are now used for a growing variety of targets: for applications and kernels on Linux, SmartOS, Mac OS X, and Windows; for languages including C, C++, node.js, ruby, and Lua; and in WebKit Web Inspector. This talk will explain them and provide use cases and new visualizations for other event types, including I/O, memory usage, and latency.
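The core transformation behind a flame graph, folding many sampled call stacks into counts of unique stack traces, can be sketched briefly. The sample data below is invented and real profilers emit far more stacks, but the folded "frame;frame;frame count" output is the collapsed format that flame graph tooling typically consumes.

```python
from collections import Counter

# Hypothetical sampled call stacks, root-first, as a profiler might report them.
samples = [
    ("main", "parse", "read"),
    ("main", "parse", "read"),
    ("main", "render"),
]

# Fold identical stacks into counts: one "frame;frame;frame count" line per
# unique stack, ready to be rendered as flame-graph rectangle widths.
folded = Counter(";".join(stack) for stack in samples)
for stack, count in sorted(folded.items()):
    print(f"{stack} {count}")
# main;parse;read 2
# main;render 1
```

The rendered graph simply draws each frame as a box whose width is proportional to its count, stacked by call depth, which is why hot code paths jump out visually.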

Brendan Gregg is the lead performance engineer at Joyent, where he analyzes performance and scalability at any level of the software stack. He is the author of Systems Performance (Prentice Hall, 2013), and primary author of DTrace (Prentice Hall). He was previously a kernel engineer at Sun Microsystems where he developed the ZFS L2ARC, and has also developed numerous performance analysis tools. His recent work includes performance visualizations.

Available Media

10:30 a.m.–11:00 a.m. Thursday

Break with Refreshments

Thurgood Marshall Ballroom Foyer

11:00 a.m.–12:30 p.m. Thursday

Panel

Thurgood Marshall North/East Ballroom

Session Chair: Rikki Endsley


Women in Advanced Computing (WiAC)

Moderator: Rikki Endsley, USENIX Association

Panelists: Amy Rich, Mozilla Corporation; Deanna McNeil, Learning Tree International; Amy Forinash, NASA/GSFC; Deirdré Straughan, Joyent

Amy Rich has been a UNIX sysadmin for over 20 years at a variety of companies, has owned her own consulting business, helped organize multiple sysadmin conferences, and written professionally on the topic of UNIX systems administration. She currently works at the Mozilla Corporation and plays the part of both sysadmin and manager of the Release Engineering Operations IT team, providing the infrastructure that performs automated builds and tests for Firefox, Firefox OS, and Thunderbird across all of the platforms that Mozilla supports. She is a member of USENIX and the LISA SIG, and is a founding member of LOPSA.

Deanna McNeil started out as a mom who had to work part time and discovered a knack for systems administration. Now she pursues technology for the love of always learning.

Amy Forinash was born and raised in Prince George's County, MD, and is a happy product of the public school system there. After studying physics at the University of Maryland at College Park, she went on to become a system administrator in the mid-'90s, a job she still holds and has expanded to meet the IT security needs of her customer. In her personal life she enjoys long distance hiking, dressage with her Lippizan, and consuming adult beverages with other system administrators.

Deirdré Straughan has worked in technology for 25 years, on documentation, training, UI design, marketing, open source community management, event management, social media, video, and more. She tends to operate at the interfaces: between companies and customers, technologists and non-technologists, marketers and engineers, and anywhere else that people need help communicating with each other about technology. Much more about her life and work can be found at beginningwithi.com.

Available Media

Invited Talks 2

Thurgood Marshall South/West Ballroom

Session Chair: Nicole Forsgren Velasquez


LEAN Operations: Applying 100 Years of Manufacturing Knowledge to Modern IT

Ben Rockwood, Joyent

IT has evolved from an internal back-office services support team to an operations group that is responsible for service delivery to the end customer. Modern education hasn't prepared systems administrators with the skills and knowledge to meet these new challenges. Strikingly similar challenges were encountered in the manufacturing sector 100 years ago, many of which can and should be applied today in IT Operations to avoid reinventing the proverbial wheel. In this talk we will explore the problems, solutions, and applications that can give you a jump on the problems facing sysadmins today and in the decade to come.

Ben Rockwood is the Director of Cloud Operations for Joyent. With almost 20 years of UNIX systems administration experience, he has been an active blogger and writer for more than a decade. He strongly believes that "SA's help SA's." He lives in California, loves his wife and the challenges of systems administration, and is the father of five.

Available Media

Papers and Reports

Wilson ABC Room

Session Chair: Adam Oliner


Poncho: Enabling Smart Administration of Full Private Clouds

Scott Devoid and Narayan Desai, Argonne National Laboratory; Lorin Hochstein, Nimbis Services

Clouds establish a different division of responsibilities between platform operators and users than has traditionally existed in computing infrastructure. In private clouds, where all participants belong to the same organization, this creates new barriers to effective communication and resource usage. In this paper, we present poncho, a tool that implements APIs that enable communication between cloud operators and their users, for the purpose of minimizing the impact of administrative operations and load shedding on highly utilized private clouds.

Available Media

Making Problem Diagnosis Work for Large-Scale, Production Storage Systems

Michael P. Kasick and Priya Narasimhan, Carnegie Mellon University; Kevin Harms, Argonne National Laboratory

Intrepid has a very large production GPFS storage system consisting of 128 file servers, 32 storage controllers, 1152 disk arrays, and 11,520 total disks. In such a large system, performance problems are both inevitable and difficult to troubleshoot. We present our experiences taking an automated problem-diagnosis approach from proof of concept on a 12-server test-bench parallel-filesystem cluster to Intrepid's storage system. We also present a 15-month case study of problems observed from the analysis of 624 GB of Intrepid's instrumentation data, in which we diagnose a variety of performance-related storage-system problems in a matter of hours, as compared to the days or longer required by manual approaches.

Available Media

dsync: Efficient Block-wise Synchronization of Multi-Gigabyte Binary Data

Thomas Knauth and Christof Fetzer, Technische Universität Dresden
Awarded Best Paper!  

Backing up important data is an essential task for system administrators to protect against all kinds of failures. However, traditional tools like rsync exhibit poor performance in the face of today's typical data sizes of hundreds of gigabytes. We address the problem of efficient, periodic, multi-gigabyte state synchronization. In contrast to approaches like rsync, which determine changes after the fact, our approach tracks modifications online. Tracking obviates the need for expensive checksum computations to determine changes. We track modifications at the block level, which allows us to implement a very efficient delta-synchronization scheme. The block-level modification tracking is implemented as an extension to a recent (3.2.35) Linux kernel.

With our approach, named dsync, we can improve upon existing systems in several key aspects: disk I/O, cache pollution, and CPU utilization. Compared to traditional checksum-based synchronization methods dsync decreases synchronization time by up to two orders of magnitude. Benchmarks with synthetic and real-world workloads demonstrate the effectiveness of dsync.
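The online-tracking idea can be illustrated with a toy in-memory model. This is a sketch of the concept only: dsync itself records dirty blocks inside the Linux kernel's block layer, whereas the class below (its names, block size, and data are all invented) does so in userspace Python. The point is that every write marks the blocks it touches, so synchronization later copies only those blocks and never needs a checksum pass over the whole device.

```python
BLOCK = 4  # toy block size in bytes; real systems use 4 KiB or larger

class TrackedFile:
    """Toy model of dsync-style online modification tracking."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.dirty = set()  # indices of blocks modified since the last sync

    def write(self, offset, payload):
        # Apply the write, then mark every block the write overlaps as dirty.
        self.data[offset:offset + len(payload)] = payload
        first = offset // BLOCK
        last = (offset + len(payload) - 1) // BLOCK
        self.dirty.update(range(first, last + 1))

    def sync_to(self, replica):
        # Copy only the dirty blocks; untouched blocks cost no I/O at all.
        for b in sorted(self.dirty):
            replica[b * BLOCK:(b + 1) * BLOCK] = self.data[b * BLOCK:(b + 1) * BLOCK]
        self.dirty.clear()

src = TrackedFile(16)
dst = bytearray(16)
src.write(5, b"xy")      # touches only block 1 (bytes 4..7)
src.sync_to(dst)
print(dst == src.data)   # True, after transferring a single block
```

For a multi-gigabyte volume with a small working set of changed blocks, this is the source of the speedup over checksum-based tools: the transfer and the change detection are both proportional to what changed, not to the total size.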

Available Media

The Guru Is In

Harding Room

Secure Linux Containers

Daniel J Walsh, Red Hat

Dan Walsh, aka "Mr. SELinux," has been leading the SELinux effort at Red Hat for over 10 years. Dan works on SELinux userspace and policy for Fedora and RHEL. He has also developed Secure Virtualization and helps provide the security on OpenShift.

12:30 p.m.–2:00 p.m. Thursday

Lunch (on your own; head over to the concession cart at the Vendor Exhibition)

Exhibit Hall C

2:00 p.m.–3:30 p.m. Thursday

Invited Talks 1

Thurgood Marshall North/East Ballroom

Session Chair: Cory Lueninghoener


Systems Performance

Brendan Gregg, Joyent

Brendan Gregg is the lead performance engineer at Joyent, where he analyzes performance and scalability at any level of the software stack. He is the author of Systems Performance (Prentice Hall, 2013), and primary author of DTrace (Prentice Hall). He was previously a kernel engineer at Sun Microsystems where he developed the ZFS L2ARC, and has also developed numerous performance analysis tools. His recent work includes performance visualizations.

Available Media

Hacking your Mind and Emotions

Branson Matheson, SGT

Branson is a 25-year veteran of system administration and security. He started as a cryptologist for the US Navy and has since worked on NASA shuttle projects, TSA security and monitoring systems, and Internet search engines, while continuing to support many open source projects. He founded sandSecurity to provide policy and technical audits, plus support and training for IT security, system administrators, and developers. Branson currently is a systems architect for NASA; has his CEH, GSEC, GCIH, and several other credentials; and generally likes to spend time responding to the statement "I bet you can't."


Most admins will agree that users tend to be the weakest link in the maintenance of security in an environment. People are easily manipulated by their very nature. Social engineering techniques are used on us every day, and I will demonstrate how you can learn to recognize those techniques and build some strong defenses in your environment. I will discuss and demonstrate some basic psychology, social engineering theory, and actual implementation at several levels: remote interaction, indirect manipulation, and one-on-one engineering. I will also discuss some defense against these attacks through the use of effective policies and awareness training.


Available Media

Invited Talks 2

Thurgood Marshall South/West Ballroom

Session Chair: Tim Nelson


The Efficacy of Cybersecurity Regulation: Examining the Impact of Law on Security Practices

David Thaw, Visiting Assistant Professor of Law, University of Connecticut; Affiliated Fellow, Information Society Project, Yale Law School

Cybersecurity regulation presents an interesting quandary because private entities possess the best information about threats and defenses. Yet leaving the responsibility for setting security standards to individual actors bears risk—there will always be at least some organizations with deficient security, thus creating "weak links in the chain" that harm all organizations. Those same "weak links" are also least likely to be responsive to industry self-regulatory efforts.

Thus lawmakers and regulators, seeking to preserve trust in the overall information economy, create legal obligations designed to protect both individual consumers and organizations so that they may reasonably trust and do business with one another. My research explores the wisdom of those choices by comparing the two primary styles of cybersecurity regulation: 1) comprehensive security requirements under which organizations develop and adhere to their own individualized compliance plans; and 2) more traditional, directive regulation mandating compliance with precise specific standards.

My analysis suggests that a blend of these two modes of regulating is superior to either method alone. I present data from qualitative interviews with Chief Information Security Officers (CISOs) at leading multinational corporations, detailing the practical effects of how regulation drives their organizations' security practices, as well as quantitative data on breach incidence detailing the efficacy of these regulations at preventing data breaches.

David Thaw is a Visiting Assistant Professor of Law at the University of Connecticut and an Affiliated Fellow of the Information Society Project at Yale Law School. He is a law and technology expert whose research and scholarship examine the regulation of the Internet and computing technologies, with specific focus on cybersecurity regulation and cybercrime. Dr. Thaw received his Ph.D., J.D., and M.A. from the University of California, Berkeley, and his B.S. and B.A. from the University of Maryland.

Prior to joining the Law School faculty, Professor Thaw was a Research Associate on the University of Maryland Computer Science faculty, where he conducted research with the Maryland Cybersecurity Center and taught an undergraduate honors seminar on cybersecurity, law, and policy.

Professor Thaw is a frequent presenter on cybersecurity regulation and cybercrime. He has also testified before the U.S. House of Representatives regarding his research on cybersecurity regulation and its implications for federal legislation.

Available Media

The Intersection of Cyber Security, Critical Infrastructure, and Control Systems

Sandra Bittner, CISSP, Arizona Public Service, Palo Verde Nuclear Generating Station

The intersection of cyber security, critical infrastructure, and control systems is on the minds of people around the world: particularly, those designing, integrating, managing, modifying, regulating, defending, and using the systems; and most notably, those daring to exploit weaknesses in the systems to their own ends. There are requirements to secure critical infrastructure in cyberspace. Thus a race is on to apply cyber security controls prized for information systems to leading—and sometimes archaic—control systems and processes to stem the results of cyber attack. How has this worked out? What is the tipping point of success? What are those serving as pioneers and advocates in this arena doing? How are we tuning results to focus on grounded cyber security principles and practice to deliver innovation, measurable performance, and active defense in this time of confusion? What impacts are our efforts having at local, industry, sector, national, and international levels? Hear about one approach and the results of foundation cross-sector efforts in the U.S.A.

Sandra Bittner is a Senior Engineer and Cyber Security Specialist with Arizona Public Service at the Palo Verde Nuclear Generating Station. She provides cyber security leadership in the area of cyber security program development for the Strategic Teaming and Resources Sharing Alliance (STARS) and is an active member of the Nuclear Information Technology Strategic Leadership (NITSL) community.

Papers and Reports

Wilson ABC Room

Session Chair: Paul Krizak


HotSnap: A Hot Distributed Snapshot System For Virtual Machine Cluster

Lei Cui, Bo Li, Yangyang Zhang, and Jianxin Li, Beihang University

The management of a virtual machine cluster (VMC) is challenging owing to reliability requirements such as non-stop service, failure tolerance, etc. A distributed snapshot of the VMC is one promising approach to supporting system reliability: it allows the system administrators of data centers to recover the system from failure and resume execution from an intermediate state rather than the initial state. However, due to the heavyweight nature of virtual machine (VM) technology, applications running in the VMC suffer from long downtime and performance degradation during snapshots. Besides, the discrepancy in snapshot completion times among VMs brings the TCP backoff problem, resulting in network interruption between two communicating VMs. This paper proposes HotSnap, a VMC snapshot approach designed to enable taking hot distributed snapshots with milliseconds of system downtime and TCP backoff duration. At the core of HotSnap is the transient snapshot, which saves the minimum instantaneous state in a short time, and the full snapshot, which saves the entire VM state during normal operation. We then design a snapshot protocol to coordinate the individual VM snapshots into a globally consistent state of the VMC. We have implemented HotSnap on QEMU/KVM and conducted several experiments to show its effectiveness and efficiency. Compared to the live-migration-based distributed snapshot technique, which brings seconds of system downtime and network interruption, HotSnap only incurs tens of milliseconds.

Available Media

Supporting Undoability in Systems Operations

Ingo Weber and Hiroshi Wada, NICTA and University of New South Wales; Alan Fekete, NICTA and University of Sydney; Anna Liu and Len Bass, NICTA and University of New South Wales

When managing cloud resources, many administrators operate without a safety net. For instance, inadvertently deleting a virtual disk results in the complete loss of the contained data. The facility to undo a collection of changes, reverting to a previous acceptable state, is widely recognized as valuable support for dependability. In this paper, we consider the particular needs of the system administrators managing API-controlled resources, such as cloud resources on the IaaS level. In particular, we propose an approach which is based on an abstract model of the effects of each available operation. Using this model, we check the degree to which each operation is undoable. A positive outcome of this check is a formal guarantee that any sequence of calls to such operations can be undone. A negative outcome contains information on the properties preventing undoability, e.g., which operations are not undoable and why. At runtime we can then warn the user intending to use an irreversible operation; if undo is possible and desired, we apply an AI planning technique to automatically create a workflow that takes the system back to the desired earlier state. We demonstrate the feasibility and applicability of the approach with a prototypical implementation and a number of experiments.
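The undoability check can be sketched with a toy operation model. The operation names and table below are hypothetical, not the paper's actual formal model, but they illustrate the two outcomes the abstract describes: a guarantee when every operation in a sequence has an inverse, and a warning naming the irreversible operations otherwise.

```python
# Hypothetical API operations, each declaring its inverse (or None when the
# operation destroys state and cannot be reversed).
OPERATIONS = {
    "create_volume": {"inverse": "delete_volume"},
    "attach_volume": {"inverse": "detach_volume"},
    "delete_volume": {"inverse": None},  # data is gone: irreversible
}

def check_undoable(sequence):
    """Return (ok, offending_ops): the static check run before execution."""
    bad = [op for op in sequence if OPERATIONS[op]["inverse"] is None]
    return (not bad, bad)

def undo_plan(sequence):
    """Inverses applied in reverse order restore the prior state."""
    return [OPERATIONS[op]["inverse"] for op in reversed(sequence)]

ok, bad = check_undoable(["create_volume", "attach_volume"])
print(ok, undo_plan(["create_volume", "attach_volume"]))
# True ['detach_volume', 'delete_volume']
```

The paper goes further by deriving undo workflows with an AI planner rather than a fixed inverse table, but the reverse-order application of inverses shown here is the intuition behind that machinery.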

Available Media

Back to the Future: Fault-tolerant Live Update with Time-traveling State Transfer

Cristiano Giuffrida, Călin Iorgulescu, Anton Kuijsten, and Andrew S. Tanenbaum, Vrije Universiteit, Amsterdam
Awarded Best Student Paper!  

Live update is a promising solution to bridge the need to frequently update a software system with the pressing demand for high availability in mission-critical environments. While many research solutions have been proposed over the years, systems that allow software to be updated on the fly are still far from reaching widespread adoption in the system administration community. We believe this trend is largely motivated by the lack of tools to automate and validate the live update process. A major obstacle, in particular, is represented by state transfer, which existing live update tools largely delegate to the programmer despite the great effort involved.

This paper presents time-traveling state transfer, a new automated and fault-tolerant live update technique. Our approach isolates different program versions into independent processes and uses a semantics-preserving state transfer transaction—across multiple past, future, and reversed versions—to validate the program state of the updated version. To automate the process, we complement our live update technique with a generic state transfer framework explicitly designed to minimize the overall programming effort. Our time-traveling technique can seamlessly integrate with existing live update tools and automatically recover from arbitrary run-time and memory errors in any part of the state transfer code, regardless of the particular implementation used. Our evaluation confirms that our update techniques can withstand arbitrary failures within our fault model, at the cost of only modest performance and memory overhead.

Available Media

The Guru Is In

Harding Room

Project Management: Establishing and Fostering the Basics

Adele Shakal, Metacloud, Inc.

Adele Shakal heads up project and knowledge management at Metacloud, Inc., a cloud solutions company providing Private Cloud as a Service based on OpenStack. She has nearly two decades of experience with IT project management, business process analysis and design, knowledge management, emergency operations and drill planning, business continuity, service management, system administration, and Web technologies. She has been a presenter, roundtable facilitator, and panelist on IT emergency preparedness, Google Apps for Education, project management and technical documentation, and advancing women in computing at CENIC, EDUCAUSE, APRU, USENIX LISA, and CascadiaIT conferences.

Some IT organizations have established project management cultures. Some do not. If you’re in the latter camp and are interested in the potential for project management practices to increase productivity within your organization, please stop by and visit with Adele Shakal.

Adele believes in the genuine value of project management, yet understands that the application of appropriate project management principles can be challenging. She is happy to share her best practices for applying non-invasive techniques that allow IT teams to be more efficient and effective. She’ll also share tips for establishing and fostering project management culture within rapidly changing and growing organizations.

3:30 p.m.–4:00 p.m. Thursday

Break with Refreshments

Thurgood Marshall Ballroom Foyer

4:00 p.m.–5:30 p.m. Thursday

Invited Talks 1

Thurgood Marshall North/East Ballroom

Session Chair: Tim Nelson


A Guide to SDN: Building DevOps for Networks

Rob Sherwood, Big Switch

The networking community is making an increasing amount of noise about software-defined networking (SDN). In an attempt to clarify SDN's increasingly nebulous value proposition, this talk makes the case that software-defined networking simply applies sorely needed, well-known, and standard principles of good software design to the network. So, by asking "what would a programmer do?" to solve network problems, we can derive and make concrete all of SDN's real-world value propositions including improved automation through documented well-structured APIs, higher uptime with automated testing, benefits of refactoring the control/data plane relationship, and increased modularity of network functions. Throughout the talk I will substantiate this claim with real-world examples from my time as a network admin, my work in standards bodies like the Open Networking Foundation, and customer stories from my current day job. Finally, I'll conclude with the point that while network and server admins have historically had disjoint skill sets, SDN presents an opportunity for the two to meet in the middle with a DevOp-style control of both domains.
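
The "what would a programmer do?" framing can be made concrete with a toy sketch. Nothing below is from the talk itself; the policy format, the config syntax, and the check are all hypothetical, but they illustrate the DevOps-style workflow of treating network policy as version-controlled, testable data rather than per-device CLI state.

```python
# Illustrative only: network policy as reviewable, testable data.

ACL = [
    # (action, src_prefix, dst_port)
    ("allow", "10.0.0.0/8", 443),
    ("deny",  "0.0.0.0/0",  23),   # no telnet from anywhere
]

def render_config(acl):
    """Generate device config lines from the policy data (made-up syntax)."""
    return ["%s tcp %s any eq %d" % (action, src, port)
            for action, src, port in acl]

def no_telnet_allowed(acl):
    """Automated check that would run in CI before any deploy."""
    return all(not (action == "allow" and port == 23)
               for action, src, port in acl)

assert no_telnet_allowed(ACL)
print(render_config(ACL)[0])  # allow tcp 10.0.0.0/8 any eq 443
```

The point of the sketch is the workflow, not the syntax: the policy lives in one reviewable place, the rendered config is generated rather than hand-typed, and the safety property is tested automatically before anything touches a device.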

Rob leads standardization and controller software architecture at Big Switch, where he developed and evangelized the emerging OpenFlow standard and network virtualization. He is the current Chair of the ONF's Architecture & Framework Working Group and of all Northbound API activity, and was vice-chair of the ONF Testing & Interoperability Working Group. Rob prototyped the first OpenFlow-based network hypervisor, the "FlowVisor," allowing production and experimental traffic to safely coexist on the same physical network, and is involved in various standards efforts and partner and customer engagements. Rob holds a Ph.D. in Computer Science from the University of Maryland, College Park.

OSv: A New Open Source Operating System Designed for the Cloud

Nadav Har'El, Cloudius Systems Ltd.

Scale-out is the key requirement behind most modern workloads. As a result, cloud deployments run homogeneous clusters of virtual machines, each executing a single application such as NoSQL, Memcache, or front-/backend servers. The operating system that drives these applications within the guest virtual machine no longer performs any of its traditional roles—there is no hardware to manage, and no multiple users or apps. OSv is designed from the ground up to execute a single application on top of a hypervisor, resulting in superior performance and effortless management. Extreme optimizations are performed to minimize the hypervisor overhead, autotune for size, and integrate with the Java Virtual Machine implementation. The OS itself is stateless and offers a simple API to manage any aspect remotely, making it perfect for virtual machine images.

Invited Talks 2

Thurgood Marshall South/West Ballroom

Session Chair: John Looney


Enterprise Architecture Beyond the Perimeter

Jan Monsch and Harald Wagener, Google Security Team, Google

An increasingly mobile workforce and the ubiquity of attacks on client platforms limit the effectiveness of the traditional corporate network perimeter-security model. Beyond Corp is a broad effort to re-architect the delivery of Google corporate computing services, removing privileges granted solely on the basis of network address. The Overcast architecture blueprint is key to this, presenting a model of machine identity, authentication, and inventory-aware authorization. We discuss the background of our work, our general approach, challenges encountered, and future directions.

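
The core idea, granting access based on machine identity and inventory state rather than network address, can be sketched as a toy policy check. The device IDs, inventory attributes, and policy below are hypothetical, not Google's actual implementation.

```python
# Hypothetical sketch of inventory-aware authorization: access decisions
# depend on what the inventory knows about the machine, never on whether
# the request originated inside the corporate network.

INVENTORY = {
    # device_id -> attributes an inventory service might track
    "laptop-1234": {"managed": True, "os_patched": True},
    "laptop-9999": {"managed": False, "os_patched": False},
}

def authorize(device_id, inventory=INVENTORY):
    """Allow only known, managed, patched devices; source IP plays no role."""
    dev = inventory.get(device_id)
    return bool(dev and dev["managed"] and dev["os_patched"])

print(authorize("laptop-1234"))  # True
print(authorize("laptop-9999"))  # False
```

In a perimeter model, both laptops would get the same access once on the corporate LAN; here the unmanaged one is denied regardless of where it connects from.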
Jan is a tech lead in the security operations team and has been designing and driving enterprise security initiatives within Google. His focus at the moment is machine identity and inventory. Prior to joining Google in 2010, he was senior security analyst at Compass Security AG, a leading Swiss security assessment company.

He has a bachelor’s degree in electrical engineering from the Zurich University of Applied Sciences and a master’s degree with honors in security and forensic computing from the Dublin City University.

Drifting into Fragility

Matt Provost, Weta Digital

This talk will look at complex systems failure analysis and how to apply it to system administration through the books Drift into Failure by Sidney Dekker and Antifragile by Nassim Taleb. There will be a focus on using postmortems to gain a better understanding of how complex systems fail and then coming up with strategies based on real-world examples to change the way that we run systems so that they are more understandable, less prone to failure, and easier to repair when outages do occur.

Matt Provost is the Systems Manager at Weta Digital. Weta Digital is a five-time Academy Award–winning visual effects facility in Wellington, New Zealand. The Systems team at Weta is responsible for all of the company's servers, storage, and networking. They run a 49,000-core renderwall. Matt has been a system and network administrator for over 15 years. He has a B.A. from Indiana University, Bloomington.

Papers and Reports

Wilson ABC Room

Session Chair: Casey Henderson, USENIX Association


Live Upgrading Thousands of Servers from an Ancient Red Hat Distribution to 10 Year Newer Debian Based One

Marc Merlin, Google, Inc.

Google maintains many servers and employs a file-level sync method, with applications running in a different partition than the base Linux distribution that boots the machine and interacts with hardware. This experience report first gives insights into how the distribution is set up, and then tackles the problem of doing a difficult upgrade from a Red Hat 7.1 image snapshot with layers of patches to a Debian Testing–based distribution built from source. We will look at how this was achieved as a live upgrade, without ending up with a long "flag day" where many machines are running totally different distributions, which would have made testing and debugging of applications disastrous during a long switchover period.

As a coworker of mine put it, "It was basically akin to upgrading Red Hat 7.1 to Fedora Core 16, a totally unsupported and guaranteed-to-break upgrade, but also switching from rpm to dpkg in the process, and on live machines."

The end of the paper summarizes how we designed our packaging system for the new distribution, as well as how we build each new full distribution image from scratch in a few minutes.

Managing Smartphone Testbeds with SmartLab

Georgios Larkou, Constantinos Costa, Panayiotis G. Andreou, Andreas Konstantinidis, and Demetrios Zeinalipour-Yazti, University of Cyprus

The explosive growth of smartphones with ever-growing sensing and computing capabilities has brought a paradigm shift to many traditional domains of the computing field. Re-programming smartphones and instrumenting them for application testing and data gathering at scale is currently a tedious and time-consuming process that poses significant logistical challenges. In this paper, we make three major contributions: First, we propose a comprehensive architecture, coined SmartLab, for managing a cluster of both real and virtual smartphones that are either wired to a private cloud or connected over a wireless link. Second, we propose and describe a number of Android management optimizations (e.g., command pipelining, screen-capturing, file management), which can be useful to the community for building similar functionality into their systems. Third, we conduct extensive experiments and microbenchmarks to support our design choices, providing qualitative evidence on the expected performance of each module comprising our architecture. This paper also overviews our experiences using SmartLab in a research-oriented setting, as well as ongoing and future development efforts.

YinzCam: Experiences with In-Venue Mobile Video and Replays

Nathan D. Mickulicz, Priya Narasimhan, and Rajeev Gandhi, YinzCam, Inc., and Carnegie Mellon University

YinzCam allows sport fans inside NFL/NHL/NBA venues to enjoy replays and live-camera angles from different perspectives, on their smartphones. We describe the evolution of the system infrastructure, starting from the initial installation in 2010 at one venue, to its use across a dozen venues today. We address the challenges of scaling the system through a combination of techniques, including distributed monitoring, remote administration, and automated replay-generation. In particular, we take an in-depth look at our unique automated replay-generation, including the dashboard, the remote management, the remote administration, and the resulting efficiency, using data from a 2013 NBA Playoffs game.

The Guru Is In

Harding Room

PostgreSQL

Stephen Frost, Resonate

Stephen Frost is a PostgreSQL Major Contributor who implemented the PostgreSQL role system and column-level privileges, and who has made contributions to PL/PgSQL, PostGIS, the Linux kernel, and Debian. He has broad experience with the PostgreSQL authentication and authorization system, Multi-Version Concurrent-Control (MVCC), performance tuning (both system-wide and for specific queries), hacking on PostgreSQL itself, the PostgreSQL community, and PostGIS.

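
For attendees unfamiliar with MVCC, the visibility rule can be sketched as a toy model. This is a deliberate simplification, not PostgreSQL's real algorithm (which also tracks in-progress transactions, a transaction's own writes, hint bits, and command IDs): each row version carries the transaction that created it and, if deleted, the transaction that deleted it, and a snapshot sees only versions created before it and not yet deleted.

```python
# Toy sketch of PostgreSQL-style MVCC visibility (illustrative only).
# A row version is (xmin, xmax): the creating txn ID and, once the row is
# deleted or superseded, the deleting txn ID (None while the row is live).

def visible(version, snapshot_xid, committed):
    """Is this row version visible to a snapshot taken at snapshot_xid?"""
    xmin, xmax = version
    created = xmin in committed and xmin < snapshot_xid
    deleted = xmax is not None and xmax in committed and xmax < snapshot_xid
    return created and not deleted

committed = {100, 101}
row_v1 = (100, 101)   # created by txn 100, superseded by txn 101
row_v2 = (101, None)  # replacement version created by txn 101

# A snapshot taken at xid 102 sees only the new version:
print([visible(v, 102, committed) for v in (row_v1, row_v2)])  # [False, True]
```

Because readers consult version metadata instead of taking locks, readers never block writers and vice versa, which is the property that makes MVCC central to PostgreSQL performance tuning.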
6:30 p.m.–11:00 p.m. Thursday

Evening Activities

Take a look at what's happening this evening at LISA '13.

 

Friday, November 8, 2013

8:30 a.m.–9:00 a.m. Friday

Continental Breakfast

Thurgood Marshall Ballroom Foyer

9:00 a.m.–10:30 a.m. Friday

Invited Talks 1

Thurgood Marshall North/East Ballroom

Session Chair: Adam Oliner


Rethinking Dogma: Musings on the Future of Security

Dan Kaminsky, Chief Scientist, White Ops

Security has become a first-class engineering requirement. But it is not the only such requirement. In this talk, I'm going to consider various sacred cows in security and ask whether we'll still believe in them in a few years. Does the user model make sense now in a world of app servers? Are biometrics better or worse than passwords? Will DJB become the new NIST? Let's talk about the future of actually delivering security to our users.

Dan Kaminsky has been a noted security researcher for over a decade, and has spent his career advising Fortune 500 companies such as Cisco, Avaya, and Microsoft. Dan spent three years working with Microsoft on their Vista, Server 2008, and Windows 7 releases.

Dan is best known for his work finding a critical flaw in the Internet’s Domain Name System (DNS), and for leading what became the largest synchronized fix to the Internet’s infrastructure of all time. Of the seven Recovery Key Shareholders who possess the ability to restore the DNS root keys, Dan is the American representative. Dan is presently developing systems to reduce the cost and complexity of securing critical infrastructure.

Invited Talks 2

Thurgood Marshall South/West Ballroom

Session Chair: Chris McEniry


ZFS for Everyone

George Wilson, Delphix

ZFS was originally released into the open source community as part of OpenSolaris in November 2005. After Oracle acquired Sun Microsystems, the focus on maintaining an open source community quickly diminished, and in August 2010 the public releases of ZFS source code silently stopped. At the same time, a new community was emerging and Open ZFS was born. The Open ZFS community is now flourishing, with new features being developed across a variety of platforms. This talk will go into the technical details of some of the features in Open ZFS and how administrators can utilize each of them.

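
One Open ZFS feature administrators rely on daily, end-to-end checksumming with self-healing on redundant vdevs, can be illustrated with a toy model. This is only a sketch: real ZFS stores checksums in block pointers and repairs disk blocks, not Python byte strings.

```python
# Illustrative sketch of ZFS-style self-healing: every block's checksum is
# stored with its parent, so silent corruption on one mirror side is
# detected on read and repaired from the good copy.
import hashlib

def checksum(data):
    return hashlib.sha256(data).hexdigest()

def read_self_healing(mirror, expected):
    """Return good data, rewriting any mirror side whose checksum fails."""
    good = next(d for d in mirror if checksum(d) == expected)
    for i, d in enumerate(mirror):
        if checksum(d) != expected:
            mirror[i] = good  # repair the bad copy from the good one
    return good

block = b"important data"
mirror = [block, b"bit-rotted!!"]          # side 1 silently corrupted
data = read_self_healing(mirror, checksum(block))
print(data == block, mirror[1] == block)   # True True
```

A filesystem that trusts the drive's own error reporting would have returned the corrupted copy; checksumming above the device is what makes the bit rot detectable at all.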
George Wilson is a software engineer at Delphix, working on filesystems for Delphix's database storage appliance. At Delphix, George has worked on ZFS features such as single-copy ARC, nop writes, and performance enhancements for pools with imbalanced LUNs. Before joining Delphix, George was a senior member of the ZFS kernel development team at Sun Microsystems, working on key features such as LUN expansion, log device removal, and deduplication. He was also the tech lead for the Solaris 10 ZFS integration and developed an in-depth ZFS training course for Sun's field organization.

Manta Storage System Internals

Mark Cavage, Joyent

Manta is a system we have developed that applies the UNIX philosophy—small, well-defined tools and simple ways to combine them—to distributed computing on a multi-tenant object store. This new model of general-purpose compute not only unifies such disparate big data applications as indexing, log analysis, image processing, and video transcoding, but also carries with it a new set of design and implementation challenges.

In this talk, I'll discuss the architectural choices and implementation details, along with lessons learned as we created, deployed, and scaled the system. I'll pay particular attention to the aspects of running and automating the system where we've used Manta to operate and debug Manta, and how you can also use Manta to better operate your own services.

Mark Cavage is a software engineer at Joyent, where he works primarily on distributed systems supporting Joyent's cloud computing suite, and maintains several popular open source projects, such as restify and ldapjs. Prior to joining Joyent, Mark was a senior software engineer with Amazon Web Services, where he was primarily responsible for envisioning and launching the AWS Identity and Access Management product.

Papers and Reports

Wilson ABC Room

Session Chair: John Looney


Challenges to Error Diagnosis in Hadoop Ecosystems

Jim (Zhanwen) Li, NICTA; Siyuan He, Citibank; Liming Zhu, NICTA and University of New South Wales; Xiwei Xu, NICTA; Min Fu, University of New South Wales; Len Bass and Anna Liu, NICTA and University of New South Wales; An Binh Tran, University of New South Wales

Deploying a large-scale distributed ecosystem such as HBase/Hadoop in the cloud is complicated and error-prone. Multiple layers of largely independently evolving software are deployed across distributed nodes on third party infrastructures. In addition to software incompatibility and typical misconfiguration within each layer, many subtle and hard to diagnose errors happen due to misconfigurations across layers and nodes. These errors are difficult to diagnose because of scattered log management and lack of ecosystem-awareness in many diagnosis tools and processes.

We report on some failure experiences in a real world deployment of HBase/Hadoop and propose some initial ideas for better trouble-shooting during deployment. We identify the following types of subtle errors and the corresponding challenges in trouble-shooting: 1) dealing with inconsistency among distributed logs, 2) distinguishing useful information from noisy logging, and 3) probabilistic determination of root causes.
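
Challenge (1), inconsistency among distributed logs, can be illustrated with a toy example: merge per-node logs into one timeline by timestamp, then flag entries whose merged order contradicts known causal relationships. The log contents and clock skew below are invented for illustration.

```python
# Illustrative sketch: merge per-node logs and detect timestamp orderings
# that contradict causality (a consequence logged "before" its cause
# because one node's clock runs behind).

def merge_logs(node_logs):
    """node_logs: {node: [(ts, msg), ...]} -> one timeline sorted by ts."""
    merged = [(ts, node, msg)
              for node, log in node_logs.items()
              for ts, msg in log]
    return sorted(merged)

def flag_skew(timeline, happens_before):
    """Pairs (a, b) where a must precede b but the merged order disagrees."""
    pos = {msg: i for i, (_ts, _node, msg) in enumerate(timeline)}
    return [(a, b) for a, b in happens_before if pos[a] > pos[b]]

logs = {
    "master": [(10.0, "assign region")],
    "worker": [(9.5, "region opened")],   # clock runs behind the master's
}
timeline = merge_logs(logs)
print(flag_skew(timeline, [("assign region", "region opened")]))
# [('assign region', 'region opened')]
```

Naive timestamp merging would report the region as opened before it was assigned; encoding even a few known happens-before relations exposes the skew instead of letting it mislead the diagnosis.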

Installation of an External Lustre Filesystem using Cray esMS management and Lustre 1.8.6

Patrick Webb, Cray Inc.

High performance computing systems need a similarly large-scale storage system in order to manage the massive quantities of data that they produce. The unique aspects of each customer's site mean that the on-site configuration and creation of the filesystem will be unique. In this paper we will look at the installation of multiple separate Lustre 1.8.6 filesystems attached to the Los Alamos National Laboratory ACES systems and their management back-end. We will examine the structure of the filesystem and the choices made during the installation and configuration, as well as the obstacles that we encountered along the way and the methods used to overcome them.

The Guru Is In

Harding Room

Time Management for System Administrators

Thomas A. Limoncelli, Stack Exchange

Thomas A. Limoncelli is an internationally recognized author, speaker, and system administrator. His best-known books include Time Management for System Administrators (O'Reilly) and The Practice of System and Network Administration (Addison-Wesley), for which he shared the SAGE 2005 Outstanding Achievement Award. He works in New York City at StackExchange.com, owner of ServerFault.com. He blogs at EverythingSysadmin.com.

10:30 a.m.–11:00 a.m. Friday

Break with Refreshments

Thurgood Marshall Ballroom Foyer

11:00 a.m.–12:30 p.m. Friday

Panel

Thurgood Marshall North/East Ballroom

Session Chair: Narayan Desai


Futures

Rob Sherwood, Big Switch Networks; Mark Cavage, Joyent; Nadav Har'El, Cloudius Systems Ltd.

Invited Talks 2

Thurgood Marshall South/West Ballroom

Session Chair: Adele Shakal


Apache Hadoop for System Administrators

Allen Wittenauer, LinkedIn, Inc.

Knowledge is power! As a result, the adoption of Apache Hadoop to help mine data as a way to increase knowledge is taking the world by storm. For system administrators, however, it is a large, complicated system that isn't well understood. In this talk, Allen will cover some Hadoop basics from an operations perspective: what it is, how it works, key data points to monitor, metrics that are important to gather, and the secrets to making it work securely and reliably.

Allen Wittenauer has been involved with Apache Hadoop since May 2007, when he was hired by Yahoo! to bring large-scale operational experience to the fledgling project. His work there helped create the basic blueprints that almost all Hadoop deployments follow today. At LinkedIn, his experience provided key insight and a foundation to its award-winning data science team.

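
As one concrete example of the kind of monitoring the talk covers: the HDFS NameNode exposes metrics such as CapacityUsed and MissingBlocks through its /jmx endpoint. The sketch below parses a hand-written stand-in payload (not real NameNode output) and raises an alert on high utilization or missing blocks; the threshold is an arbitrary example value.

```python
# Hypothetical sketch of one Hadoop monitoring task: compute HDFS capacity
# utilization from the kind of JSON the NameNode's /jmx endpoint returns.
# SAMPLE_JMX is a hand-written stand-in, not real NameNode output.

SAMPLE_JMX = {
    "beans": [
        {"name": "Hadoop:service=NameNode,name=FSNamesystem",
         "CapacityTotal": 1000, "CapacityUsed": 870, "MissingBlocks": 0},
    ]
}

def capacity_alert(jmx, threshold=0.80):
    """Return (utilization, should_alert) from an FSNamesystem bean."""
    fs = next(b for b in jmx["beans"] if b["name"].endswith("FSNamesystem"))
    used = fs["CapacityUsed"] / fs["CapacityTotal"]
    return used, used > threshold or fs["MissingBlocks"] > 0

print(capacity_alert(SAMPLE_JMX))  # (0.87, True)
```

In production the payload would come from an HTTP fetch of the NameNode's /jmx URL on a schedule, with the result fed into whatever alerting system the site already runs.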
Optimizing VM Images for OpenStack with KVM/QEMU Fall 2013

Chet Burgess, Senior Director, Engineering, and Brian Wellman, Director, Operations, Metacloud, Inc.

OpenStack and KVM/QEMU support a cornucopia of image and disk formats. With so many options it can be difficult to understand all the various trade-offs of these formats.

We will explore some of the more common image formats and disk formats and their trade-offs, tips and tricks for converting image formats, and working with these images directly. We will dive into how nova, libvirt, and KVM interact with these formats when performing operations such as snapshotting and image resizing. We will look at best practices for configuring the guest operating systems such as login credentials, network configuration, and device/performance optimization.
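
One practical trick when working with images directly: most formats can be identified from their magic bytes. A qcow2 file, for example, begins with the bytes "QFI" followed by 0xfb and a big-endian version number, while a raw image has no header at all. A minimal probe (illustrative; tools like qemu-img info do this properly):

```python
# Illustrative: identify a disk image format from its leading bytes.
# qcow2 headers begin with the magic b"QFI\xfb" followed by a big-endian
# 32-bit version field; raw images are headerless.
import struct

QCOW2_MAGIC = b"QFI\xfb"

def probe_format(header):
    """Guess the image format from the first bytes of the file."""
    if header.startswith(QCOW2_MAGIC):
        version = struct.unpack(">I", header[4:8])[0]
        return "qcow2 (v%d)" % version
    return "raw (or other headerless format)"

print(probe_format(QCOW2_MAGIC + struct.pack(">I", 3)))  # qcow2 (v3)
print(probe_format(b"\x00" * 8))
```

Knowing the real format matters because uploading a qcow2 image to Glance labeled as raw (or vice versa) leads to unbootable instances or wasted space.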

As the Senior Director of Engineering and part of Metacloud’s founding team, Chet Burgess is responsible for system design and ensuring that Metacloud solutions are always available, redundant, and scalable.

Chet’s career in technology began at age 16 when he worked at a VAR building small business networks. Since then he has held a number of positions including Director of Enterprise Unix Systems at the University of Southern California, and Senior Systems Architect at Ticketmaster Entertainment.

Chet is a contributor to the OpenStack Nova project. His passion is working on availability and scalability problems in large-scale systems deployments.

Brian Wellman is the Director of Systems Operations at Metacloud. He is responsible for the implementation, operation, support, and monitoring of Metacloud’s client deployments as well as Metacloud’s internal infrastructure. He has 16 years of experience in designing, implementing, and operating highly available, large scale server deployments and private clouds.

Prior to joining the Metacloud team in 2012, Brian held the role of Director of Web Systems at Ticketmaster Entertainment. His team managed an infrastructure that consisted of thousands of nodes spread across several datacenters worldwide and was responsible for the implementation and operation of Ticketmaster's first private cloud.

Invited Talks 3

Wilson ABC Room

Session Chair: Carolyn Rowland


Managing Macs at Google Scale

Clay Caviness and Edward Eigerman, Google Inc.

Google has one of the largest managed fleets of Macintosh computers in the world. With tens of thousands of assets to manage and an ever-changing security landscape, the organization has had to develop many of its own tools to effectively maintain its fleet and keep its end-users safe and productive. Macintosh Operations is the internal team tasked with developing these tools and managing these machines globally.

Clay Caviness has been a Mac and UNIX systems engineer since the early 90s and was thrilled when NeXT bought Apple. He only recently stopped running BSD on a Quadra at home. Clay has worked in advertising and technology companies since 1996 and currently works for Google in New York City.

Edward Eigerman has worked in IT, primarily with Macs, since 1988. He worked as an engineer for Apple for several years and later as an independent consultant, where his clients included The New York Times, Major League Baseball, NASA, and Jon Stewart. He currently works for Google in New York City.

OS X Hardening: Securing a Large Global Mac Fleet

Greg Castle, Security Engineer, Google Inc.

OS X security is evolving: defenses are improving with each OS release, but the days of “Macs don’t get malware” are gone. Recent attacks against the Java Web plugin have kindled a lot of interest in hardening and managing Macs. So how does Google go about defending a large global Mac fleet? Greg will discuss various hardening tweaks and a range of OS X defensive technologies, including XProtect, Gatekeeper, FileVault 2, sandboxing, auditd, and mitigations for Java and Flash vulnerabilities.

A former pentester, incident responder, and forensic analyst, Greg Castle has been responsible for the security of Google’s OS X fleet for a couple of years, working closely with the Google MacOps team to harden and protect Google’s global Mac fleet. He is now working in Google’s incident response team on the GRR Rapid Response project: Google’s open source incident response framework.

The Guru Is In

Harding Room

*aaS: Building and Maintaining the Cloud

David Nalley, Apache CloudStack

David Nalley is a recovering systems administrator with 10 years of experience. He is a member of the Apache Software Foundation and a Project Management Committee member for Apache CloudStack. David is a frequent author for development, sysadmin, and Linux magazines and speaks at numerous IT conferences.

12:30 p.m.–2:00 p.m. Friday

Lunch, on your own

2:00 p.m.–3:30 p.m. Friday

Invited Talks 1

Thurgood Marshall North/East Ballroom

Session Chair: John Looney


Cloud/IaaS Platforms: I/O Virtualization and Scheduling

Dave Cohen, Office of the CTO, EMC

Cloud-based Infrastructure-as-a-Service (IaaS) platforms present a simple data center resource model to their end users. The end user provisions logical resource units of compute, network, and block storage, then attaches a compute unit to one or more network and block storage units. This "attach" and subsequent "detach" of emulated network and storage devices embodies what is referred to as I/O virtualization: the act of scheduling I/O resources independently of compute. In this session we will outline the I/O virtualization concept and then dig deeper into how cloud/IaaS platforms leverage it for both local and global, data-center-wide scheduling.
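
To make the idea concrete, here is a toy model of the attach/detach lifecycle the abstract describes: a scheduler tracks which block-storage volumes are attached to which compute units, independently of where either resource physically lives. All class and method names below are hypothetical illustrations, not any specific IaaS platform's API.

```python
# Toy model of I/O virtualization: volumes are scheduled
# independently of the compute units they attach to.

class Volume:
    def __init__(self, vol_id):
        self.vol_id = vol_id
        self.attached_to = None  # compute unit currently holding the volume

class Instance:
    def __init__(self, inst_id):
        self.inst_id = inst_id
        self.volumes = {}        # vol_id -> Volume

class IOScheduler:
    """Schedules I/O resources (volumes) independently of compute."""

    def attach(self, instance, volume):
        if volume.attached_to is not None:
            raise ValueError(f"{volume.vol_id} is already attached")
        volume.attached_to = instance
        instance.volumes[volume.vol_id] = volume

    def detach(self, instance, volume):
        if instance.volumes.pop(volume.vol_id, None) is None:
            raise ValueError(f"{volume.vol_id} not attached to {instance.inst_id}")
        volume.attached_to = None

# A volume can move between compute units without either being rebuilt:
sched = IOScheduler()
vm_a, vm_b = Instance("vm-a"), Instance("vm-b")
vol = Volume("vol-1")
sched.attach(vm_a, vol)
sched.detach(vm_a, vol)
sched.attach(vm_b, vol)
print(vol.attached_to.inst_id)  # prints "vm-b"
```

The point of the sketch is that the volume is a first-class, independently scheduled resource: it can be detached from one compute unit and attached to another without rebuilding either side.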

Dave is a strategist and researcher in EMC's Office of the CTO where he focuses on the costs, scalability, and performance of data center infrastructure, especially in a Cloud setting. During his tenure at EMC, he has provided technical leadership in the areas of Network Virtualization, OpenStack, and converged infrastructure platforms such as OpenCompute. Prior to joining EMC, Dave spent ten years on Wall Street working on various facets of data center infrastructure for companies such as Goldman Sachs and Merrill Lynch.

Cluster Management at Google

John Wilkes, Google

Cluster management is the term that Google uses to describe how we control the computing infrastructure in our datacenters that supports almost all of our external services. It includes allocating resources to different applications on our fleet of computers, looking after software installations and hardware, monitoring, and many other things. I'll present an overview of some of these systems and introduce Omega, the new cluster-manager tool we are building. Much of the talk will be about exciting challenges that we're facing along the way, driven by the scale at which we operate, an acute awareness of failures, and the drive to provide ever-better service-levels while curbing complexity. We certainly don't have all the answers, but we do have some pretty impressive systems.

John Wilkes has been at Google since 2008, where he is working on cluster management and infrastructure services. He is interested in far too many aspects of distributed systems, but a recurring theme has been technologies that allow systems to manage themselves. In his spare time he continues, stubbornly, trying to learn how to blow glass. http://e-wilkes.com/john

Invited Talks 2

Thurgood Marshall South/West Ballroom

Session Chair: Chris McEniry


Scaling User Security: Lessons Learned from Shipping Security Features at Etsy

Zane Lackey, Director of Security Engineering, and Kyle Barry, Security Engineering Manager, Etsy

Over the past year, the Etsy Security Engineering Team has been primarily focused on building out new user-facing features to provide proactive protections to our members. On the surface, these features appeared straightforward to implement and roll out; however, we encountered a number of interesting challenges along the way. This talk will provide actionable advice for organizations seeking to ship and support modern security features including full site SSL, two-factor authentication, and account takeover detection. Specifically, we will cover engineering your environment for capacity and resiliency, collecting useful metrics, creating effective anomaly alerts, supporting a global user base, and abstracting away single points of failure with third party providers.

Zane Lackey is the Director of Security Engineering at Etsy and a member of the Advisory Council to the US State Department-backed Open Technology Fund. Prior to Etsy, Zane was a senior security consultant at iSEC Partners.

Kyle Barry is the Security Engineering Manager at Etsy. His work focuses on security and risk engineering for Etsy's internal and user-facing features. Kyle has worked on implementing Etsy's two-factor authentication system for millions of users in over 80 countries. Recently he has been working on solving security issues with big data.

Building Large Scale Services

Jennifer Davis, Yahoo! Senior Grid Service Engineer, SE Tech Lead

Yahoo! Service Engineers (SEs) specialize in bridging the gap between system administration and development. SEs are tasked with delivering a reliable, consistent, high-quality service through the use of best practices. They must understand the network, OS, hardware, and customer use cases, and dive deep into application internals.

In this talk, Jennifer will describe her journey with the Sherpa service at Yahoo! and lessons learned about building a reliable, consistent, and high-quality service from scratch.

This talk aims to educate practitioners on successful strategies and common pitfalls when building out a service.

Jennifer has worked in education, startup, and large-scale environments, which has contributed to a diverse set of experiences. Currently she is a lead senior service engineer at Yahoo! with a focus on customer service experience. She completed her B.S. in Computer Science at Notre Dame de Namur University. She has previously presented on Visualizing Self—Exploring Your Personal Metrics at Velocity and Dungeons and Data at Strata.

Invited Talks 3

Wilson ABC Room

Session Chair: Patrick Cable


Managing Access Using SSH Keys

Tatu Ylönen, SSH Communications Security, and Inventor of SSH

SSH user keys are ubiquitously used by automated processes and system administrators to access information systems. Many large organizations have hundreds of thousands of keys granting access, with many keys providing privileged access without auditing or controls. The talk educates the audience about the risks arising from unmanaged SSH key access; discusses what compliance mandates require; outlines how to establish effective operational processes for provisioning, terminating, and monitoring SSH key-based access; and explains how to understand and remediate SSH user keys in an existing environment.
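
To give a flavor of the remediation work the talk outlines, the first step is usually an inventory of existing key-based trust. The sketch below parses `authorized_keys` entries to list who (or what) can log in; it is a hedged illustration, not the speaker's tooling, and the sample data and key-type list are assumptions.

```python
# Minimal authorized_keys inventory sketch (illustrative only).
# Each entry is optionally prefixed by an options field, followed by
# the key type, the base64 key blob, and a free-form comment.

KEY_TYPES = ("ssh-rsa", "ssh-dss",
             "ecdsa-sha2-nistp256", "ecdsa-sha2-nistp384", "ecdsa-sha2-nistp521")

def parse_authorized_keys(text):
    """Yield (options, key_type, comment) for each key entry."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        tokens = line.split()  # naive: options containing spaces break this
        if tokens[0] in KEY_TYPES:          # no options field present
            options, key_type = "", tokens[0]
            comment = " ".join(tokens[2:])
        else:                               # leading options, e.g. from="..."
            options, key_type = tokens[0], tokens[1]
            comment = " ".join(tokens[3:])
        yield options, key_type, comment

sample = """\
ssh-rsa AAAAB3Nza...key1... backup@batch-host
from="10.0.0.0/8",no-pty ssh-rsa AAAAB3Nza...key2... deploy@ci
"""
for options, key_type, comment in parse_authorized_keys(sample):
    print(key_type, comment, options or "(no restrictions)")
```

Running something like this over every user's `authorized_keys` file across a fleet yields the raw data for the provisioning and termination processes the talk describes: keys with no comment, no source restrictions, or no known owner are the first remediation candidates.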

Mr. Ylönen invented the Secure Shell (SSH) protocol in 1995 and is the founder and CEO of SSH Communications Security. OpenSSH is based on his free 1995 release. He has 29 years of programming and systems management experience, along with an extensive business management background. He co-authored the IETF guidelines on SSH key management for automated access and is a co-author of the upcoming NIST IR guidelines for managing access using SSH keys. Mr. Ylönen and his company have been deeply involved in several SSH key remediation and management projects with some of the leading financial institutions and other enterprises.

Secure Linux Containers

Daniel J Walsh, Red Hat

Linux container technology allows a customer to carve a system into isolated containers and run applications securely within their confines. It facilitates multi-tenancy, which allows IT organizations to take better advantage of the large servers in their datacenters. While multi-tenancy provides flexibility in server resource management, especially for service providers, it introduces additional complexity, particularly around the security of applications and data that reside on the same server. Daniel will discuss resource management, namespacing, and the use of SELinux to tighten the security of Linux containers.

Dan Walsh, aka "Mr. SELinux," has been leading the SELinux effort at Red Hat for over 10 years. Dan works on SELinux userspace and policy for Fedora and RHEL. He has also developed Secure Virtualization and helps provide the security for OpenShift.

The Guru Is In

Harding Room

ZFS in Depth

George Wilson, Delphix

George Wilson is a senior software engineer at Delphix developing features and enhancements to ZFS. During his time at Delphix, George has worked on features such as the single-copy ARC, nop_write, and enhancements to the allocation logic in ZFS. Before joining Delphix, George was a senior member of the ZFS kernel development team at Sun Microsystems, working on key features such as LUN expansion, log device removal, and deduplication. He was also the tech lead for the Solaris 10 ZFS integration and developed an in-depth ZFS training course for Sun's field organization.

3:30 p.m.–4:00 p.m. Friday

Break with Refreshments

Thurgood Marshall Ballroom Foyer

4:00 p.m.–5:30 p.m. Friday

Closing Plenary

Thurgood Marshall Ballroom

PostOps: A Non-Surgical Tale of Software, Fragility, and Reliability

Todd Underwood, Google

Widespread availability of distributed computing creates the fertile foundation for DevOps, a more collaborative approach to deploying and managing systems. Now it is time to take advantage of the inherent malleability of software and move beyond operations entirely. PostOps is a realistic call to end toil, to stop feeding the machine with human blood (and time and effort).

Todd Underwood is a site reliability manager at Google in Pittsburgh, leading several teams of engineers on the money side of the house (ads quality, payments, billing, and shopping). Prior to that, he was in charge of operations, security, and peering for Renesys, a provider of Internet intelligence services; and before that he was CTO of Oso Grande, a New Mexico ISP. He has a background in systems engineering and networking. Todd has presented work related to Internet routing dynamics and relationships at NANOG, RIPE, and various peering forums (Global Peering Forum, LINX, and Switch and Data). He was Chair of the NANOG Program Committee and the RIPE Programme Committee. He is interested in how to make all of this work much, much better.
