Introducing Reliability Toolkit: Easy-to-Use Monitoring and Alerting

Monday, October 29, 2018 - 11:00 am–11:45 am

Janna Brummel and Robin van Zijll, ING

Abstract: 

By definition, SREs are responsible for the reliability of sites, but what if they don’t own any sites themselves? Within ING, the largest bank in the Netherlands, BizDevOps teams are autonomous and responsible for the build and run of their services. In theory, that could make SRE obsolete, right? How can you improve availability for end customers in an environment of engineers with full service ownership? How do you convince without the power to intervene? How do you improve without assigning blame?

We’ll explain how we, a team of 8 SREs among 1700 DevOps engineers, try to improve stability by focussing on software engineering. We created the Reliability Toolkit to help BizDevOps teams with their reliability challenges in white-box monitoring and alerting while minimizing toil.

This talk will explain:

  • Our SRE team's purpose and why we think our approach, with its heavy focus on software engineering, works for our organization
  • The concept of the Reliability Toolkit, with an introduction to its components and their setup (Prometheus, Alertmanager, Grafana, NGINX Log Aggregator, SMS and ChatOps functionalities)
  • How we provision Reliability Toolkit
  • How we convince, onboard and educate BizDevOps teams to use the Reliability Toolkit

We will conclude the talk with a demo of the Reliability Toolkit.

Janna Brummel, ING

Janna is IT chapter lead for the site reliability engineering squad within the Domestic Bank (Retail) for ING in the Netherlands. Her job is to help other teams within the bank learn more about their services' performance and respond more efficiently to incidents. Before this, Janna worked as a business manager and dev engineer for credit card and debit card back-end systems.

Robin van Zijll, ING

Robin is a Site Reliability Engineer at ING and product owner (PO) of the SRE team, with years of experience being on call for all services offered to ING's retail customers.


BibTeX
@conference {221698,
author = {Janna Brummel and Robin van Zijll},
title = {Introducing Reliability Toolkit: {Easy-to-Use} Monitoring and Alerting},
year = {2018},
address = {Nashville, TN},
publisher = {USENIX Association},
month = oct
}
