Brian Sherwin, LinkedIn Corporation
LinkedIn’s production stack consists of thousands of applications with complex dependencies among them. In this environment, when a production issue is caused by one or more misbehaving microservices, finding the culprit can be both challenging and time-consuming.
At LinkedIn, we have built a framework that automates incident correlation by ingesting data about incidents and their associated dependencies to identify the unhealthy microservice(s). This lets us escalate an incident directly to the responsible team, cutting down MTTD/MTTR while improving the quality of life of on-call engineers.
In this talk, we will give a high-level overview of the correlation engine, explain how we perform correlations and how we reduce false positives and increase the accuracy of the correlated results, and finally share lessons learned.
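As a rough illustration of the idea of correlating an incident with dependency data, the sketch below walks a call graph from the impacted service and reports unhealthy services that have no unhealthy dependency of their own. This is not LinkedIn's engine: the function name `find_culprits`, the service names, the graph shape, and the "deepest unhealthy service" heuristic are all assumptions made for the example.

```python
from collections import deque


def find_culprits(dependencies, unhealthy, impacted):
    """Return unhealthy services reachable from `impacted` whose own
    direct dependencies are all healthy (a simple, assumed heuristic)."""
    seen = {impacted}
    queue = deque([impacted])
    culprits = []
    while queue:
        service = queue.popleft()
        downstream = dependencies.get(service, [])
        # A service is a candidate only if nothing it calls is also unhealthy.
        if service in unhealthy and not any(d in unhealthy for d in downstream):
            culprits.append(service)
        for dep in downstream:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return sorted(culprits)


# Hypothetical call graph: each service maps to the services it calls.
dependencies = {
    "frontend": ["feed", "profile"],
    "feed": ["storage"],
    "profile": ["storage"],
    "storage": [],
}
# Hypothetical set of services that currently have active alerts.
unhealthy = {"frontend", "feed", "storage"}

print(find_culprits(dependencies, unhealthy, "frontend"))
```

Here the incident observed at `frontend` would be attributed to `storage`, since `frontend` and `feed` both depend on something that is itself unhealthy. A real system would also need to handle dependency cycles, alert timing, and noisy signals, which this sketch ignores.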
Brian Sherwin, LinkedIn
Brian Cory Sherwin has been a Sr. SRE at LinkedIn since 2012. Brian has had many responsibilities at LinkedIn, including autoremediation, business metric collection and analysis, host-level monitoring, disaster recovery, data center decommissions, and incident command. The common thread among all of these is the need to find a solution to a problem that needed a solution yesterday.
Outside of solving whatever problems get thrown at him, Brian enjoys spending time with his wife and son, coffee, learning Spanish, and classic science fiction.
author = {Brian Sherwin},
title = {Weathering the Storm: How Early Warnings Save the Farm},
year = {2019},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}