Hemanth Malla and Elijah Andrews, Datadog
It all started with a team reaching out because they had DNS issues during rolling updates. Business as usual… Four weeks later, we are reading kernel code to understand the corner cases of dropping Martian packets. Could this be the connection between gRPC client reconnect algorithms and the overflowing conntrack table we can feel but not see? In time, we solved the issue. And for once… it wasn't DNS!
In this talk, we will focus on one of the most complex incidents we have faced in our Kubernetes environment. We will go through the debugging steps in detail, dive deep into the mysterious behaviors we discovered, and explain how we finally resolved the incident by simply removing three lines of code.
Hemanth Malla, Datadog
Hemanth Malla is a Software Engineer working on Kubernetes and container networking at Datadog. Previously he worked on various distributed systems in industries like e-commerce, fintech, and high-frequency trading. Apart from computers, he enjoys all things photography, drones, and dark chocolate.
Elijah Andrews, Datadog
I'm a software engineer at Datadog. I'm currently working on our networks, and previously worked on our data ingestion pipelines. Outside of work, I love playing guitar, going to concerts, and spending time with my cat Bao.
author = {Hemanth Malla and Elijah Andrews},
title = {Logs Told Us It Was {DNS}, It Looked like {DNS}, It Had to Be {DNS}, It Wasn{\textquoteright}t {DNS}},
year = {2023},
address = {Santa Clara, CA},
publisher = {USENIX Association},
month = mar
}