Fault Tree Analysis Applied to Apache Kafka

Friday, 4 October, 2019 - 16:00–16:45

Andrey Falko, Lyft

Abstract: 

This talk should provide a framework for answering the following common questions a Kafka operator or user might have: What should your replication factor be for your Kafka topics? How many partitions should you have? How many consumers should you provision? What should your ISR setting be? Should you use RAID or not?
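
As a rough illustration of how a fault tree can inform a question such as replication factor, the sketch below combines failure probabilities through AND and OR gates. It is not taken from the talk: the function names, the 1% broker failure probability, and the assumption that brokers and partitions fail independently are all hypothetical simplifications chosen for this example.

```python
from math import prod


def and_gate(probabilities):
    """Output fails only if every input fails (independent inputs assumed)."""
    return prod(probabilities)


def or_gate(probabilities):
    """Output fails if at least one input fails (independent inputs assumed)."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical figure, not from the talk: probability that one broker is down.
BROKER_DOWN = 0.01


def partition_unavailable(replication_factor, broker_down=BROKER_DOWN):
    # AND gate: a partition is lost only if all of its replica brokers are down.
    return and_gate([broker_down] * replication_factor)


def topic_unavailable(partitions, replication_factor, broker_down=BROKER_DOWN):
    # OR gate: the topic is degraded if any one partition is unavailable.
    # Treating partitions as independent is a simplification, since real
    # partitions share brokers.
    per_partition = partition_unavailable(replication_factor, broker_down)
    return or_gate([per_partition] * partitions)


if __name__ == "__main__":
    for rf in (1, 2, 3):
        print(rf, topic_unavailable(partitions=12, replication_factor=rf))
```

Under these assumptions, each additional replica multiplies the per-partition failure probability by the broker failure probability, while each additional partition adds another input to the OR gate. A real analysis would replace the independence assumptions with the actual broker placement and measured failure rates.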

Andrey Falko is one of the first Reliability Software Engineers hired at Lyft, where he has been for more than a year. He is currently focused on building and scaling reliable PubSub systems for Lyft's Data Platform. Prior to Lyft, Andrey worked at Salesforce for nine years, where he researched Kafka and Pulsar performance and reliability. While there, he also built an IaaS system, many CI/CD systems, a Zipkin service, and features for the Salesforce platform.
