Sriram Subramanian, Director, Platform & Infra Engineering, Confluent
Apache Kafka is widely recognized as the world's leading real-time, fault-tolerant, highly scalable streaming platform. It is adopted by thousands of companies worldwide, from web-scale companies like LinkedIn, Netflix, and Uber to large enterprises like Apple, Cisco, Goldman Sachs, and more.
In this talk, we will look at what Confluent, with help from the community, has done to enable running Kafka as a fully managed service. Engineers at Confluent have spent multiple years running Kafka as a service and have learned valuable lessons in the process. They discovered how different things are when you run Kafka in a controlled environment inside a single company versus running it for thousands of companies. This talk will go over those lessons and the resulting improvements to Kafka, which are available to all Kafka users as part of Confluent Cloud.
We will cover three key aspects:
- Resiliency – A data system needs to be highly available and should never lose data, and Kafka is no different. We are paranoid about the guarantees Kafka provides, and we have put considerable effort into making Kafka extremely durable and highly available. To achieve this, we have taken various steps, including improving the replication protocol, rewriting the controller, and reducing ZooKeeper-related failures. We will go over each of these improvements.
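As context for the durability guarantees discussed above, the sketch below collects the standard Kafka settings commonly combined to favor durability. The config names are real Kafka configuration keys; the specific values are illustrative, not a statement of what Confluent Cloud uses.

```python
# Illustrative sketch: Kafka settings commonly tuned for durability.
# Keys are standard Kafka configs; values here are examples only.

producer_config = {
    "acks": "all",               # wait for all in-sync replicas to acknowledge
    "enable.idempotence": True,  # prevent duplicates when the producer retries
    "retries": 2147483647,       # keep retrying transient failures
}

topic_config = {
    "replication.factor": 3,     # keep three copies of every partition
    "min.insync.replicas": 2,    # acks=all needs at least 2 live replicas
}

# With acks=all and min.insync.replicas=2, a write is acknowledged only
# once it is on at least two brokers, so losing a single broker loses no data.
```

The interplay between `acks=all` and `min.insync.replicas` is the crux: the producer-side setting alone does not guarantee durability unless the topic also requires a minimum number of in-sync replicas.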
- Observability – To run any data system as a service, we need to be able to measure and alert on key metrics. Based on our operational experience, we have spent a lot of time adding metrics to Kafka to ensure it can be easily monitored. This includes better client, storage, replication, and controller metrics, so that any external alerting system can monitor and alert on them. We will go over some of these metrics in this talk and describe why they are important.
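One of the most commonly alerted-on client-side metrics is consumer lag: the gap between a partition's log-end offset and the consumer group's committed offset. The sketch below shows the arithmetic; the function name and plain-dict inputs are illustrative, not a Kafka API.

```python
# Illustrative sketch: computing per-partition consumer lag.
# Sustained lag growth usually means consumers cannot keep up with producers.

def consumer_lag(log_end_offsets, committed_offsets):
    """Lag per partition = log-end offset minus committed offset.

    Inputs are plain dicts {partition: offset}; this is a sketch,
    not an actual Kafka client call.
    """
    return {
        partition: log_end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in log_end_offsets
    }

# Example: partition 1 is 400 messages behind.
lag = consumer_lag({0: 1000, 1: 1500}, {0: 1000, 1: 1100})
# lag == {0: 0, 1: 400}
```

In practice the two offset maps would come from the broker (log-end offsets) and the consumer group's committed offsets, and an alerting system would fire when lag stays above a threshold.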
- Extensive testing – Confluent runs thousands of tests nightly and has run them for hundreds of hours. Based on this testing, we have been able to proactively identify issues and work with the community to fix them. We will discuss the different tests we write, the types of fault injection we do, and the issues we have identified and fixed in the process. We will also touch on the code-quality and compile-time bug-detection approaches we employ to ensure we build a highly reliable system, which is essential for running Kafka as a service with top-notch SLAs.
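To make the fault-injection idea concrete, here is a minimal self-contained sketch of the pattern such tests follow: wrap an operation so it fails transiently at a controlled rate, drive a retrying client through it, and assert that every record is still delivered. All names (`make_flaky`, `send_with_retries`) are hypothetical; this is not Confluent's test harness.

```python
# Illustrative sketch of a fault-injection test pattern (names hypothetical):
# inject transient failures and assert the retry path still delivers everything.
import random

def make_flaky(operation, failure_rate, rng):
    """Wrap an operation so it raises an injected fault at a given rate."""
    def flaky(*args):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return operation(*args)
    return flaky

def send_with_retries(send, record, attempts=20):
    """Retry a send on injected connection errors, up to a bounded attempt count."""
    for _ in range(attempts):
        try:
            return send(record)
        except ConnectionError:
            continue
    raise RuntimeError("exhausted retries")

# Deterministic run: 30% of sends fail, yet all 100 records arrive in order.
log = []
flaky_append = make_flaky(log.append, failure_rate=0.3, rng=random.Random(42))
for i in range(100):
    send_with_retries(flaky_append, i)
assert log == list(range(100))
```

Real fault-injection tests operate at a coarser grain (killing brokers, partitioning networks, corrupting disks), but the structure is the same: a controlled fault, a system-under-test, and an invariant (no data loss, no reordering) checked afterward.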