What is Apache Kafka
The official definition of Apache Kafka is “a distributed streaming platform”, which starts to make sense only after reading at least a few chapters of its documentation. However, the idea behind it is relatively simple. In large distributed apps we have many services that produce messages: logs, monitoring events, audit entries – any type of records. On the other hand, there’s a similar number of services that consume that data. Kafka brings these parties together: it accepts data from producers, reliably stores it in topics and allows consumers to subscribe to them. In other words, Kafka is a love child of a distributed storage and a messaging system.
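To make that model a bit more concrete, here’s a minimal sketch of a Java producer. The broker address, the ‘events’ topic and the key/value strings are placeholders I picked for the example, not anything prescribed by Kafka:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MinimalProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Append one record to the hypothetical 'events' topic;
            // Kafka stores it, and any number of consumers can subscribe to it
            producer.send(new ProducerRecord<>("events", "user-42", "page_view"));
        } // closing the producer flushes any pending sends
    }
}
```

The other side of the story is a consumer that subscribes to the same topic; we’ll see what that looks like a bit later.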
Kafka’s produce-consume messaging model is the traditional publish-subscribe pattern, which brings up a question: what makes it different from message queues? Well, a lot.
Differences from message queues
- Publish-subscribe is the only pattern Kafka supports. A traditional message queue like, let’s say, RabbitMQ, supports many patterns, including publish-subscribe.
- Kafka keeps messages for a configurable amount of time, regardless of whether or not somebody actually received them. In effect, a new consumer can subscribe to a topic and receive its messages from the past (see the topic configuration sketch after this list). A message queue usually removes a message as soon as somebody has received it.
- Numbers vary, but Kafka outperforms popular message queues like ActiveMQ or RabbitMQ. Some benchmarks report impressive numbers like ~100K messages per second of throughput for Kafka versus ~20K msg/s for RabbitMQ. What’s more, Kafka works equally well with kilobytes and terabytes of data.
- Clustering. Even one Kafka node is a cluster, and adding one more doesn’t require any additional configuration. Not every message queue supports clustering, and the ones that do require additional steps to configure it.
- Kafka messages are always persisted to disk. Message queues that do support durability usually have it disabled by default (e.g. MSMQ, RabbitMQ).
- Availability. Kafka’s topics can be sharded into partitions within the cluster, and those partitions can also be replicated for high availability. Although some MQs support replication, I can’t come up with any that can shard its queues.
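As a rough illustration of the retention, partitioning and replication points above, here’s how a topic with 3 partitions, 2 replicas per partition and a one-week retention could be created programmatically with Kafka’s AdminClient. The topic name and broker address are assumptions for this sketch:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions spread the topic across the cluster, 2 replicas keep it available
            NewTopic topic = new NewTopic("user-activity", 3, (short) 2)
                    .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG, "604800000")); // keep messages for 7 days
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```

The same thing can be done from the command line with the kafka-topics tool that ships with Kafka.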
How can we use it
Well, in some scenarios Kafka could replace a message queue and just deliver messages from one part of the app to another. After all, it supports the publish-subscribe pattern, its throughput is beyond compare, it’s durable and insanely scalable, so that would definitely work. But unless your app sends millions of messages, that would be using a sledgehammer to crack a nut.
With Kafka we could aggregate user activity on a web site: clicks, page reloads, searches – whatever we can collect. In fact, that was the problem LinkedIn tried to solve when it came up with Kafka. With such data organized in one place in topics, other services like analytics, real-time processors or permanent storage can subscribe to it.
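The subscribing side could look like the following sketch of a consumer reading a hypothetical ‘user-activity’ topic (the group name, topic name and broker address are again placeholders):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ActivityConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "analytics");               // each subscribing service uses its own group
        props.put("auto.offset.reset", "earliest");       // a new subscriber starts from the oldest retained message
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("user-activity"));
            while (true) {
                // Poll the topic and print whatever activity records arrive
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```

Each subscribing service – analytics, a real-time processor, an archiver – would use its own group.id and read the same stream independently.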
Another use case for Kafka is log aggregation. Instead of maintaining log data in physical files on separate servers, we can make log sources act as producers and send the data to a central hub – Kafka. Suddenly, individual log sources turn into a unified stream of events we can subscribe to, store and analyze.
Logs aren’t the only thing we can collect. Monitoring events, like memory consumption or thread count, maintenance events, like reboots or upgrades, even exception events – the data’s source and location don’t matter, it’s all just a stream of data that Kafka can aggregate.
Finally, along with moving data from one place to another, Kafka can do some processing as well. The Streams API allows us to create stream processors, which take data from one topic, process it (enrich, filter, transform, whatever) and publish it into another. For example, such a processor could subscribe to a stream of logs, analyze its flow, detect an upcoming fault and publish an event into a ‘possible fault’ topic.
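Here’s a deliberately simplified sketch of such a processor using the Streams API. Instead of real fault prediction it just filters log lines containing ‘ERROR’ from a ‘logs’ topic into a ‘possible-fault’ topic; both topic names and the filter condition are made up for the example:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class FaultDetector {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fault-detector");    // placeholder application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read the 'logs' topic, keep only lines that look like errors, publish them to 'possible-fault'
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> logs = builder.stream("logs");
        logs.filter((host, line) -> line.contains("ERROR"))
            .to("possible-fault");

        new KafkaStreams(builder.build(), props).start();
    }
}
```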
Conclusion
Apache Kafka is a highly performant tool that can move data from one place to another, with or without processing. It’s mature and widely adopted by giants like LinkedIn, Netflix, Yahoo, Twitter, Pinterest and many others. However, just reading about how good something is probably isn’t enough, so next time we’ll grab the installer and do ourselves some data streaming.