I’ve been teaching Kafka at companies that don’t have, and won’t ever have, what you’d call textbook Big Data problems. As a result, my students ask me whether Kafka is appropriate for their use cases. Put another way: is Kafka only a Big Data tool?
For most Big Data technologies, the answer is a clear pass/fail: if you don’t have a Big Data problem now and won’t have one in the future, don’t use technologies like Apache Hadoop or Apache Spark, because their technical and operational overhead immediately negates any other benefits. Using Big Data tools on small data isn’t just massive overkill; it wastes a lot of time and money.
For Kafka, it’s different. I define Kafka as a distributed publish/subscribe system, and companies without clear Big Data problems are getting real value from it because they can take advantage of Kafka’s other interesting features.
Here are some of the pros I see for using Kafka with small data:
- All data can be replicated to more than one computer
- Kafka removes single points of failure for the brokers
- Kafka removes single points of failure for consumers with consumer groups
- Consumers can move freely through the commit log and go back in time
- Consumers don’t miss data during downtime because the data is retained on the brokers
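The replay property is worth illustrating, because it’s what separates Kafka from a fire-and-forget message queue. Below is a minimal in-memory sketch of the commit-log idea; the `CommitLog` class is hypothetical (not part of any Kafka client API), but it captures why a consumer can rewind and re-read: records are appended at increasing offsets and retained, not deleted on read.

```python
# Minimal in-memory sketch of Kafka's commit-log/replay idea.
# The CommitLog class is hypothetical -- it is not a Kafka API --
# but it models why consumers can "go back in time."

class CommitLog:
    def __init__(self):
        self._records = []           # records are retained, not deleted on read

    def append(self, record):
        offset = len(self._records)  # each record gets a monotonically increasing offset
        self._records.append(record)
        return offset

    def read_from(self, offset):
        # Any consumer can start reading from any retained offset.
        return self._records[offset:]

log = CommitLog()
for event in ["signup", "login", "purchase"]:
    log.append(event)

# A consumer that was down during the first two events still sees them:
print(log.read_from(0))   # ['signup', 'login', 'purchase']
# A consumer can also rewind and reprocess only the tail:
print(log.read_from(2))   # ['purchase']
```

In real Kafka, the same rewind is done by seeking a consumer to an earlier offset; the broker keeps the data for its configured retention period regardless of who has read it.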
Here are some of the cons I see for using Kafka compared to a traditional small data pub/sub:
- The programmatic API is more complex than most pub/sub APIs
- It’s conceptually more complex (e.g. partitions and offsets) than most alternatives
- Ordering is no longer global; it’s only guaranteed within a partition
- Consumer groups need to handle state transitions (e.g. rebalances) when failures occur
- Fewer people have Kafka skills, so you’ll probably need to train your team
- Operationally, there are more processes to monitor
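The ordering caveat is the one that most often surprises newcomers. The sketch below is my own simplified model, not Kafka’s actual partitioner (real producers hash the record key with murmur2; I use `zlib.crc32` as a deterministic stand-in): records with the same key always land in the same partition, so per-key order survives, but there is no single ordering across partitions.

```python
import zlib

# Simplified model of key-based partitioning. Kafka producers hash the
# record key (murmur2) to pick a partition; zlib.crc32 is a deterministic
# stand-in here, and the partition dict is a hypothetical illustration.

NUM_PARTITIONS = 3
partitions = {p: [] for p in range(NUM_PARTITIONS)}

def send(key, value):
    # Same key -> same partition, so ordering per key is preserved.
    p = zlib.crc32(key.encode()) % NUM_PARTITIONS
    partitions[p].append((key, value))
    return p

for v in ["a-1", "a-2", "a-3"]:
    send("user-a", v)
for v in ["b-1", "b-2"]:
    send("user-b", v)

# Within user-a's partition, user-a's events are still in send order:
p_a = zlib.crc32(b"user-a") % NUM_PARTITIONS
a_events = [v for k, v in partitions[p_a] if k == "user-a"]
print(a_events)   # ['a-1', 'a-2', 'a-3'] -- per-key order survives
# But no single global ordering exists across all partitions.
```

If your application truly needs global ordering, you’re limited to a single partition, which gives up most of Kafka’s parallelism.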
With these pros and cons in mind, you can choose between Kafka and your small data pub/sub of choice. If the pros are compelling and outweigh the cons, start looking at Kafka. If the cons outweigh the pros, you’re probably better off with your small data pub/sub.