I spoke at GOTO Chicago last week with Martin Fowler. He gave a keynote on The Many Meanings of Event-Driven Architecture. It wasn’t specific to any particular technology. In this post, I’m applying some of his points to Kafka and Big Data.
Kafka is often used to event changes. These events serve as notifications to other systems that some information changed or that an action was performed.
Teams ask me how much information should be sent. Should it just have the action and what field changed? Should it give more information about the action?
My general answer is to figure out how expensive a lookup is. If your database lookup is expensive and you can save X number of downstream lookups, you’re coming out ahead. In those cases, I suggest eventing more information. The event would contain the action performed, the data before the action, and the data after the action was performed.
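Here’s a minimal sketch of what a "fat" event like that could look like. The field names (`action`, `key`, `before`, `after`) and the JSON encoding are my assumptions for illustration, not a Kafka standard, and the helper names are hypothetical. The resulting bytes would be the value you hand to whatever producer client you use.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ChangeEvent:
    """Hypothetical event shape: the action plus the data before and after it."""
    action: str              # e.g. "create", "update", "delete"
    key: str                 # id of the row or entity that changed
    before: Optional[dict]   # state before the action (None for a create)
    after: Optional[dict]    # state after the action (None for a delete)


def build_change_event(action, key, before, after) -> bytes:
    """Serialize the event to JSON bytes, ready to be used as a message value."""
    event = ChangeEvent(action=action, key=key, before=before, after=after)
    return json.dumps(asdict(event), sort_keys=True).encode("utf-8")


def changed_fields(event_bytes: bytes) -> list:
    """Downstream helper: work out what changed without a database lookup."""
    event = json.loads(event_bytes)
    before = event["before"] or {}
    after = event["after"] or {}
    all_fields = before.keys() | after.keys()
    return sorted(f for f in all_fields if before.get(f) != after.get(f))


payload = build_change_event(
    "update",
    "customer-42",
    before={"email": "ada@example.com", "tier": "free"},
    after={"email": "ada@example.com", "tier": "paid"},
)
```

Because the event carries both the before and after data, a consumer can call `changed_fields(payload)` and never touch the source database. The tradeoff is a larger message, which is why the cost of the lookup is the deciding factor.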
Eventually Consistent Databases
Martin talked about another event architecture that I haven’t seen in Big Data architectures. In it, events move serially through different databases in an eventually consistent manner. One system makes a change to its database, and then another system takes all, or a subset, of those changes and adds them to its own local database.
A more common pattern with Kafka is to have every database pull its updates from Kafka directly. From there, each database can choose all or a subset of the changes.
I’m not saying you shouldn’t have multiple databases. In fact, I encourage it in certain cases. I’m saying you should update all of the databases from Kafka. I’ve seen the serial, database-to-database architecture lead to siloing, and that’s what we’re trying to avoid with Big Data.
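To make the pattern concrete, here’s a small sketch that uses a plain Python list as a stand-in for a Kafka topic, so it runs without a broker. The `LocalDatabase` class, the `wants` filter, and the event fields are all hypothetical names I made up for illustration. With real Kafka, each database would be fed by its own consumer group reading the same topic.

```python
# Stand-in for a Kafka topic: an ordered, append-only list of events.
topic = [
    {"entity": "customer", "key": "c1", "after": {"name": "Ada"}},
    {"entity": "order", "key": "o1", "after": {"total": 30}},
    {"entity": "customer", "key": "c2", "after": {"name": "Lin"}},
]


class LocalDatabase:
    """A downstream store that reads the shared log and keeps what it wants."""

    def __init__(self, name, wants):
        self.name = name
        self.wants = wants   # predicate choosing all or a subset of changes
        self.rows = {}       # key -> latest data for that key
        self.offset = 0      # position in the log, tracked per database

    def poll(self, log):
        """Apply every unread event that passes this database's filter."""
        for event in log[self.offset:]:
            if self.wants(event):
                self.rows[event["key"]] = event["after"]
        self.offset = len(log)


# Both databases read the same log directly; neither feeds the other.
search_index = LocalDatabase("search", wants=lambda e: e["entity"] == "customer")
warehouse = LocalDatabase("warehouse", wants=lambda e: True)

for db in (search_index, warehouse):
    db.poll(topic)
```

Each database keeps its own offset and its own filter, so adding another one doesn’t require changes to any of the existing ones.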
The Importance of Schema
When I teach Kafka, I stress the importance of schema. I tell the class that using or not using schema won’t manifest as a failure in the early phases of the project. You’re not going to see the need for it until 6-12 months later, as you start to make changes to data and code.
My metric for success with schema is around data changes. If you make a change 6 months after release, how many projects have to change? If every project has to be updated due to a data format change, you’ve failed at schema. If only projects that need the new data change, then you’re doing schema correctly.
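Here’s a rough sketch of that metric in code. It’s plain Python, not a real schema library: `read_record`, the `REQUIRED` sentinel, and the two example schemas are hypothetical. The idea it imitates, a reader that ignores fields it doesn’t know and fills defaults for fields that are missing, is what a format with schema resolution such as Avro gives you.

```python
REQUIRED = object()  # sentinel: the field has no default and must be present


def read_record(record: dict, reader_schema: dict) -> dict:
    """Project a record onto a reader's schema.

    Fields the reader doesn't declare are ignored; missing fields fall back
    to the reader's default, or raise if the field is required.
    """
    projected = {}
    for field, default in reader_schema.items():
        if field in record:
            projected[field] = record[field]
        elif default is REQUIRED:
            raise KeyError(f"record is missing required field {field!r}")
        else:
            projected[field] = default
    return projected


# An existing project that only cares about billing data.
BILLING_SCHEMA = {"customer_id": REQUIRED, "email": REQUIRED}
# A newer project that needs the field added 6 months after release.
LOYALTY_SCHEMA = {"customer_id": REQUIRED, "loyalty_tier": "none"}

old_record = {"customer_id": "c1", "email": "ada@example.com"}
new_record = {"customer_id": "c1", "email": "ada@example.com", "loyalty_tier": "gold"}

billing_view = read_record(new_record, BILLING_SCHEMA)   # unaffected by the new field
loyalty_view = read_record(old_record, LOYALTY_SCHEMA)   # old data still readable
```

The billing project never changes when `loyalty_tier` is added, and the loyalty project can still read data written before the field existed. Only the project that needs the new data had to change.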
Really Large Updates
Some use cases require eventing really large updates (>1MB). Should those be sent through Kafka?
These questions are very use case specific. Sometimes, I’ll suggest that the event carry only the row/key id for a NoSQL database, or the path to the file in question, along with what changed. This way, the entire row or file doesn’t need to be sent through Kafka.
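A minimal sketch of that approach, assuming a dict as a stand-in for the NoSQL database or file system. The function names, the example path, and the event fields are hypothetical.

```python
import json

# Stand-in for the external store (NoSQL database, HDFS, object store, ...).
blob_store = {}


def publish_reference(path: str, content: bytes, changed: str) -> bytes:
    """Write the large payload out-of-band and return a small event pointing at it."""
    blob_store[path] = content
    event = {"path": path, "changed": changed, "size_bytes": len(content)}
    return json.dumps(event).encode("utf-8")


def resolve(event_bytes: bytes) -> bytes:
    """Consumer side: follow the reference to fetch the actual payload."""
    event = json.loads(event_bytes)
    return blob_store[event["path"]]


big_file = b"x" * 2_000_000  # roughly 2MB, too large to event comfortably
event = publish_reference("/data/reports/q3.bin", big_file, changed="regenerated")
```

The event stays tiny no matter how big the file is. The cost is that consumers need access to the external store, and the payload itself never passes through Kafka.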
Other companies have broken the file up into parts and published the parts into Kafka. Using some metadata, the files get reassembled at read time.
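Here’s one way the split and reassembly could look. The metadata fields (`file_id`, `part`, `total_parts`, `sha256`) are my assumption of what "some metadata" might contain, not a description of any particular company’s format. In practice you’d likely key every part by `file_id` so they all land in the same partition.

```python
import hashlib
import uuid


def split_into_parts(payload: bytes, part_size: int) -> list:
    """Break a payload into parts, each carrying the metadata needed to rebuild it."""
    file_id = str(uuid.uuid4())
    checksum = hashlib.sha256(payload).hexdigest()
    chunks = [payload[i:i + part_size] for i in range(0, len(payload), part_size)]
    return [
        {
            "file_id": file_id,           # ties the parts of one file together
            "part": index,                # position of this part, starting at 0
            "total_parts": len(chunks),   # lets the reader detect missing parts
            "sha256": checksum,           # checksum of the whole file
            "data": chunk,
        }
        for index, chunk in enumerate(chunks)
    ]


def reassemble(parts: list) -> bytes:
    """Rebuild the payload at read time, regardless of the order parts arrive in."""
    if not parts:
        raise ValueError("no parts to reassemble")
    total = parts[0]["total_parts"]
    by_index = {p["part"]: p["data"] for p in parts}
    if set(by_index) != set(range(total)):
        raise ValueError("missing one or more parts")
    payload = b"".join(by_index[i] for i in range(total))
    if hashlib.sha256(payload).hexdigest() != parts[0]["sha256"]:
        raise ValueError("checksum mismatch after reassembly")
    return payload


original = bytes(range(256)) * 10              # 2,560 bytes of sample data
parts = split_into_parts(original, part_size=1000)
rebuilt = reassemble(list(reversed(parts)))    # order doesn't matter
```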
These decisions about out-of-band changes affect replayability.
Replayability is the ability to take the events and recreate a database’s or state store’s state. This means that every single mutation has to go through Kafka. If a mutation occurs out-of-band or indirectly, that change can’t be recreated or replayed.
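A small sketch shows why. The event shape and the `apply_event` and `replay` helpers are hypothetical, and a list stands in for the Kafka topic.

```python
def apply_event(state: dict, event: dict) -> None:
    """Apply a single mutation event to a state store."""
    if event["action"] == "delete":
        state.pop(event["key"], None)
    else:  # "create" and "update" both set the latest data for the key
        state[event["key"]] = event["after"]


def replay(events: list) -> dict:
    """Recreate the state from scratch by applying every event in order."""
    state = {}
    for event in events:
        apply_event(state, event)
    return state


events = [
    {"action": "create", "key": "c1", "after": {"tier": "free"}},
    {"action": "update", "key": "c1", "after": {"tier": "paid"}},
    {"action": "create", "key": "c2", "after": {"tier": "free"}},
    {"action": "delete", "key": "c2", "after": None},
]

live_state = replay(events)
# An out-of-band mutation: someone writes to the database directly,
# so no event for it ever reaches the log.
live_state["c3"] = {"tier": "trial"}

rebuilt_state = replay(events)  # what you get back after a replay
```

After the replay, `c3` is gone. The log never saw that write, so there’s nothing to recreate it from.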
The need for replayability is very use case specific. If you do have one, you’ll need to make sure that all mutations happen as events.