Designing data for consumption in a Kafka topic requires more forethought. Instead of the messages being consumed point to point, there are many different consumers.
You will need to decide on:
- Name
- Schema
- Contents
- Key/Ordering
- Number of Partitions
- Number of Replicas
Name
The choice of a topic name shouldn’t be difficult. I suggest using a name that is descriptive and as long as necessary.
Don’t hardcode the name all over the place in your code. It’s a common early bug to misspell the topic name in several different places. I suggest using a class that exposes topic names as public static final Strings.
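As a minimal sketch of that suggestion, a constants class might look like the following. The class name and the topic names are hypothetical, chosen only for illustration.

```java
// Central place for topic names so producers and consumers never retype the string.
// The class name and topic names below are hypothetical examples.
final class Topics {
    public static final String CUSTOMER_ORDERS = "customer-orders";
    public static final String PAYMENT_EVENTS = "payment-events";

    private Topics() {
        // Constants holder; not meant to be instantiated.
    }
}
```

Code then refers to Topics.CUSTOMER_ORDERS instead of a string literal, so a misspelling becomes a compile error rather than a silently wrong topic.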
Schema
You might have noticed that I broke out the actual schema of the topic apart from the contents or payload. A topic’s schema is different than the actual data sent.
Some examples of schema formats are JSON and Apache Avro. Don’t use XML as your post-ETL schema. Avro is the recommended format for post-ETL schema, although some organizations choose to use JSON. There are some big benefits to using a binary format such as Avro: the messages are more compact, and the schema is defined explicitly and can evolve over time.
Contents
The contents are the actual payload of the message. Sometimes this includes the key, but it is primarily the value.
When you’re deciding on the contents of the value, remember not to focus on the first use case. Remember that data usage will grow over time and it’s easy to add more consumers. Other teams can, and will, write new consumers and require new data.
If the value’s contents are designed for one use case, you might have to change it for a new use case. My general suggestion is to add the data that makes sense to the value, even if all of the fields aren’t being used.
Key/Ordering
By default, your choice of key affects ordering in Kafka. Messages with the same key go to the same partition, and Kafka only guarantees ordering within a partition. In a worst-case scenario, choosing the wrong key could make a future consumer’s use case impossible. It’s important to choose a key that makes sense given the value’s contents.
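To make the link between key and ordering concrete, here is a rough sketch of hash-based partitioning. It is a stand-in, not Kafka's code: Kafka's default partitioner uses a murmur2 hash of the key bytes, while this sketch uses Arrays.hashCode so it stays self-contained. The class name, method name, and keys are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class KeyOrderingSketch {
    // Stand-in for hash-based partitioning: hash the key bytes, then take the
    // remainder by the partition count. Kafka uses murmur2; this hash is an assumption.
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return Math.floorMod(Arrays.hashCode(keyBytes), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 6;
        // The two "customer-1" messages land in the same partition, so their
        // relative order is preserved. Nothing orders them against "customer-2".
        String[] keys = {"customer-1", "customer-2", "customer-1"};
        for (String key : keys) {
            System.out.println(key + " -> partition " + partitionFor(key, numPartitions));
        }
    }
}
```

If a future consumer needs all of a customer's events in order, the customer ID has to be the key (or part of it). Keying by something else, such as an order ID, would scatter that customer's events across partitions.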
Number of Partitions
Partitions are how Kafka breaks down a topic into smaller pieces. Choosing the number of partitions affects the scalability of the topic for consumers.
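The scalability limit comes from how a consumer group works: each partition is read by at most one consumer in the group, so consumers beyond the partition count sit idle. The sketch below uses a simplified round-robin assignment to show this. Kafka's real assignors are more involved, and the class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class ConsumerScalingSketch {
    // Simplified round-robin assignment: every partition goes to exactly one consumer.
    // This is an illustration of the parallelism limit, not Kafka's actual assignor.
    static List<List<Integer>> assign(int numPartitions, int numConsumers) {
        List<List<Integer>> assignments = new ArrayList<>();
        for (int c = 0; c < numConsumers; c++) {
            assignments.add(new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            assignments.get(p % numConsumers).add(p);
        }
        return assignments;
    }

    public static void main(String[] args) {
        // With 4 partitions and 6 consumers, the last two consumers get nothing to read.
        List<List<Integer>> result = assign(4, 6);
        for (int c = 0; c < result.size(); c++) {
            System.out.println("consumer " + c + " -> partitions " + result.get(c));
        }
    }
}
```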
It’s outside the scope of this post to say how to choose the number of partitions. You can increase the number of partitions later on, but doing so changes which partition a given key maps to, so I highly suggest spending the time to figure out the right number of partitions up front.
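One reason to settle the partition count early is that, with hash-based key partitioning, adding partitions can move existing keys to different partitions. The sketch below illustrates the effect with a stand-in hash. Kafka's default partitioner uses murmur2, not Arrays.hashCode, and the class name, method name, and keys are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class RepartitionSketch {
    // Stand-in for hash-based partitioning; the hash function is an assumption,
    // chosen only so the example needs nothing beyond the standard library.
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return Math.floorMod(Arrays.hashCode(keyBytes), numPartitions);
    }

    public static void main(String[] args) {
        String[] keys = {"a", "customer-1", "customer-2", "customer-3"};
        int before = 6;
        int after = 8;
        for (String key : keys) {
            int oldPartition = partitionFor(key, before);
            int newPartition = partitionFor(key, after);
            String note = (oldPartition == newPartition) ? "unchanged" : "moved";
            System.out.println(key + ": " + oldPartition + " -> " + newPartition + " (" + note + ")");
        }
    }
}
```

A key that moves will have its older messages in one partition and its newer messages in another, so per-key ordering no longer holds across the change.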
Number of Replicas
The number of replicas comes down to: how important is your data? If you don’t really care about the data, go for 1 replica. If you remotely care about your data, make sure you have 3 replicas.