This post will focus on the key differences a Data Engineer or Architect needs to know between Apache Kafka and Amazon Kinesis.
Cloud vs DIY
Some of the contenders for Big Data messaging systems are Apache Kafka, Amazon Kinesis, and Google Cloud Pub/Sub (discussed in this post). While similar in many ways, there are enough subtle differences that a Data Engineer needs to know about. These differences range from nice-to-know details to we'll-have-to-switch deal-breakers.
Kinesis is a fully-managed stream-processing service that's available on Amazon Web Services (AWS). If you're already using AWS or you're looking to move to AWS, that isn't an issue. Kinesis doesn't offer an on-premises solution.
Apache Kafka is an open source, distributed publish/subscribe system. It was originally created at LinkedIn and then open sourced.
Kafka does have the leg up in this comparison. It can be installed as an on-premises solution or in the cloud. I’ve trained at companies using both of these approaches.
Kafka can store as much data as you want. The actual retention period is a business and cost decision rather than a technical one. I've seen companies store between 4 and 21 days of messages in their Kafka clusters.
Kinesis stores messages for 24 hours by default, or up to 7 days with a configuration change. You can't configure Kinesis to store more than 7 days. You can, however, use the Kinesis VCR pattern to replay data stored in S3 through the Kinesis API. This way, you can work with data for longer periods without hitting the 7-day limitation.
All operational parts of Kafka are your purview. That’s good and bad. With great power will come many knobs to turn. All of the installation, setup, and configuration are your job. Confluent has an administrator course to learn the various ins and outs of Kafka you’ll need to know. One big part of the operational puzzle is disaster recovery and replication. Kafka calls this mirroring and uses a program called MirrorMaker to mirror one Kafka cluster’s topic(s) to another Kafka cluster.
Kinesis is a cloud service, so there isn't anything you need to do operationally, including replication and scaling. AWS is responsible for maintaining Kinesis' uptime. On the replication side, all messages are automatically replicated to three availability zones. Most people keep their Kinesis data in a single region, but you can use AWS Lambda or write a consumer to replicate to another region.
As of Kafka 0.9, there is configurable support for authentication (via Kerberos) and encryption in transit. Encryption at rest is the responsibility of the user.
Kinesis Streams can be encrypted at rest by choosing an encryption key as a configuration property, or messages can be encrypted programmatically at the producer and decrypted programmatically at the consumer. The recommended approach for the programmatic route uses AWS' Key Management Service (KMS).
There is a big difference in the operational overhead of scaling between Kafka and Kinesis. If your Kafka cluster has enough resources, scaling could mean adding more partitions to a topic. If your Kafka cluster doesn’t have enough resources, you will need to install and configure another broker, then add more partitions. With Kinesis, scaling will always be an API call to increase the number of shards.
The number of shards in Kinesis is changed far more often than the number of partitions in Kafka. This is because Kinesis' pricing model is based on the number of shards. As a result, companies will change their number of shards to optimize costs based on demand.
Coding and API
Kafka has its own API for creating producers and consumers. These APIs are written in Java and wrap Kafka’s RPC format. There are other languages that have libraries written by the community and their support/versions will vary. Confluent has created and open sourced a REST proxy for Kafka.
In 0.9 and 0.10, Kafka started releasing APIs and libraries to make it easier to move data around with Kafka: Kafka Connect and Kafka Streams. Kafka Connect focuses on moving data into or out of Kafka. Kafka Streams focuses on processing data already in Kafka and publishing it back to another Kafka topic.
Kafka's consumers are pull. Data is only retrieved during a poll() call.
Kinesis has a REST interface, and Amazon provides client libraries in 8 different languages. These libraries wrap the REST calls in a more language-friendly way and make programming easier.
Kinesis’ consumers are pull. Data is only retrieved during a GetRecords call.
Kinesis producers and consumers have various limits that you should know about. Here are a few highlights. The maximum message size is 1 MB, whereas Kafka messages can be larger. On the read side, each shard supports up to 5 read transactions per second and up to 2 MB per second; that means a consumer can only poll a shard once per 200 ms. On the write side, each shard accepts up to 1,000 records per second. As you get close to hitting these limits, you would add more shards.
Note: Kafka doesn’t have explicit limitations on performance, but is limited by the underlying hardware performance.
Similar to Kafka’s partitions, Kinesis shards directly affect the application’s performance and scalability. Care should be taken by the developers to verify that the application has enough shards for the use case at a given time.
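To make the shard-sizing exercise concrete, here is a minimal sketch that estimates the minimum shard count from the per-shard limits above. The 1 MB/s per-shard write throughput limit is an assumption not stated above; the function name and example numbers are hypothetical.

```python
import math

def shards_needed(write_records_per_sec, write_mb_per_sec, read_mb_per_sec):
    """Estimate the minimum Kinesis shard count for a workload, assuming
    per-shard limits of 1,000 records/s and 1 MB/s writes, 2 MB/s reads."""
    by_write_records = write_records_per_sec / 1000.0
    by_write_bytes = write_mb_per_sec / 1.0   # assumed 1 MB/s write limit per shard
    by_read_bytes = read_mb_per_sec / 2.0
    return max(1, math.ceil(max(by_write_records, by_write_bytes, by_read_bytes)))

# e.g. 3,500 records/s averaging 0.5 KB each, with one consumer reading everything
print(shards_needed(3500, 3500 * 0.5 / 1024, 3500 * 0.5 / 1024))  # 4
```

In this example the 1,000 records/s write limit is the binding constraint, which is common for workloads with many small messages.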
AWS has started adding on to its Kinesis offering. Kinesis Firehose offloads data into other technologies such as S3. Kinesis Analytics allows you to run SQL in real-time on a Kinesis Stream. There are Kinesis Agents to move files into Kinesis.
Comparing prices between a cloud service and Kafka is difficult. For this I’ll mostly focus on Kinesis’ pricing model.
Kinesis is priced per shard hour, per PUT (record produce) and for the extended storage of the messages. If you don’t store Kinesis data beyond the 24 hour default, you won’t be charged for storage.
In essence, Kinesis pricing is per shard and Kafka pricing is per node.
With Kinesis, pricing per shard allows companies to optimize their spend at a more granular level. For example, if you had a low point during the day, you could go down to 1 shard and save money. This sort of scaling is only possible in a cloud service.
This sort of price optimization is optional with Kinesis. You could just calculate the number of shards you need and never really change it. This is similar to how you’d allocate partitions in Kafka.
With Kafka, you generally don’t change your number of nodes (brokers) or partitions as dynamically. For example, you wouldn’t shut down nodes because you’re at a low point. You might add partitions once a month or on rare occasion.
Running a Kafka cluster is more of a fixed cost. For a small enterprise cluster, you’d be looking at 3-5 nodes depending on how much node failure you want to survive. At the high point of the day, a Kafka cluster will cost the same as the low point of the day.
The pricing page gives an example cost. If you were to have 4 shards and send in 267,840,000 35 KB messages, that would cost $52.14 per month. That doesn’t include extended storage costs. For 7 days, that will cost another $59.52. The total cost per month to have 4 shards, send 267,840,000 35 KB records, and store for 7 days is $111.66.
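The arithmetic behind that example can be reproduced with a short sketch. The rates below are assumptions reflecting the pricing at the time of writing ($0.015 per shard-hour, $0.014 per million 25 KB PUT payload units, $0.020 per shard-hour for extended retention); check the AWS pricing page for current numbers.

```python
import math

SHARD_HOUR = 0.015          # $ per shard-hour (assumed rate)
PUT_PAYLOAD_UNIT = 0.014    # $ per million 25 KB PUT payload units (assumed rate)
EXTENDED_RETENTION = 0.020  # $ per shard-hour beyond 24-hour retention (assumed rate)

def monthly_cost(shards, records, record_kb, extended_retention=False, hours=24 * 31):
    shard_cost = shards * hours * SHARD_HOUR
    # PUT payloads are billed in 25 KB units, so a 35 KB record counts as 2 units
    put_units = records * math.ceil(record_kb / 25)
    put_cost = put_units / 1_000_000 * PUT_PAYLOAD_UNIT
    retention_cost = shards * hours * EXTENDED_RETENTION if extended_retention else 0.0
    return round(shard_cost + put_cost + retention_cost, 2)

print(monthly_cost(4, 267_840_000, 35))                           # 52.14
print(monthly_cost(4, 267_840_000, 35, extended_retention=True))  # 111.66
```

Note how the 25 KB payload unit rounds each 35 KB record up to 2 billable units, nearly doubling the PUT cost for message sizes just over a unit boundary.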
For calculating or comparing costs with Kafka, I recommend creating a price per unit. This will help you understand costs around your systems and help you compare to cloud providers.
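A price-per-unit calculation for a self-managed Kafka cluster might look like the sketch below. All inputs (node count, monthly node cost, message volume) are hypothetical placeholders to plug your own numbers into.

```python
def cost_per_million_messages(nodes, monthly_cost_per_node, messages_per_month):
    """Rough price-per-unit for a self-managed Kafka cluster. The cluster cost
    is fixed: it's the same at the peak and the low point of the day."""
    cluster_cost = nodes * monthly_cost_per_node
    return cluster_cost / (messages_per_month / 1_000_000)

# e.g. 3 hypothetical brokers at $350/month handling 267,840,000 messages
print(round(cost_per_million_messages(3, 350.0, 267_840_000), 2))  # 3.92
```

A per-million-messages figure like this gives you a common denominator for comparing a fixed-cost Kafka cluster against Kinesis' per-shard, per-PUT pricing.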
Normally, your biggest cost center isn’t the messaging technology itself. Usually, it’s wrapped up in the publishing and processing of the messages. The code and distributed system to process the data is where most costs are incurred.
The biggest differences for Data Engineers come down to architecture. Kafka gives you knobs and levers around delivery guarantees; most people aim for at-least-once delivery. Kinesis guarantees at-least-once delivery, and you can't change that programmatically. Both products feature massive scalability.
Misconfiguration or incorrect partitioning can lead to scalability issues in Kafka. Similar issues can arise with Kinesis shards when partition keys are used incorrectly.
Both technologies offer the same guarantees for ordering. Kafka has ordering at a partition level and Kinesis has ordering at a shard level.
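To illustrate shard-level ordering, here is a hedged sketch of how Kinesis routes a record: the partition key is MD5-hashed into a 128-bit integer and matched against a shard's hash key range. The sketch assumes a stream whose shards evenly split the full 128-bit range; the function name and keys are hypothetical.

```python
import hashlib

def shard_for_key(partition_key, num_shards):
    """Map a partition key to a shard index, assuming num_shards shards that
    evenly split the 128-bit MD5 hash key space."""
    hash_key = int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest(), "big")
    range_size = 2**128 // num_shards
    # min() guards against rounding at the very top of the hash key space
    return min(hash_key // range_size, num_shards - 1)

# The same key always lands on the same shard, which is what preserves ordering
print(shard_for_key("user-42", 4) == shard_for_key("user-42", 4))  # True
```

This is why a skewed partition key (say, one hot customer ID) can overload a single shard while the others sit idle, the mis-partitioning problem mentioned above.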
The technologies differ in how they store state about consumers. In Kafka, these are called offsets and are stored in a special topic in Kafka. In Kinesis, this is called checkpointing or application state data and is stored in a DynamoDB table. Both Kafka's offsets and Kinesis' checkpointing are consumer API concepts and may be implemented differently in different access models and APIs.
Choosing a Big Data messaging system is a tough choice. It’s incredibly important to understand your use case. Only a qualified Data Engineer can help you choose the right tools for the job. Make sure that your Data Engineers know about both on-premises and cloud options.
Special thanks to Allan MacInnis for his technical review.