Saving Money with Apache Pulsar Tiered Storage

As companies start to look at rolling out real-time messaging systems, it’s important to look at the overall hardware costs. With some forward planning, companies can save as much as 85% on their overall storage costs. Before we start getting into the cost comparisons, let me briefly show how Apache Kafka and Apache Pulsar store […]

Q and A: Viewpoints on Open Source

There are diverse viewpoints on open source and its usage as a service. I’ve attempted to give a synopsis of the issues and some background, but that’s only my viewpoint. I’m bringing in other people to share their viewpoints and give a more well-rounded picture. This stems from this Twitter thread. The […]

The Three Components of a Big Data Data Pipeline

There’s a common misconception in Big Data that you only need one technology to do everything that’s necessary for a data pipeline, and that’s incorrect. Data Engineering != Spark The misconception that Apache Spark is all you’ll need for your data pipeline is common. The […]

Advice for Small Teams and Startups on Data Engineering

Small data engineering teams require different tactics. Much of my writing is geared towards larger companies and teams. How should a startup or small data engineering team in a big company be set up and work? What, if anything, should be done differently? Your First Data Engineer Your first data engineering hire is a crucial […]

Creating a Data Engineering Culture

At DataEngConf Barcelona, I premiered a new talk about the importance of creating a data engineering culture. I share what a data engineering culture is and what management needs to do to be successful with Big Data.
Here is the video from the conferen…

Why You Can’t Do All of Your Data Engineering with SQL

There is a common misunderstanding in data engineering that you can do everything you need to create a Big Data data pipeline with SQL. This notion is being promoted by some vendors and companies. They’re wrong, and you can’t do all of your data engineering with SQL. You will eventually need a programming language to […]

Thoughts on Cloudera Merging/Buying Hortonworks

Cloudera has merged with/purchased Hortonworks. As a former Clouderan, it’s interesting to see this move on several levels. I’m going to share my insights from the outside as a former insider. Full Disclosure: Although I’m a former Clouderan, I don’t own any shares of Cloudera or Hortonworks and don’t plan to purchase any in the short term. […]

Creating Work Queues with Apache Kafka and Apache Pulsar

A common use case for Kafka and Pulsar is to create work queues. The two technologies offer different implementations for accomplishing this use case. I’ll discuss the ways of implementing work queues in Kafka and Pulsar as well as the relative strengths of each. What are work queues? A work queue is […]

What is a Data Pipeline?

I’ve been seeing some questions about data pipelines lately. I realized I haven’t written a post that gives the level of detail necessary for a good definition of a data pipeline in the…