The Two Types of Data Engineering

There are two different types of data engineering, and two different types of jobs with the title data engineer. This is especially confusing to organizations and individuals who are just starting to learn about data engineering. This confusion leads to the failure of many teams’ Big Data projects. Types of Data Engineering The first […]

Why Real-time is the Future

One of the benefits of teaching and consulting is the sheer number of organizations, teams, and people I get to work with. Since I deal with so many different groups, I can see patterns emerge much faster than others. One pattern I saw early on was real-time Big Data. Organizations wanted to do things in […]

The Four Types of Technologies You Need for Real-time Big Data Systems

Creating real-time data pipelines brings new challenges. There are new concepts and technologies that you’ll need to learn and understand. To help you understand the basic technologies you need in a real-time data pipeline, I break them down into 4 general types. These types are: processors, analytics, ingestion and dissemination, and storage. Processors A processor is […]

What Are Batch and Real-time Big Data?

The move from batch to real-time Big Data represents change. It will entail using brand new technologies and concepts that you haven’t dealt with before. Batch Big Data Let’s start off by defining batch Big Data. For batch, all data must be there when the processing starts. Batch processes can run over fixed periods of […]
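As a rough illustration of the batch definition above, here is a minimal Python sketch. The function names and the sample data are hypothetical, and the real-time half is my own assumption about the contrast, since the excerpt is cut off before it defines real-time. The batch function needs the complete list before it starts; the other function updates its result as each event arrives.

```python
from typing import Iterable, Iterator, List


def batch_total(events: List[int]) -> int:
    """Batch: the full dataset must exist before processing starts."""
    return sum(events)


def running_totals(events: Iterable[int]) -> Iterator[int]:
    """Real-time (illustrative assumption): emit an updated total per event."""
    total = 0
    for value in events:
        total += value
        yield total


if __name__ == "__main__":
    data = [3, 1, 4]  # hypothetical sample events
    print(batch_total(data))                  # one result, after all data is in
    print(list(running_totals(iter(data))))   # one result per arriving event
```

The batch version produces a single answer once everything is available, while the incremental version can report a result before the last event has arrived.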

Data engineers vs. data scientists

I wrote a post for the O’Reilly data blog going into my latest thoughts and views on data engineers versus data scientists. I continue on to talk about machine learning engineers.

Should You Even Do Big Data?

There’s an elephant in the room with Big Data. If an organization tries to half-ass their way through a Big Data project, they’re going to fail (usually 5-10% odds of success). Given this really low success rate, should you even do Big Data? When I worked at a Big Data vendor, I couldn’t tell […]

Unit Testing Kafka Streams

Unit testing your Kafka code is incredibly important. I’ve already written about integration testing, consumer testing, and producer testing. Now, I’m going to share how to unit test your Kafka Streams code. To start off with, you will need to change your Maven pom.xml file. You’ll need to include the test libraries for Kafka Streams […]

Are Your Programming Skills Ready for Big Data?

As people start with Big Data, they go through the list of necessary skills. One of those crucial skills is programming. The question arises: how good do a person’s programming skills need to be? The question matters because programming skills fall on a wide spectrum. There are people who are: Brand new to programming […]

The Veteran Skill on a Data Engineering Team

In my book Data Engineering Teams, I talk about a skill that’s often overlooked and unknown to data engineering teams. Teams often don’t know they need a veteran, think they can’t afford a veteran, or don’t understand why they need a veteran on the team. In Chapter 3 “Data Engineering Teams,” I give my definition […]

How Much More Complicated Is Real-Time Big Data?

In my seminal post On Complexity in Big Data, I talked about how much complexity increases with Big Data. The post itself focused on Big Data batch systems. I didn’t really cover the increase in complexity for real-time Big Data. In the post, I argue that Big Data batch is 10x more complex than […]