
In my book Data Engineering Teams, I treat programming as a separate skill from distributed systems. The section, “Skills Needed in a Team,” covers the various skills that a data engineering team needs.

Several people have emailed me for clarification about this distinction. Aren’t programming and distributed systems the same thing? How are they different?


I’ll start with my definition of programming within Big Data.

I find there are three general types of programmers: “coders,” simple programmers, and advanced programmers.

I wrote an entire article about the programming skills needed for Big Data. That article will help you determine which category each member of your team falls into.

Distributed Systems

A common misconception is that Big Data frameworks make it dead simple to do Big Data. In reality, they make it easier, but not dead simple. Creating a solution is still very complicated. The frameworks just let you concentrate on the code instead of the RPCs and threading.
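To make the split between “the code” and the plumbing concrete, here is a minimal Python sketch. It is an illustration only, not how any particular framework works: the function names `count_words` and `run_in_parallel` are hypothetical, and a thread pool on one machine stands in for the RPCs, scheduling, and failure handling that a real cluster needs.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(lines):
    """The "code" part: the logic a framework lets you concentrate on."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def run_in_parallel(lines, workers=2):
    """The plumbing part: split the input, dispatch the work, merge results.

    Hypothetical stand-in. Local threads replace the RPCs and threading
    that a Big Data framework would manage across many machines.
    """
    # Split the input into one chunk per worker.
    chunks = [lines[i::workers] for i in range(workers)]
    # Dispatch each chunk and collect the partial counts.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, chunks))
    # Merge the partial counts into one result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


lines = ["big data is big", "data pipelines move data"]
result = run_in_parallel(lines)
```

Even in this toy version, `count_words` is the easy part. Deciding how to split the data, what happens when a worker fails, and how to merge results at scale is where the distributed systems skill comes in, and the framework only takes over part of that work.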

In my experience, the companies that think Big Data frameworks make things easy are the most likely to fail. They assign teams and individual contributors who lack the skills to create the solution. That is the skills gap I talk about in the book, and skills gaps lead to failure.

How Are They Different?

The two skills are different and are not often found in the same team member. For your team to succeed, you will need at least one person with both the programming and distributed systems skills.

I gave a list of types of programmers. Let me show you how each type relates to distributed systems skills.

The “coders” don’t have the distributed systems skills to create a data pipeline. They’re usually the consumers of the data pipeline.

The simple programmers rarely have the distributed systems skills. They’re usually the consumers of the data pipeline.

The advanced programmers have the highest probability of having the distributed systems skills, though it’s not guaranteed. They’re the ones creating the data pipeline. They both consume the data pipeline and create value from it. They also help the other programmers when they get stuck working with the data pipeline.