I’ve been seeing some questions about data pipelines lately. I realized I haven’t written a post that gives the level of detail necessary for a good definition of a data pipeline in the context of data engineering.
Instead of just giving my opinion, I’ve brought together some of the greatest minds in Big Data to give their opinions too. As you’ll see, it’s more difficult to give a single definition than you might think. Here are the people who were kind enough to respond:
Paco Nathan (PN) – O’Reilly author, Evil Mad Scientist at Derwen.ai, former Director of O’Reilly’s Learning Group
Jamie Grier (JG) – Senior Software Engineer at Lyft, former Director of Software Engineering at Data Artisans and Twitter
Mark Grover (MG) – O’Reilly author, Product Manager at Lyft, formerly at Cloudera
Ry Walker (RW) – CEO of Astronomer
Dean Wampler (DW) – O’Reilly author, VP of Fast Data at Lightbend
Russell Jurney (RJ) – O’Reilly author, Principal Consultant at Data Syndrome, former Senior Data Scientist at LinkedIn
Evan Chan (EC) – Software Engineer at Apple, former Distinguished Engineer at tuplejump
Davor Bonaci (DB) – VP of Apache Beam, former Senior Software Engineer at Google
Tyler Akidau (TA) – O’Reilly author, Technical Lead for the Data Processing Languages & Systems group at Google
Jesse Anderson (JA) – Me/Myself/I
Stephen O’Sullivan (SO) – former VP, Engineering at SVDS
What’s your definition of a data pipeline?
Data Pipeline – An arbitrarily complex chain of processes that manipulate data, where the output data of one process becomes the input to the next.
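That chained definition can be sketched as function composition; the stage names here (parse, filter, count) are purely hypothetical examples:

```python
from functools import reduce

def build_pipeline(*stages):
    """Compose processes so each stage's output becomes the next stage's input."""
    def run(data):
        return reduce(lambda acc, stage: stage(acc), stages, data)
    return run

# Hypothetical stages: parse CSV-ish lines, drop malformed rows, count survivors
parse = lambda lines: [line.split(",") for line in lines]
keep_valid = lambda rows: [r for r in rows if len(r) == 2]
count = lambda rows: len(rows)

pipeline = build_pipeline(parse, keep_valid, count)
print(pipeline(["a,1", "b,2", "bad"]))  # → 2
```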
IMHO ETL is just one of many types of data pipelines — but that also depends on how you define ETL
This term is overloaded. For example, the Spark project uses it very specifically for ML pipelines, although some of the characteristics are similar.
I consider a pipeline to have these characteristics:
- 1 or more data inputs
- 1 or more data outputs
- optional filtering
- optional transformation, including schema changes (adding or removing fields) and transforming the format
- optional aggregation, including group bys, joins, and statistics
- robustness features
- resiliency against failure
- when any part of the pipeline fails, automated recovery attempts to repair the issue
- when an interrupted pipeline resumes normal operation, it tries to pick up where it left off, subject to these requirements:
- If at least once delivery is required, then the pipeline ensures that processing of each record happens at least once, involving some sort of acknowledgement
- If at most once delivery is required, the pipeline can start after the last record that it read at the beginning of the pipeline
- If exactly (effectively?) once delivery is required, the pipeline uses deduplication mechanisms with at least once to output a result once and only once (subject to the fact that it’s impossible to make this guarantee for all possible scenarios)
- management and monitoring hooks allow issues, as well as normal operational characteristics, like performance criteria, to be available
I wouldn’t necessarily add latency criteria to the basic definition. Sometimes a pipeline is “watch this directory and process each file that shows up.”
I think real ETL jobs are pipelines, because they must satisfy these criteria. Depending on how broadly you define ETL, all pipelines could be ETL jobs.
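The delivery guarantees listed above can be illustrated with a toy sketch: at-least-once delivery (retry until acknowledged) combined with a deduplicating, idempotent sink gives effectively-once output. This is a minimal illustration under those assumptions, not any particular system’s implementation:

```python
def at_least_once_deliver(records, process, max_retries=3):
    """Retry each record until processing is acknowledged (may duplicate work)."""
    for rec in records:
        for _ in range(max_retries):
            if process(rec):  # a True return is the acknowledgement
                break

class DedupSink:
    """Idempotent sink: at-least-once delivery + dedup ≈ effectively-once output."""
    def __init__(self):
        self.seen = set()
        self.output = []

    def write(self, record_id, value):
        if record_id in self.seen:   # duplicate from a retry or replay: drop it
            return True
        self.seen.add(record_id)
        self.output.append(value)
        return True                  # acknowledge

sink = DedupSink()
records = [(1, "a"), (2, "b"), (1, "a")]  # record 1 delivered twice upstream
at_least_once_deliver(records, lambda r: sink.write(*r))
print(sink.output)  # → ['a', 'b']
```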
An ETL process, in my opinion, carries the baggage of the old-school relational ETL tools. It was and is simply a process that picks up data from one system, transforms it, and loads it elsewhere. When I hear the term ETL process, two things ring a bell in my mind – “batch” and “usually periodic”.
When I hear the term data pipeline, I think of something much broader – something that takes data from one system to another, potentially including transformation along the way. However, this includes both newer streaming-style processing and older ETL processes. So, to me, data pipeline is a more generic, encompassing term that includes real-time transformation. One point I would note is that data pipelines don’t have to have a transform. A replication system (like LinkedIn’s Gobblin) still sets up data pipelines. So, while an ETL process almost always has a transformation focus, data pipelines don’t need to have transformations.
I’d define data pipeline more broadly than ETL. An ETL process is a data pipeline, but so is:
- automation of ML training (ex: pull data from warehouse, feed to ML engine as a training set, update results in a production database that’s being used for real-time recommendations)
- data quality pipeline (ex: run a query on the ML-generated values above, confirm they’re within a range of reasonableness, alert if not)
- ingestion from external sources (ex: fetching data from Salesforce API, drop into warehouse ELT)
- metric computation: roll ups of engagement/segmentation metrics
- sessionization: re-ordering events to tell clearer user behavior stories
- data pipelines can be real-time (a Kafka consumer pulls data from Kafka and algorithm coefficients from Redis, runs an ML algorithm, and presents a recommendation to the user in real time)
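As a toy illustration of the sessionization example above, here is a sketch that re-orders `(user, timestamp)` events and splits sessions on an assumed 30-minute idle gap (the gap threshold and event shape are my own assumptions):

```python
from itertools import groupby

SESSION_GAP = 30 * 60  # hypothetical 30-minute inactivity threshold, in seconds

def sessionize(events):
    """Group (user, timestamp) events into per-user sessions split on idle gaps."""
    sessions = []
    # Re-order events so each user's activity is in timestamp order
    events = sorted(events, key=lambda e: (e[0], e[1]))
    for user, user_events in groupby(events, key=lambda e: e[0]):
        current, last_ts = [], None
        for _, ts in user_events:
            if last_ts is not None and ts - last_ts > SESSION_GAP:
                sessions.append((user, current))  # idle gap: close the session
                current = []
            current.append(ts)
            last_ts = ts
        sessions.append((user, current))
    return sessions

events = [("u1", 0), ("u1", 100), ("u1", 5000), ("u2", 50)]
print(sessionize(events))
# → [('u1', [0, 100]), ('u1', [5000]), ('u2', [50])]
```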
What is a pipe? It takes an input (water from the utility), and gives you output (water in your house).
What is a pipeline? Like an oil pipeline? Same thing, but possibly transforms the input to outputs.
So for me, a data pipeline can be thought of as a function which, given some data as input, transforms it and returns the transformed data as output.
I don’t like Data Pipeline. I like Data Processing Pipeline.
Data Processing Pipeline is a collection of instructions to read, transform or write data that is designed to be executed by a data processing engine. Typically used by the Big Data community, the pipeline captures arbitrary processing logic as a directed-acyclic graph of transformations that enables parallel execution on a distributed system.
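That DAG-of-transformations idea can be sketched with a tiny executor. The `graphlib` version here runs nodes sequentially in dependency order; a real engine would run independent nodes (like the two reads below) in parallel on a distributed system. All node names are hypothetical:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def run_dag(transforms, deps):
    """Execute a DAG of transforms in dependency order; each node
    receives the outputs of its upstream nodes as arguments."""
    results = {}
    for node in TopologicalSorter(deps).static_order():
        inputs = [results[d] for d in deps.get(node, ())]
        results[node] = transforms[node](*inputs)
    return results

# Hypothetical pipeline: two independent reads feed one combining transform
transforms = {
    "read_a": lambda: [1, 2, 3],
    "read_b": lambda: [10, 20, 30],
    "sum_pairs": lambda a, b: [x + y for x, y in zip(a, b)],
}
deps = {"sum_pairs": ("read_a", "read_b")}  # node -> its predecessors
print(run_dag(transforms, deps)["sum_pairs"])  # → [11, 22, 33]
```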
A process to take raw data and transform it in a way that is usable by the entire organization. This data is served up by technologies that are the right tools for the job and that are correct for the use cases. The data itself is available in formats that keep in mind the changing nature of data and the enterprise demand for data.
I view data pipelines relatively broadly as being any software system that takes data from one or more inputs and transforms it in some way before writing it to one or more outputs. This can mean something like a Hadoop, Spark, Flink, Beam, etc. pipeline reading from and writing to some type of distributed data store, but can also mean a set of serverless functions operating on HTTP calls, some hand-rolled job reading and writing Kafka, or even your frontend WebUI servers taking user input and turning that into database writes. The reason I lump them all under the same umbrella is that they’re all doing the same task at the core: reading input data, transforming it in some way, and writing it as output data. And as such, they all struggle in similar (though sometimes different) ways with difficulties like duplicate detection and output consistency. And they all present similar challenges in understanding things like progress, latency, completeness, and correctness.
I think of data pipelines as a way to move data from point A to point B (C, D, E, …), with or without transformations, in real time or in batch. A pipeline has a guarantee of not losing data (I don’t like leaky pipes), and you have full traceability of the pipeline and its performance (you’ve got to know when it’s slowing down, leaking, or has stopped working).
Is just a relational database a data pipeline? Why or why not?
There’s a lot of overlap in the features required. My sense of a relational database is that there’s a query engine and optimizer which has been through lots of testing and iteration; that’s not exactly needed in a data pipeline. For example, the concept of JOIN seems out of scope. One could make the Kafka-esque arguments and start to get into some indexing; even so, those are two different kinds of feature sets, and IMHO one-size-fits-all (OSFA) tends to hurt more than help.
Nope. In order to be a pipeline there has to be at least two stages that are “pipelined”.
No – there’s no concept of picking data from one place and dropping it in another. I suppose one could say that data is being picked from the same place and dropped there too, but then the focus is on the transformation, not on delivery.
It’s the end point and it might provide some of the processing inside, but I don’t think it’s useful to think of a database as a pipeline itself. That stretches the definition too much to be useful, IMHO.
I would say that a relational database isn’t a data pipeline. It is one place, one destination for data. Data pipelines tend to move data between more than one place. You might move data from one table to another, and that could be a data pipeline… but that kind of thing is certainly in the minority of data pipelines out there.
Database <-> data storage. Pipeline <-> data processing. Data at rest vs. data in motion.
No, just putting your data into a relational database isn’t a data pipeline. This violates my definition of a data pipeline because it assumes that a relational database is the right tool for every enterprise use case. That isn’t the case, and it shows very steady-state thinking about your data and its growth.
(TA) The storage layer of a database is not a data pipeline, but query execution and materialized view layers very much are in my opinion. SQL queries can express some remarkably complicated data processing pipelines, and the sophistication with which database systems are able to optimize and physically execute those queries is quite impressive. Is this any different than a Spark SQL pipeline? I don’t think so. There’s also a nice symmetry between interactive queries and batch processing, as well as materialized views and a subset of stream processing.
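A small illustration of that point: a single SQL query already expresses a filter-then-aggregate pipeline, executed by the database’s own engine (SQLite here, with made-up table and column names):

```python
import sqlite3

# A single SQL query expressing a filter → aggregate pipeline,
# physically executed and optimized by the database itself.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("u1", 5), ("u1", 7), ("u2", 3), ("u2", -1)])
rows = conn.execute("""
    SELECT user, SUM(amount)      -- aggregate stage
    FROM events
    WHERE amount > 0              -- filter stage
    GROUP BY user
    ORDER BY user
""").fetchall()
print(rows)  # → [('u1', 12), ('u2', 3)]
```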
No. What JG & DB said! (The benefits of coming late to the party.)
Does something have to be automated to be a data pipeline?
Not exactly, though lacking options to automate it’d be much less useful for most of the use cases I encounter.
I would say yes to this.
Nope, I think you can totally have an ad hoc data pipeline. Say you wanted to do a bulk upload (or replay) of historical marketing spend due to an incident; you could set up a one-time data pipeline to do this bulk load (or replay) from your marketing system to your data warehouse.
If it’s not automated, it’s not a pipeline – it’s just a data export.
I could imagine a job that I start myself and, if it fails, have to do the error recovery on manually. So, I’ve stretched the definition of “automated”.
If the pipeline is not automated, it is not connected. It might exist, but it is not operating.
In most communities, yes. Semantically, automation is probably an implementation detail.
It doesn’t have to be, but it absolutely should be. If you have humans with their fingers on the buttons doing manual operations, you will fail.
I’m honestly not sure how to answer this, because it’s not clear to me what level of automation is implied. We’re talking software here, and most of the value in software comes from the automation it provides for doing repetitive and detail-oriented tasks over and over. So in that sense, yes, I think it needs to be automated in the sense that it needs to be defined in software. But there are absolutely cases where manual guidance or intervention may be needed, for example, human-in-the-loop handling of exceptional inputs that fall outside of some predefined range of values that may be handled automatically.
It doesn’t have to be. But if you are not automating it, you are doing it wrong.
Does a data pipeline have to be Big Data to be considered a real data pipeline?
NO. My all-time favorite example is MQSeries by IBM, where one could have credit card transactions in flight, and still boot another mainframe as a new consumer without losing any transactions. Not big, per se; however, it’s exceptionally reliable.
Not at all.
My opinion is biased here since I have mostly worked in big data. The answer here is no; a data pipeline is a data pipeline regardless of data size or use.
It doesn’t have to be big data.
Nope. Don’t tell the RDBMS people!
No. They’re real, large and small.
In the Big Data community, probably. Semantically, not.
They don’t have to be. It’s important that a data engineering team be able to handle both small and Big Data projects. However, throwing too many small data projects their way will be a waste of time and money.
No. Been doing it long before “Big Data”. (Love DW’s comment, as I’m an old RDBMS person )
Is there a complexity cutoff between a simple ETL process and a data pipeline?
I couldn’t articulate it. Perhaps my mainframe IT background is coming through, but ETL is the baseline abstraction to me for a data pipeline — that plus reliability guarantees. Even so, the more contemporary definitions of data pipelines often have much more sophisticated pub-sub features, are tending toward streaming, etc.
Organizational reasons come to mind: opex, reliability, scalability, security, compliance, audits, etc.
I don’t think so.
To me, complexity is not a cutoff. There are ETL pipelines that are data pipelines. Data pipelines are a superset of ETL pipelines. I think once you reduce the focus on transformation and/or do real-time transport/transform, that’s when you can’t use the term ETL pipeline any more.
No complexity cutoff.
I don’t think so.
A data pipeline is a connection for the flow of data between two or more places. You can use that definition.
ETL is Extract, Transform, Load – which denotes the process of extracting data from a source, transforming it to fit your database, and loading it into a table.
ETL is one operation you can perform in a data pipeline. There are many others.
In the Big Data community, an ETL pipeline usually refers to something relatively simple. Semantically, no.
Not in and of itself. The complexity for the ETL/DW route is very low. Think of it as a 1x. All of it can be easily done with command line programs, maybe some minimal Python scripting.
Compare that with the Kafka process. How much more time did you spend just planning how data would be added to Kafka? How much time did you spend verifying you had the right key? Everything in this pipeline is programmatically repeatable and coded. This code should scale with minimal, if any, changes.
Why should a company care that they have a data pipeline versus a non-data pipeline? What value is there for a company to have a data pipeline?
I think the terms are wrong here. You can have events flowing that can model traffic for sessions, like a web session, but I don’t think of those as pipelines.
A company inevitably has many data pipelines. Whether they recognize that fact and manage them as the distinct assets they are is another story entirely. Many companies do not do so, and have an entirely ad hoc or unknown process for operating their data pipelines.
I’d say companies shouldn’t care about the existence of data pipelines. Their existence is not a goal and doesn’t hold any value in its own right. A pipeline is just a tool for achieving something else of value. For example, actionable insights have value, and companies should care about that; a data pipeline is often one of the necessary requirements to achieve that business goal.
A company should care if they don’t have a data pipeline, because they aren’t making full use of their data otherwise. I think there will be two types of companies – those gaining maximum value from data, and all others. Companies without data pipelines will be taking stabs in the dark at any hypothesis. If they do have data, it will be so dirty as to throw all validity of the results into question. A company that has created a data pipeline will be able to gain maximum value and trust their results.
(TA) I’m not even sure what a non-data pipeline is; maybe this is me being pedantic about the definition of “data”. But that aside, I think Davor’s point that data pipelines are a means to an end is the important takeaway here.
Extra Thoughts Worth Reading
ETL is a subset; one specific category of pipeline where the processing logic does “extract-transform-load”. There are other types of pipelines, for example, those that don’t load transformed data into a database.
- Spark and other frameworks and infrastructure pieces such as Kafka are simply distributed pipe functions
- Being a function means it must be reusable. One should be able to inject new inputs and derive new outputs.
- Streams are simply a special case of the function which runs continuously on inputs and continuously yields output
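Those three points can be sketched with generators, where each stage consumes and yields records continuously, and the same stages can be re-wired to a new input source (the stage names are my own illustrative examples):

```python
def stream_pipeline(source, *stages):
    """A stream is the same pipe function running continuously:
    each stage is a generator that consumes and yields records."""
    flow = source
    for stage in stages:
        flow = stage(flow)
    yield from flow

def positive_only(records):
    for r in records:       # continuous filtering
        if r > 0:
            yield r

def doubled(records):
    for r in records:       # continuous transformation
        yield r * 2

# Reusable: inject a new input and derive new output from the same stages
out = list(stream_pipeline(iter([1, -2, 3]), positive_only, doubled))
print(out)  # → [2, 6]
```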
Automated data pipelines are a good proxy for data-drivenness, IMHO. I wrote this a few years ago, and the new Astronomer interface we’re building will highlight this metric for our platform users.
I wrote another post talking about the difficulties I’ve seen organizations have while transitioning from REST and solely database applications to a data pipeline.
Since I seem to have a different opinion than most on the relationship of databases and data pipelines, I will plug a few additional resources of mine on the topic. I’ve given a number of talks about streaming SQL, which dive a bit into the relationship of databases to data pipelines, and are free to watch on YouTube. But I go much deeper on the topic in Chapters 6 and 8 of my Streaming Systems book. I hate to plug something you have to pay for that I gain monetary benefit from, but that’s the only way that content is available currently.