Sometimes I’ll train at a company that’s creating a data engineering team. The team often includes a Data Scientist.
I always make a point of talking to the Data Scientist about their experience and interactions with the team before I arrived. These Data Scientists are recent hires, brought on within the last 6 months. A clear theme is that their time is under-utilized. They’ve been waiting 2-6 months for a Data Engineer to create the data pipeline for them.
The trouble is that the definition of Data Scientist is highly variable. For some, it means a person with math skills who also has some programming skills. Among Data Scientists, programming and distributed systems skill levels vary enormously. They range from people with a CS degree to beginner programmers.
The beginner-to-intermediate programmers will have the most difficulty creating the data pipeline. Creating a data pipeline is a complex endeavor, and they lack the programming, distributed systems, and Big Data skills it requires. What they don’t lack is the math or statistical skills.
This skills gap leads to issues all around. The Data Scientist expected the data pipeline to already be in place when they were hired. They’re used to creating models, not doing the hardcore data engineering that’s needed. They’re consumers of the pipeline, not its creators. The company and its managers, meanwhile, are expecting the Data Scientist to create the data pipeline.
When I’ve encountered this issue, the Data Scientist has been idle for 2-6 months. After about 6 months, they quit. They haven’t done any of the cool stuff they thought they were signing on for. At small companies, this spells the end of the Big Data foray.
My suggestion is to make sure you have a data pipeline before hiring your first Data Scientist. This will require you to create a data engineering team before, or at the same time as, you create a data science team. At a minimum, you need to inventory your datasets and make them available before hiring a Data Scientist.
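To make that minimum bar a little more concrete, here is a rough sketch of what a first-pass dataset inventory could look like. It assumes, purely for illustration, that your datasets live as files under a single directory. The function names `inventory_datasets` and `write_inventory` and the three recorded fields are my own hypothetical choices, not a standard. A real inventory would also need to cover databases, owners, and access rules.

```python
import csv
from pathlib import Path


def inventory_datasets(root):
    """Walk `root` and return one record per file found beneath it."""
    root = Path(root)
    records = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            records.append({
                # Path relative to the root, with forward slashes on every OS.
                "path": path.relative_to(root).as_posix(),
                # File extension as a crude stand-in for the data format.
                "format": path.suffix.lstrip(".").lower() or "unknown",
                "size_bytes": path.stat().st_size,
            })
    return records


def write_inventory(records, out_path):
    """Write the inventory as a CSV that a new hire can open and scan."""
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["path", "format", "size_bytes"])
        writer.writeheader()
        writer.writerows(records)
```

Calling `write_inventory(inventory_datasets("/data"), "inventory.csv")`, where `/data` is a placeholder path, would give a new Data Scientist a single file listing what exists and how large it is, which is the starting point for the conversation about what they can actually work with.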