We Don’t Need Data Engineers, We Need Better Tools for Data Scientists

The role of Data Engineer exists as we know it because of a lack of adequate tooling for Data Scientists

In most companies, Data Engineers support the Data Scientists in various ways. Often this means translating or productionizing the notebooks and scripts that a Data Scientist has written. A large portion of the Data Engineer’s role could be replaced with better tooling for Data Scientists, freeing Data Engineers to do more impactful (and scalable) work.

Why does this matter?

There’s a sentiment making its way around the internet (again): We don’t need Data Scientists, we need Data Engineers.

These articles focus on the number of available job positions for the title of “Data Engineer” vs “Data Scientist”. Let’s put aside the fact that the hiring managers who post these positions often don’t know the difference between the two jobs and use them interchangeably (or use whatever is in style at the moment). For the sake of this article we can take the existence of the positions at face value. The question then becomes: Is the surplus of available Data Engineer positions solely a personnel problem?

Data Science is messy because it reflects the real world

Data Scientists are domain experts (on top of knowing statistics), and they often don’t have a strong background in programming. I’ve seen this expertise discounted in multiple Twitter and forum threads, with software engineers and other “technical people” asking questions like “Why don’t they just learn Spark?” This mentality misses the fact that Data Scientists can already do what they want to do at smaller scales with their existing tools. Data Scientists want to gain insights, not worry about building elegant pipelines. Companies want something actionable, not beautiful.

Insights are more important than elegant pipelines.

Popular Data Science tools are also criticized by more technical people and academics: “Why would anyone use pandas?” pandas must be the tool most often hated by people who have no use for it. The Data Scientists who use it daily, however, love it (or at least appreciate it).

If pandas is so bad, why has nothing unseated it?

pandas, among other tools, was built to handle the messiness of the real world. Just look at how many parameters read_csv has:

https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
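To make the point concrete, here is a minimal sketch of the kind of file those parameters exist for. The export below is made up for illustration (a hypothetical legacy system that writes semicolon-separated values, European number formatting, a comment line, and its own marker for missing values); the read_csv parameters themselves are real.

```python
import io

import pandas as pd

# Made-up export from a hypothetical legacy system: a comment line,
# semicolon separators, decimal commas, day-first dates, and "unknown"
# used as the missing-value marker.
raw = """# exported by a hypothetical legacy system
store;date;revenue;note
A;01/02/2021;1.234,50;ok
B;02/02/2021;unknown;missing till receipt
C;03/02/2021;987,00;
"""

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",                # fields are separated by semicolons
    comment="#",            # skip the export banner
    decimal=",",            # "1.234,50" uses a decimal comma...
    thousands=".",          # ...and a dot as the thousands separator
    na_values=["unknown"],  # treat this system's marker as missing
    parse_dates=["date"],
    dayfirst=True,          # 01/02/2021 is 1 February, not 2 January
)

print(df.dtypes)
print(df)
```

Every argument here corresponds to one real-world quirk of the file, which is the sort of messiness a Data Scientist runs into daily and that pandas absorbs in a single call.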

If pandas is so bad, why has nothing unseated it as the standard dataframe for Python Data Science? Why does it continue to grow in adoption year after year? It’s not the fastest, it’s not the most robust, so why?

Data Engineers have to handle the messiness that scalable tools can’t

The scalable systems (e.g. Apache Spark) that are robust enough for production use can’t handle the messiness of the real world as-is. It’s difficult to scale without clean and simple assumptions, and the messier the problem, the harder it is to scale. Data Engineers handle the messiness because scalable tools can’t.

Scaling with messiness is extremely difficult. Data Engineers handle the messiness because the tools can’t.

Messiness in this case can mean:

  • Group/join key skew
  • Partitioning
  • Debugging distributed systems
  • Cluster configuration and resource provisioning

None of these are things that you have to worry about with smaller-scale systems. Outside of the Bay Area, most Data Engineers spend their time debugging and translating code to a distributed system, usually Spark.
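As a rough illustration of the first item, key skew, the sketch below simulates how rows would be spread across workers when they are partitioned by a hash of the key. Everything in it is an assumption made for the example: the data is invented, the partition count of 8 is arbitrary, and crc32 stands in for whatever hash function a real engine uses.

```python
import zlib
from collections import Counter

# Invented event log: one "hot" customer owns 90% of the rows.
keys = ["acme"] * 9_000 + [f"customer_{i}" for i in range(1_000)]

NUM_PARTITIONS = 8  # assumed number of workers


def partition_of(key: str, n: int = NUM_PARTITIONS) -> int:
    """Assign a key to a partition by hash (crc32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % n


# Every row with the same key lands on the same partition, so the
# partition that receives "acme" holds at least 9,000 of 10,000 rows.
rows_per_partition = Counter(partition_of(k) for k in keys)

largest = max(rows_per_partition.values())
ideal = len(keys) / NUM_PARTITIONS  # perfectly even split
skew_ratio = largest / ideal

print(f"largest partition: {largest} rows, ideal: {ideal:.0f} rows")
print(f"skew ratio: {skew_ratio:.1f}x")
```

On a single machine the same group-by is unaffected by this distribution. In a distributed job, one worker ends up doing most of the work while the others sit idle, and someone has to notice that and work around it.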

Multiple rewrites are necessary to turn one-time insights into production jobs.

We can’t really fault anyone here. The people who built the scalable tools in use today were building for highly technical users like themselves. Highly technical people don’t need their tooling to handle messiness for them, and often they want knobs to tune. Dogfooding is a popular concept in systems engineering: “those who built it also use it”. I think worrying so heavily about dogfooding is part of what caused the landscape we see in data science today: “only people as technical as those who built the system can use it”.

What, then, should Data Engineers do?

The Data Science ecosystem needs systems that don’t focus only on the problems of those building them. Data Scientists have been mostly stuck using the same or similar tools for the last 10+ years. The explanation for this is twofold: (1) Data Scientists love using their existing tools because they understand them, and (2) those who are capable of building large-scale systems have largely (and unintentionally) overlooked the problems of those less technical than they are.

We need Data Engineers to help build tools that empower Data Scientists, not translate pandas to Spark.

We need Data Engineers to help build scalable tools that empower Data Scientists, not translate pandas to Spark. Who better to help build the next generation of Data Science tools than today’s Data Engineers?
