Data engineers are there, can you see them?

Guillaume Payen
Towards Data Science
May 23, 2018

There is high demand for data engineers these days. I see job offers flourishing on LinkedIn. Yet recruiting a data engineer is quite hard. I hear many people say this comes down to the balance of supply and demand. Big data is trending all around the world, and there is a strong need for engineers able to tame this data. That is undeniable. The reason I am writing this post is that I strongly believe there is something else, or rather something more.

Did you say data engineer?

After looking at job offers on the Internet, I came to the conclusion that the data engineer position is quite hard to define. I read widely varying responsibilities and skill requirements for the same so-called data engineer, and it really looked like every company had its own definition of what a data engineer is. I have seen job descriptions that were very close to those of an IT engineer. Others were about software development. Surprisingly, I have even read about companies looking for a data scientist under the title of data engineer.

This fuzziness brings a lot of confusion. I find it unsettling, even though I have been working as a data engineer for some time now.

I definitely think there is something missing in the definition of what a data engineer is, and here’s my take.

The missing point

Most job offers I see require technical skills such as ETL, Spark, Hadoop, or NoSQL. That is great, but these are just tools. What we are talking about here are means to an end. And that end is data processing, which is, to me, exactly what this is all about. I have been interviewed many times, for a job or a contracting assignment, and I always got the same questions: What are Scala companion objects? What is a DStream? How would you configure partitions in Kafka? Those questions are really interesting (well, I find them interesting because I love this field of computing), but during those interviews people were checking that I knew the tools, that I ticked the boxes, and hardly anyone spoke about the value I could extract from them.

Take Apache Spark as an example. Spark has become one of the first-choice tools when you want to process data. Good, but the point of Spark is not writing jobs. You can code the same Spark job in four different languages (Scala, Java, Python, or R). Writing Spark code has become really accessible, but for your project and your end users, does this ensure your data will be processed in the most efficient way? It does not. And this is where the data engineer comes in. The real value of the data engineer is the expertise to deploy and distribute a job across the cluster, with a resource configuration that uses the cluster at its best.
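To make the idea of resource configuration a little more concrete, here is a minimal sketch in plain Python of a common sizing rule of thumb. It is only an illustration under stated assumptions, not a recommendation for any particular cluster: the function names `plan_executors` and `to_spark_conf` are hypothetical, and the reserved core, the reserved gigabyte, the five cores per executor and the 10% memory margin are heuristics that should be revisited for each workload. The only real Spark names used are the configuration keys `spark.executor.instances`, `spark.executor.cores` and `spark.executor.memory`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutorPlan:
    instances: int   # total number of executors for the job
    cores: int       # cores per executor
    memory_gb: int   # heap memory per executor, in GB


def plan_executors(nodes: int, cores_per_node: int, mem_gb_per_node: int,
                   cores_per_executor: int = 5) -> ExecutorPlan:
    """Derive an executor layout from the cluster shape (rule of thumb only)."""
    # Assumption: keep one core and 1 GB per node for the OS and cluster daemons.
    usable_cores = cores_per_node - 1
    usable_mem_gb = mem_gb_per_node - 1

    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("node too small for the requested cores per executor")

    # Assumption: leave one executor slot free for the driver / application master.
    instances = nodes * executors_per_node - 1
    if instances < 1:
        raise ValueError("cluster too small to run any executor")

    # Assumption: keep about 10% of each executor's memory share for overhead.
    memory_gb = (usable_mem_gb * 9) // (executors_per_node * 10)
    return ExecutorPlan(instances, cores_per_executor, memory_gb)


def to_spark_conf(plan: ExecutorPlan) -> dict:
    """Express the plan as Spark configuration key/value pairs."""
    return {
        "spark.executor.instances": str(plan.instances),
        "spark.executor.cores": str(plan.cores),
        "spark.executor.memory": f"{plan.memory_gb}g",
    }


if __name__ == "__main__":
    # Hypothetical cluster: 10 nodes, 16 cores and 64 GB of RAM each.
    plan = plan_executors(nodes=10, cores_per_node=16, mem_gb_per_node=64)
    for key, value in to_spark_conf(plan).items():
        print(f"--conf {key}={value}")
```

The printed pairs could be passed to `spark-submit` through its `--conf key=value` option. The point is not the numbers themselves but the reasoning: the same job can behave very differently depending on how it is laid out on the cluster, and that layout has to come from knowing the data and the workload.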

Bring me some value

To be concise, a data engineer’s value lies in why, when, where and how the data will be processed. So let me say the big word once and for all. Yes, in the end, all of this is a matter of data, and this is exactly what a data engineer is here for. I believe that data engineers should not think in terms of tools and techniques, but rather in terms of architecture, and choose tools like books on a shelf. The real purpose of a data engineer is to understand the data, design the way it should be processed, and then facilitate that processing by choosing the tool configuration that will ensure reliability and performance.

Of course, data engineers need a good understanding of the environment and the context. A lot of experience is required to choose credibly among all of those cutting-edge tools. But knowing the tools brings no guarantee about the performance of your data pipeline. It can also lead to misunderstandings about what kind of skills you really need to hire for. Maintaining a Hadoop cluster is not data engineering, it is IT. Developing exclusively on Spark is not data engineering, it is development. Building models with MLlib is not data engineering, it is machine learning.

When I interview a candidate for a data engineering job, I try to focus on how he sees the interaction between his skills and the data he will have to process. Learning a tool can take weeks, maybe months. Acquiring the right mindset takes years.

Expect more than mastering tools and frameworks

This is why I think finding data engineers takes more than validating expertise in the traditional big data stack. It is really a matter of what you want to focus on. If your goal is to find the expert among experts, the one, then fine, and I agree with you: your search is going to be a long journey. But if you focus on what the engineer can deliver because he focuses on the data rather than the tools, you may well have found a very good candidate.

In the end, as clients become more and more demanding, and data becomes more and more complex to process, I tend to believe that there could be many more data engineers than there currently are. Many engineers get discarded in interviews because they don’t know enough about the inner workings of HDFS, or they don’t know enough about Scala. Are you rejecting the candidate because he doesn’t know the tool? Wouldn’t it be more relevant to test his interest in data, and his focus on how he can extract value from it? A person with the right mindset will achieve so much more than just using tools…

Let me share an experience of mine to explain why I believe that data engineers are all about data. A couple of years ago, in my former job, I faced quite a strange situation: I was working on a project where there was no data! I was ready to rock, with my Kafka setup, my Ansible scripts, my Spark cluster and my jobs ready to run. The thing is, there was no data. At that moment, my job looked so meaningless that I realized that the most important thing, above everything else, was data.

Why are data engineers all about data? Because without data, data engineers lose what makes them so important. We call this relevance.

I live in Paris, France, and I am a Data/Cloud architect and machine learning engineer. I’m glad I can share my projects and experiments with you on Medium :)