
The Most Important Trait of Great Data Engineers

Are You Curious to Find Out?

Photo by Marko Blažević on Unsplash

All the great data engineers I’ve spoken with or read have a single trait in common. Conversely, all the not-so-good data engineers are missing this one key ingredient. It is not a skill or knowledge of some specific tool, but an innate quality. Data engineers solve new problems every day, and without this most important trait, we will likely fail to deliver anything of value.

Now, we’ve already talked about the most important skill for data engineers, and this trait is related. Without this key ingredient, you will likely struggle to learn new tools and processes. Let’s start with an example. I needed to move files from an FTP server to Snowflake. The files were in zip format, and some were ~1 GB in size. Snowflake does not work with the zip format and likes files to be 100 MB or smaller. This creates a fun little challenge.

Data flow for FTP to Snowflake Workflow

I needed to download zip files from FTP, unzip them, split them into smaller files, then upload them to a Snowflake stage. New files were created daily, so this needed to be automated. Since we were not using Airflow, I could not use existing operators, and Fivetran seemed like an expensive solution to a small problem. It seemed like a great task for Python, my go-to scripting language. But then my manager challenged me.

Thinking Different

“Can you do this without writing a new application?” Hmm, I didn’t think so. It seemed like an easy case for scripting. Still, the question made me curious. On closer inspection, this would not be a simple Python script. I’d have to query the FTP server to see what files exist, download them, unzip them, then split them into smaller files. That is quite a bit of coding once you consider everything that can go wrong in the process. Maybe there was a better way. My curiosity was piqued.

Someone must have already solved many of the problems I faced. I found that LFTP has a mirror command that will download any new files from an FTP server. Linux has a split command purpose-built for splitting text files into smaller chunks. Maybe Python wasn’t the right solution. To top it off, we already had Argo configured in a Kubernetes cluster, so workflow management already existed. Long story short, I replaced myself with a very small shell script. A couple of them, actually. So, what does this have to do with the most important trait of data engineers?
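For readers who want to see the shape of it, here is a minimal sketch of how those pieces could fit together. It is not the original script: the function names, FTP URL, directory paths, and stage name are all hypothetical placeholders, and it assumes lftp, unzip, GNU split, and a snowsql client with a default connection already configured.

```shell
# Hypothetical sketch, not the author's script. All names and paths are placeholders.
set -eu

# Download only files that are newer than what we already have locally,
# using lftp's mirror command.
fetch_new_zips() {
  local ftp_url="$1" remote_dir="$2" local_dir="$3"
  mkdir -p "$local_dir"
  lftp -e "mirror --only-newer '$remote_dir' '$local_dir'; quit" "$ftp_url"
}

# Split one text file into numbered chunks of at most max_size bytes.
# -C breaks only at line boundaries, so no record is cut in half.
split_file() {
  local src="$1" out_dir="$2" max_size="${3:-100M}"
  mkdir -p "$out_dir"
  split -C "$max_size" -d "$src" "$out_dir/$(basename "$src")."
}

# Upload every chunk in a directory to a Snowflake stage.
# Assumes chunk_dir is an absolute path and snowsql can connect without prompts.
upload_chunks() {
  local chunk_dir="$1" stage="$2"
  snowsql -q "PUT file://$chunk_dir/* @$stage AUTO_COMPRESS=TRUE"
}

# Example run (placeholders throughout):
#   fetch_new_zips "ftp://user@ftp.example.com" "/exports" "/data/zips"
#   for z in /data/zips/*.zip; do unzip -q -o "$z" -d /data/raw; done
#   for f in /data/raw/*; do split_file "$f" /data/chunks; done
#   upload_chunks /data/chunks my_stage
```

Splitting with -C rather than -b keeps each line whole, which matters for delimited text. If the files carry a header row, only the first chunk will contain it, so the load step would need to account for that.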

What Is It Already!

I think you know what we’re getting at. It’s curiosity. Are you curious by nature? Do you love figuring out how things work? Coming up with clever ways to solve difficult problems? Yes? Great, you will be an excellent data engineer, because curiosity is the base trait all great data engineers have in common. We can’t rely on existing patterns to solve every problem, so we must come up with our own. To do that, we need to be curious. We need to continually expose ourselves to new tools and processes. New ways of doing things.

Data engineers solve new problems every day, and without curiosity, we will likely fail to deliver anything of value. Curious data engineers have read the blogs, watched the videos, and played with the new tools. They have re-evaluated older technologies and found new ways to make use of them. When challenges arise, they research and use their existing knowledge to find clever solutions. This is why all the great data engineers I know are curious by nature. But they don’t kill any cats.

