The explosion of data volumes and throughput in the business world has led companies in the direction of data-driven decision making. Historically, data scientists have reaped the benefits of this newfound direction. However, as the role of a data scientist narrows, a new role is coming to the forefront of the data field.
What is data engineering?
Data engineering is a relatively new term and, as such, its definition has yet to be standardized across industries. Some companies will hire a data engineer with responsibilities more akin to a database administrator, while other companies will expect data engineers to have data science capabilities. To ensure we are discussing the same thing, when I refer to data engineering, I use the following definition:
Data engineering is a specialization in software engineering, focused on the storage and flow of data in its raw and/or processed form.
Let’s break down this definition a bit:
"…specialization in software engineering": I find it important to highlight that data engineering is a specialization in software engineering. As a data engineer, you are solving software problems in the data space. The same mindset that a generalist software engineer uses to solve problems is needed when approaching data engineering problems. There is a reason that data engineers still need to understand data structures and algorithms for their interview loops. However, beyond data structures and algorithms, a data engineer also interfaces with databases, microservices, third-party APIs, streaming technologies, etc. A solid foundation in software engineering is essential to being a successful data engineer.
"…focused on the storage and flow": The two areas of interest for most data engineers are the storage of data and its flow, or movement. Data engineers are focused on the movement of data from one or more data sources to one or more data sinks. Therefore, understanding the storage and movement processes is vital to a data engineer’s success, as these will be the prime focus of the day-to-day tasks.
"…in its raw and/or processed form": Data has many different forms in its lifecycle. A data engineer is expected to handle the data no matter the form it takes. The structure of the data is irrelevant; a data engineer is expected to have the toolset required to accomplish the task at hand.
Now that we have aligned on a definition, let’s jump into how we become a data engineer!
How to become a data engineer
Skill 1: Understand the fundamentals of computer science
Since data engineering is a specialty within computer science, it should not be a surprise that you need to thoroughly understand the fundamentals. In my experience, you do not need an in-depth understanding of operating systems or compilers, but you should be able to have an intelligent conversation about, for instance, a solution’s impact on scalability.
Example skills include: data structures, algorithms, parallel processing, etc.
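As a small illustration of how a data structure choice affects scalability, here is a minimal sketch in Python. The function names and sample IDs are made up for this example. Both functions remove duplicate IDs while keeping the first occurrence, but the first scans a list on every lookup (roughly quadratic work overall) while the second uses a set (roughly linear work overall).

```python
def dedupe_with_list(ids):
    """Remove duplicates, keeping first occurrences; each lookup scans a list."""
    seen = []
    for record_id in ids:
        if record_id not in seen:  # linear scan of everything seen so far
            seen.append(record_id)
    return seen


def dedupe_with_set(ids):
    """Same result, but each lookup is a constant-time set membership check."""
    seen = set()
    unique = []
    for record_id in ids:
        if record_id not in seen:
            seen.add(record_id)
            unique.append(record_id)
    return unique


sample_ids = [3, 1, 3, 2, 1, 4]  # hypothetical record IDs
list_result = dedupe_with_list(sample_ids)
set_result = dedupe_with_set(sample_ids)
```

On a handful of records the difference is invisible; on millions of records it can decide whether a job finishes in time.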
Skill 2: Storage, storage, storage
An understanding of storage systems is a must-have skill for all data engineers; after all, data is useless unless it can be retrieved for analysis. Familiarize yourself with the various database technologies (relational databases, NoSQL databases, data warehousing, etc.) and their differences. Understand how to model various datasets and the trade-offs of using one technology over another. Most importantly, understand how to interact with a database (hint: typically SQL or a SQL variant) to retrieve data. Data storage is a large component of the data engineering role, so the more you know about data storage, the better.
Example skills include: Cassandra, PostgreSQL, Hive, etc.
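To make "interacting with a database" concrete, here is a minimal sketch using Python's built-in sqlite3 module. It uses an in-memory SQLite database as a stand-in for whatever system you end up working with, and the orders table, its columns, and the sample rows are all hypothetical.

```python
import sqlite3

# In-memory SQLite database; the table and its contents are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 35.5), ("alice", 12.5)],
)
conn.commit()

# Retrieve the total spend per customer with a plain SQL aggregate query.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

print(totals)
```

The same pattern (connect, execute SQL, fetch rows) carries over to most relational databases, although the driver and connection details differ.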
Skill 3: Data movement
As with storage, data movement is a must-have skill for all data engineers. A data engineer is primarily responsible for moving data from point A to point B. We typically refer to this as Extract, Transform, and Load (ETL) processes:
Extraction: grabbing, or extracting, data from the data source
Transformation: changing the shape of the data based on a set of business logic
Load: putting, or loading, the data into the data sink
NOTE: this can sometimes be referred to as ELT if the load step takes place prior to the transformation step.
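The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CSV text stands in for a real data source, an in-memory SQLite database stands in for a real data sink, and the column names, the payments table, and the "drop rows with unparseable amounts" rule are invented business logic.

```python
import csv
import io
import sqlite3

# Hypothetical raw data; in practice this might come from a file, an API, or a queue.
RAW_CSV = "username,amount_cents\nalice,1250\nbob,300\nalice,abc\n"


def extract(source):
    """Extract: read rows from the data source as dictionaries."""
    return list(csv.DictReader(io.StringIO(source)))


def transform(rows):
    """Transform: convert cents to dollars and skip rows that cannot be parsed."""
    clean = []
    for row in rows:
        try:
            cents = int(row["amount_cents"])
        except ValueError:
            continue  # made-up business rule: drop malformed amounts
        clean.append((row["username"], cents / 100))
    return clean


def load(rows, conn):
    """Load: write the transformed rows into the data sink and return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments (username TEXT, amount_dollars REAL)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]


conn = sqlite3.connect(":memory:")
loaded = load(transform(extract(RAW_CSV)), conn)
print(loaded)
```

In an ELT variant, the raw rows would be loaded first and the transformation would run inside the sink, typically as SQL.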
Data movement is a bit nuanced. The "how" varies from company to company. Some companies prefer to build their own in-house software for managing data movement, while others rely on third-party software. However, most companies, whether they use third-party software or not, require data engineers to have a solid foundation in Python and Bash.
Example skills include: Python, Bash, Scala, Talend, Informatica, etc.
Skill 4: Orchestration
New data is constantly being generated, so data engineers need a way to "refresh" stale data. One way to do this is through real-time processing. However, the more common way (at least for now) is batch processing. Typically, a data engineer would not want to run a pipeline manually, so a batch job is scheduled. A batch job may include multiple data pipelines with interdependencies, so orchestration becomes essential. Data orchestration is typically handled by scheduling software that lets an engineer define those interdependencies so that, for example, job A always runs before job B. The software itself can be an in-house tool or third-party software, but, as a data engineer, you will inevitably encounter some form of job scheduling software, so it is important to familiarize yourself with the concepts.
Example skills include: Airflow, Python, batch, etc.
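The core idea behind orchestration, running jobs in an order that respects their dependencies, can be sketched with Python's standard library (graphlib, available in Python 3.9 and later). This is not how Airflow or any particular scheduler is implemented; the job names and the run_in_order helper are hypothetical, and a real orchestrator adds scheduling, retries, and monitoring on top.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each key maps to the set of jobs that must finish before it.
dependencies = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_orders_customers": {"extract_orders", "extract_customers"},
    "daily_report": {"join_orders_customers"},
}


def run_in_order(deps, run_job):
    """Run every job once, never starting a job before its dependencies."""
    order = list(TopologicalSorter(deps).static_order())
    for job in order:
        run_job(job)
    return order


executed = []  # stands in for actually running each pipeline
order = run_in_order(dependencies, executed.append)
print(order)
```

If the dependencies contain a cycle (job A needs job B and job B needs job A), TopologicalSorter raises a CycleError, which mirrors why orchestration tools require the dependency graph to be acyclic.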
Okay, I learned all of these skills, but I am not a software engineer…
Not a problem! Most, if not all, jobs are data-driven. Reframe your current role as something that aligns with data engineering. Find metrics that you can store and, later, analyze via a data pipeline. If all else fails, start a side project. There is an enormous number of open source data sets that you can use to practice loading data into a database and moving the data around in an automated fashion. Skills learned for data engineering can benefit most roles and set you on a path to becoming a data engineer!
Good luck!