
The Maturity of Data Engineers

Does the rise of data science and machine learning affect the role of data engineers?

Photo by Campaign Creators on Unsplash


What does a data engineer do?

Let’s start with three key terms that we need to understand before diving into what a data engineer does.

Data mining, Big Data, and data pipeline.

Data mining means pre-processing data and extracting knowledge from it. Big Data refers to datasets with enormous volume and variety, so large that they typically live on cloud platforms such as AWS¹, Azure², or Google Cloud³, which provide the fleets of machines and the storage needed to hold them.

Normally, Big Data is not stored on one machine, simply because the dataset outgrows it. Keeping data in a database like MySQL or Postgres becomes complicated once it no longer fits on a single machine. New technologies were invented to solve this problem, such as Hadoop⁴ and NoSQL⁵ databases.
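As a rough illustration of working with data spread across many machines, a framework like Apache Spark (which runs on top of Hadoop-style storage) lets you treat distributed files as one table. This is only a minimal sketch; the HDFS path and column name below are hypothetical:

```python
# A minimal PySpark sketch: reading a dataset too large for one machine
# from distributed storage. The path and column name are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("big-data-example").getOrCreate()

# Spark splits the files across the cluster and processes them in parallel.
events = spark.read.parquet("hdfs:///data/events/")  # hypothetical path
events.groupBy("country").count().show()             # hypothetical column

spark.stop()
```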

A data pipeline is, essentially, the plumbing a data engineer builds. To extract information from data using data mining, data engineers build a pipeline that lets data flow from a large, raw, poorly understood collection into a more useful form.

Data engineers essentially create a data pipeline where all the information comes from different sources, such as IoT devices, mobile applications, web apps, cameras, and cars: anything that collects data and logs it to servers or to the cloud.

Data engineers accumulate all this information into nicely packed databases and storage engines so that different parts of the company can create visualizations. They can monitor the performance of their product, get business insights, make business decisions from this data, and even use this data in their apps, for example, for user profiles.

Before a company looks for a data scientist, machine learning expert, business intelligence analyst, or data analyst, it needs to hire a data engineer to build the pipeline. Data engineers bring in all the information, organized and ready for data modeling; they own the data collection part. Usually, a machine learning engineer or data scientist doesn’t have to worry about the data pipeline.


In practice, data engineers start with a process called data ingestion, which collects data from various sources and lands it in what we call a data lake. A data lake is a collection of raw data. However, we don’t want the lake to overflow or dry up. We need to perform something called data transformation, which converts data from one format to another, usually into something we call a data warehouse.
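As a minimal sketch of data ingestion, the snippet below takes raw JSON events and drops them, unmodified, into an S3-backed data lake using boto3. The bucket name and key layout are assumptions for illustration:

```python
# A minimal ingestion sketch: raw events land in the data lake as-is.
# The bucket name and key layout are hypothetical.
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def ingest(event: dict, bucket: str = "my-data-lake") -> None:
    """Write one raw event into the lake, partitioned by date."""
    now = datetime.now(timezone.utc)
    key = f"raw/events/{now:%Y/%m/%d}/{now.timestamp()}.json"
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(event))

ingest({"user_id": 42, "action": "click", "page": "/pricing"})
```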

A data warehouse is a place that keeps accessible data that is useful to the business. Before placing data into a data warehouse, data engineers look at the raw data, take the parts that are useful, and put those into the warehouse so that other parts of the business can use them. We can think of a data lake as a pool of raw data: data lakes are usually less organized and less filtered than something like a data warehouse.
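Here is a hedged sketch of that transformation step: read raw data, keep only the useful columns, and load the result into a warehouse table. Pandas and SQLAlchemy stand in for a real transformation framework; the connection string, table, and column names are hypothetical:

```python
# A sketch of data transformation: raw lake data -> cleaned warehouse table.
# Connection string, table, and column names are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

# Extract: read raw lake data (a local file here, for simplicity).
raw = pd.read_json("raw_events.json", lines=True)

# Transform: keep only the columns the business needs, drop bad rows.
clean = (
    raw[["user_id", "action", "page"]]
    .dropna()
    .assign(action=lambda df: df["action"].str.lower())
)

# Load: write the structured result into the warehouse.
engine = create_engine("postgresql://user:pass@warehouse-host/analytics")
clean.to_sql("page_events", engine, if_exists="append", index=False)
```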

The question is, why would businesses want to do that?

It’s a lot easier to analyze data when it’s organized. The data lake may hold data we don’t need, but we save storage space in the data warehouse because we don’t have to store everything, only the structured data that matters. Building data warehouse infrastructure is expensive, so this kind of data management saves money.

To review: a data engineer builds the pipeline that carries data from production and capture, using data engineering practices, so that the data can then be analyzed by data scientists and data analysts.


What kind of tools do data engineers use?

You may have heard of Apache Kafka⁶, Hadoop⁴, Amazon S3¹, or Azure Data Lake². These are systems built by engineers to carry and hold large amounts of data, like a data lake. There are also tools such as Google BigQuery², Amazon Redshift¹, and Amazon Athena¹. These are data warehouses that allow engineers to query and analyze structured data.
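For instance, querying a warehouse like BigQuery from Python takes only a few lines with the official client library. This is just a sketch; the project, dataset, and table names are hypothetical:

```python
# A minimal BigQuery query sketch; project, dataset, and table names
# are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()  # uses your default GCP credentials

sql = """
    SELECT page, COUNT(*) AS visits
    FROM `my-project.analytics.page_events`
    GROUP BY page
    ORDER BY visits DESC
    LIMIT 10
"""

# The warehouse does the heavy lifting; we get back a small result set.
for row in client.query(sql).result():
    print(row.page, row.visits)
```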

In this whole system, we’ve seen that the data engineer creates the entire pipeline for the business, using different tools and programs to ingest data and put it into a data lake or a data warehouse. As a data scientist or machine learning expert, which data do you use? Most of the time, you would be working with the data lake, because in machine learning, the more information you have, the better.

With machine learning, you can use structured or unstructured data: go into a data lake and grab a bunch of data for your models, whether in CSV or any other format. Data warehouses, by contrast, are usually used by business intelligence people, business analysts, or data analysts to visualize or analyze data, because a warehouse holds more structured data that has already been cleaned.
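As a sketch of that workflow, the snippet below lists CSV files under a lake prefix and concatenates them into one training DataFrame. The bucket and prefix are hypothetical:

```python
# A sketch of pulling training data straight from the lake.
# Bucket and prefix are hypothetical.
import boto3
import pandas as pd

s3 = boto3.client("s3")
bucket, prefix = "my-data-lake", "raw/events/"

# List every CSV object under the prefix and load it with pandas.
frames = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(".csv"):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"]
            frames.append(pd.read_csv(body))

training_data = pd.concat(frames, ignore_index=True)
print(training_data.shape)
```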

As a data scientist, you can also use data from a data warehouse. This isn’t a hard rule; you use whatever data is useful to you, and data scientists take as much valuable data as they can get. In contrast, a business intelligence person or data analyst works with data already cleaned and processed by a data engineer, typically in a data warehouse.

Something like Google BigQuery² does precisely that: it allows somebody without much engineering or programming experience to analyze data in a data warehouse. Typically, software engineers, software developers, app developers, and mobile developers build the programs and apps that users and customers use. A data engineer then builds the piping to ingest that data and store it in services like Hadoop⁴ or Google BigQuery², so the rest of the business can access it.

We also have data scientists, who use the data lake to extract information and deliver business value. Finally, we have data analysts and business intelligence people, who use the data warehouse, with its structured data, to derive business value.

Nowadays, the industry is evolving fast, and there’s some overlap: job descriptions may differ from one company to another. But these are general, simplified rules you can use to understand how each role fits into a company.


Conclusion

Data engineers have three main tasks. First, they build an extract-transform-load pipeline, also known as ETL. Unlike plain data ingestion, which just moves data from one place to another, an ETL pipeline extracts the data generated by all of these systems, transforms it into a useful form, and loads it into a data warehouse, so the data can be used by the rest of the company. Data engineers use programming languages like Python, Go, Scala, and Java to accomplish these ETL jobs.
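Putting those three steps together, here is a deliberately simple end-to-end ETL sketch in Python, using only the standard library; sqlite3 stands in for a real warehouse, and the file path and field names are hypothetical:

```python
# A minimal ETL sketch: extract raw rows, transform them, load them
# into a warehouse table. sqlite3 stands in for a real warehouse;
# the file path and field names are hypothetical.
import csv
import sqlite3

def extract(path: str) -> list[dict]:
    """Extract: read raw rows generated by some upstream system."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows: list[dict]) -> list[tuple]:
    """Transform: keep useful fields, normalize, drop incomplete rows."""
    return [
        (r["user_id"], r["action"].lower())
        for r in rows
        if r.get("user_id") and r.get("action")
    ]

def load(rows: list[tuple], db: str = "warehouse.db") -> None:
    """Load: write the cleaned rows into the warehouse table."""
    with sqlite3.connect(db) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS events (user_id TEXT, action TEXT)")
        conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

load(transform(extract("raw_events.csv")))
```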

Next, data engineers also build analysis tools to understand how the company’s systems work. A data engineer needs to make sure that when any part of the system breaks, somebody is notified. Data engineers give data scientists, data analysts, and business intelligence people the tools to analyze the data, and ensure that the system they’ve put in place is running correctly.
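A hedged sketch of that idea: wrap each pipeline step so a failure is logged and pushed to a notification channel. The webhook URL below is a placeholder for whatever alerting service a team actually uses:

```python
# A sketch of pipeline monitoring: if a step fails, notify the team.
# The webhook URL is a placeholder, not a real endpoint.
import logging

import requests

logging.basicConfig(level=logging.INFO)
WEBHOOK = "https://hooks.example.com/data-team-alerts"  # placeholder

def run_step(name, fn, *args):
    """Run one pipeline step; log success, alert on failure."""
    try:
        result = fn(*args)
        logging.info("step %s succeeded", name)
        return result
    except Exception as exc:
        logging.exception("step %s failed", name)
        requests.post(WEBHOOK, json={"text": f"Pipeline step {name} failed: {exc}"})
        raise
```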

Finally, their third main task is to maintain the data warehouse and data lakes, making sure everything in them stays accessible to other parts of the company.

Now you have a high-level overview of what a data engineer does. However, this landscape is changing fast because new tools are always popping up. So my advice: don’t take these as the absolute must-knows for all data engineers; instead, simply be aware that they exist. Furthermore, it looks like the role of data engineers will be replaced by data scientists. Go and read some of the documentation, and only learn or use a tool once the need arises, because these tools are regularly updated and the world of data engineering is fast-paced right now.


About the Author

Wie Kiang is a researcher who is responsible for collecting, organizing, and analyzing opinions and data to solve problems, explore issues, and predict trends.

He is working in almost every sector of Machine Learning and Deep Learning. He is carrying out experiments and investigations in a range of areas, including Convolutional Neural Networks, Natural Language Processing, and Recurrent Neural Networks.


References
1. Amazon Web Services
2. Microsoft Azure
3. Google Cloud
4. Apache Hadoop
5. NoSQL
6. Apache Kafka
