Data Engineering — Complete Reference Guide From A-Z [2019]

Yan Parker
12 min read · Jul 29, 2019


1. Data Engineering — Fast start

‘A scientist can discover a new star, but he cannot make one. He would have to ask an engineer to do it for him.’ — Gordon Lindsay Glegg

Nowadays everybody wants to be a Data Scientist. What about Data Engineers? Do not take it personally, but Data Scientists are only as good as the data they are given. Since companies store their data in a variety of formats across databases and text files, the key role of Data Engineers is to build data workflows, pipelines, and ETL processes that prepare and transform that data, making the Data Scientists’ jobs more effective.

source: Thomas Henson @henson_tm

More on this topic:

Data Engineering 101 for Dummies like Me

Data engineering: A quick and simple definition

Who Is a Data Engineer & How to Become a Data Engineer?

Data Engineers: The best friends of Data Scientists you forgot to hire.

What is a Data Engineer?

2. Data Engineering vs. Data Science

source: https://twitter.com/jessetanderson/status/1115618459725979649

‘Data engineers are the plumbers building a data pipeline, while data scientists are the painters and storytellers, giving meaning to an otherwise static entity,’ notes Urthecast’s David Bianco.

These two positions are not interchangeable, although Data Engineers and Data Scientists overlap significantly in skills and responsibilities. The main difference is that each focuses on a different aspect of data utilization.

source: The core competencies of data scientists and data engineers and their overlapping skills. Illustration by Jesse Anderson @jessetanderson and the Big Data Institute @bdi_oxford

In short, Data Engineers are focused on building infrastructure and architecture for data generation, while Data Scientists are focused on advanced mathematics and statistical analysis on that generated data.

The difference in skillsets translates into differences in the languages, tools, and software each role uses. Below is an overview by DataCamp that covers both commercial and open-source alternatives, as well as the area where the two roles overlap.

source: https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer

If you need another way to see the difference between a Data Engineer and a Data Scientist, have a look at the widely shared AI Hierarchy of Needs by Monica Rogati.

Data Engineering covers the first 2–3 stages, while Data Science covers stages 4 and 5.

based on: https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007

Data Engineers are just as important as Data Scientists but tend to be less visible because they work further away from the final analysis product. According to Dataquest, the relationship is like that between a race car builder and a race car driver.

Data Scientist vs. Data Engineer

More on this topic:

Data Scientist vs Data Engineer

Data Engineer VS Data Scientist

Why a data scientist is not a data engineer

Data engineers vs. data scientists

3. Data Engineering Skills

source: https://i.pinimg.com/originals/5e/23/93/5e23938b01f90a79eb73c3e69f446d64.jpg

Data Engineers are the data professionals who prepare the ‘big data’ infrastructure that Data Scientists analyze. They design and build systems that integrate data from various sources, write complex queries against that data, and make sure it is easily accessible and that everything works smoothly. Their goal is to optimize the performance of their company’s big data ecosystem.

That’s why a good Data Engineer, apart from knowing the tools and languages necessary to do the job, should first and foremost be:

✔ An excellent problem solver

✔ Multi-disciplined

✔ Team-oriented and collaborative

✔ Curious and always learning

Now, let’s have a look at the top 8 skills a Data Engineer needs to acquire:

source: https://youtu.be/Pym0oriyQgM
  1. Tools and Components of Data Architecture: knowledge of how to build complex database systems for companies. The term also covers the processes that deal with data at rest, data in motion, and data sets, and how they relate to data-dependent applications and processes.
  2. Big Data Frameworks/Hadoop-based Technologies: the Hadoop ecosystem includes several tools that serve different purposes and professionals from different backgrounds: HDFS (Hadoop Distributed File System), YARN, MapReduce, Pig & Hive, Flume & Sqoop, ZooKeeper, Oozie.
source: https://www.edureka.co/blog/big-data-engineer-skills/

3. Real-time Processing Framework: Apache Spark, a distributed real-time processing framework that can be easily integrated with Hadoop, leveraging HDFS.

4. Heavy, In-Depth Database Knowledge, both SQL (e.g. MySQL) and NoSQL (e.g. HBase, Cassandra and MongoDB): Structured Query Language is used to structure, manipulate, and manage data stored in relational databases, while NoSQL databases can store large volumes of structured, semi-structured, and unstructured data with a flexible schema that can change quickly as application requirements evolve.

5. Coding Skills: Python, C/C++, Java, Perl, Golang, or other such languages.

6. ETL/Data Warehousing Solutions (e.g. Talend, Informatica): when you manage a huge amount of data from heterogeneous sources, you need to apply ETL (Extract, Transform, Load). Data warehousing helps you aggregate data from one or more sources so it can be compared and analyzed to support better business decisions.

7. Machine Learning: although machine learning is technically the Data Scientist’s domain, having some understanding of how to put data to use through statistical analysis and data modeling is a huge advantage.

8. Solid Knowledge of Operating Systems: extensive knowledge of operating systems such as UNIX, Linux, Solaris, or MS Windows can be very helpful, as most of the tools run on these systems.
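
To make skills 4 and 6 above a little more concrete, here is a minimal sketch of an ETL step in Python using only the standard library. It is an illustration under stated assumptions, not a production pattern: the sample data, the `orders` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export. In practice this would come from a file,
# an API, or another database.
RAW_CSV = """order_id,customer,amount,currency
1,alice,10.50,usd
2,bob,,usd
3,carol,7.25,USD
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing amount, normalize types and casing."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append(
            (
                int(row["order_id"]),
                row["customer"].title(),
                float(row["amount"]),
                row["currency"].upper(),
            )
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a SQL table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def run_pipeline(conn):
    """Run extract -> transform -> load, then return (row count, total amount)."""
    load(transform(extract(RAW_CSV)), conn)
    return conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(run_pipeline(connection))
```

Keeping extract, transform, and load as separate functions means each stage can be tested or swapped on its own, which is the same idea the dedicated ETL tools listed above apply at a much larger scale.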

More on this topic:

A Beginner’s Guide to Data Engineering — part I, II and III

Big Data Engineer Skills: Skills Required To Become A Big Data Engineer

O’Reilly’s Suite of Free Data Engineering E-Books

4. When Does My Team Need a Data Engineer?

Tristan Handy (@jthandy), Founder & CEO of Fishtown Analytics, says that the role of the Data Engineer in a startup data team is changing rapidly, which affects when Data Engineers should be hired. Previously we needed Data Engineers first, ‘because data analysts and scientists had nothing to work with if there wasn’t a data platform in place. Today, data analysts and scientists should self-serve and build the first version of their data stack using off-the-shelf tools’.

Handy’s advice is to hire Data Engineers as you start hitting scale points:

based on https://blog.fishtownanalytics.com/does-my-startup-data-team-need-a-data-engineer-b6f4d68d7da9

5. Building Data Team

based on https://marketoonist.com/2014/01/big-data.html

Data-driven teams consist of Data Engineers, Data Scientists, and Data Analysts. While these titles may sound similar, each role focuses on a different aspect of data utilization. It is important to realize how these roles complement each other. Having a Data Scientist do the job of a Data Engineer, or vice versa, is a waste of precious resources. Finding a unicorn, one person who is skilled at both Data Engineering and Data Science, also seems impossible.

source: https://twitter.com/kdnuggets/status/1058015837330784256?s=20

Keith McNulty @dr_keithmcnulty shares his experience and describes the process of building an analytics team in 6 steps. These steps advise on how the team should operate and be structured, what skills are required, and what types of profiles and skillsets should join the team.

Michael Kaminsky draws our attention to the fact that the roles and responsibilities of data engineers, analysts, and data scientists are changing.

Data Engineer previously:

Data engineers: traditionally, this has been a ‘plumbing’ job of moving bytes from point A to point B, typically misnomered simply as ‘ETL’. They were concerned with building robust and scalable infrastructure for ingesting and storing data, but generally did not concern themselves with ‘business logic’— once the data were in the warehouse, it wasn’t their problem anymore.

Data Engineer today:

Data engineers: still responsible for data infrastructure and plumbing code, but the team is now generally much smaller than it was in the past. Many companies can get by just using contractors and consultants in the beginning, and they may only need one or two data engineers to ‘fill in the gaps’ of what they can’t purchase from off-the-shelf solutions.

More on this topic:

How To Build A Big Data Engineering Team

How to build an analytics team for impact in an organization

Building Data Science Teams: What Do You Need to Know?

6. Are Data Engineers Expected to also be Data Analysts?

source: https://qr.ae/TWnwFd

7. Data Engineering Job Description

source: https://www.equest.com/cartoons/cartoons-2013/thank-you-facebook/

Whether you are an aspiring Data Engineer or looking to hire one, it is always a good habit to learn from good examples. According to Toptal, ‘the actual definition of Data Engineer’s role varies, and often mixes with the Data Scientist role’. They share a Big Data Engineer job description and ad template that you can use either to create a job announcement or simply to review the skills commonly required for this position. You will find another template at Glassdoor.

Nevertheless, here’s what Data Engineering roles typically demand:

  • Build, test, and maintain optimal data pipeline architecture
  • Assemble large, complex data sets to meet both functional and non-functional business demands
  • Build the infrastructure necessary for optimal extraction, transformation, and loading of data (from a variety of sources leveraging AWS and SQL technologies)
  • Identify, design, implement and enhance internal processes
  • Automate manual processes
  • Optimize data delivery
  • Re-design infrastructure for greater scalability
  • Build analytics tools that utilize data pipelines to deliver actionable insights
  • Work with all stakeholders across departments
  • Assist data scientists in building and optimizing products

Another useful tool when screening candidates for roles within data teams might be Hackerrank’s [Checklist] Screening Data Scientists vs. Analysts vs. Engineers. ‘The nuances between Data Analysts, Data Scientists, Data Engineers may seem minute at first, but each has a distinct role to play in deriving and conveying meaningful insights from data’. The checklist makes a great overview of what to expect from each respective role.

source: https://twitter.com/hackerrank/status/1063944936440115200?s=20

8. Data Engineering Jobs

source: https://vublsts.wordpress.com/2016/09/28/anyware-privacy-and-location-data-in-the-era-of-machine-learning/

While the Harvard Business Review may have declared ‘Data Scientist: The Sexiest Job of the 21st Century,’ it is the Data Engineering team that allows them to shine. Without the Data Engineering support, the sexy Data Scientist job will quickly devolve into something about as sexy as a street sweeper. — Bill Schmarzo @schmarzo, CTO Hitachi Vantara.

Because demand for Data Engineers is greater than supply, finding a job in this field is not a problem. There are hundreds of listings to scroll through on websites like Indeed, LinkedIn, Glassdoor, Datajobs, ZipRecruiter, Angel, Stack Overflow, etc.

A much bigger problem than finding a job in Data Engineering is the one growing startups face in finding candidates. Some of them still need to hire on-site employees, ‘however hiring in big cities like London, New York, Berlin is hard, since for the same limited talent pool there are also fighting corporations having big budgets. Even if you finally manage to hire great techies they will be on the target of other hungry startups crazily looking for devs. And what if you lose your engineers at the peak of product development? Both investors and you would not be happy’. I encourage you to read the whole article by Rob Renner from @JavaShopPoland, Why Fintech Startups Should Build Their Development Teams In Poland?, together with Data engineers are there, can you see them? by Guillaume Payen from Moonshots.

9. Data Engineering Salary

https://pl.pinterest.com/pin/842665780247850785/

Data Engineers’ salaries depend on variables such as the type of role, skills, experience, and location. According to Glassdoor, the global average salary for a Data Engineer is $116,591 per year, while Payscale puts the same figure at $91,845 per year.

According to Payscale ‘skills in Apache Spark, Hadoop, AWS and ETL are correlated to pay that is above average. Skills that pay less than market rate include Data Analysis and SQL’.

source: payscale.com

Below is a short overview of the average Data Engineer salary per year in a few locations:

based on payscale.com and neuvoo.ch

10. Data Engineering Interview Questions

source: https://media.giphy.com/media/b7MdMkkFCyCWI/giphy.gif

If you are looking for a job related to Data Engineering, you need to prepare for the interview questions you might face during a meeting or a call.

We recommend a few self-study pages that are well worth visiting before that conversation.

  1. Data Engineer Interview Questions: a list of the top 10 Data Engineer interview questions and answers, updated for 2019
  2. Things you should know when traveling via the Big Data Engineering hype-train: a really interesting article by Wojtek Pituła @Krever01
  3. The Interview Study Guide For Data Engineers: comes with a practical, downloadable Data Engineering Interview Checklist
  4. Facebook Data Engineer Interview Questions & Amazon Data Engineer Interview Questions: real-life examples of interview questions asked at Facebook and Amazon
  5. 30+ Best Data Engineer Interview Questions To Get Hired With: another set of 33 Q&As
  6. 1001 Data Engineering Interview Questions by Andreas Kretz: also available on GitHub as a PDF [from page 111].

12. Data Engineering [Online] Courses

source: canva.com

There are many online courses offering significant training in this field:

Some sites, such as DataCamp, Memrise and EdX, are heavily focused on data science and engineering, while others, such as Udemy and Galvanize, have a wider scope. Your choice of course provider will depend on your needs, your level of knowledge, and your budget.

TIP: Although a course can help you expand your knowledge, it may only offer you a certificate or diploma rather than a certification. Therefore, do not treat a course as a replacement for an actual certification or an accredited diploma.

More on this topic:

Learn Data Engineering: My Favorite Free Resources

Want to Become a Data Engineer? Here’s a Comprehensive List of Resources to get Started

13. Data Engineering Podcasts

Constant innovations can overwhelm anyone, but in tech, if you’re not learning, you’re becoming obsolete. Therefore, to stay up to date with what is going on in the data world, try listening to these 5 podcasts:

  1. Data Skeptic
  2. The O’Reilly Data Show
  3. Digital Analytics Power Hour
  4. Data Engineering Podcast
  5. Data Stories

However, if you prefer watching to just listening, go to Webcasts / Webinars on AI, Analytics, Big Data, Data Science, & Machine Learning by KDnuggets, an up-to-date list of webinars, or watch Thomas Henson’s YouTube channel.

14. Data Engineering Case Studies

Andreas Kretz has put together a comprehensive collection of case studies in his Data Engineering CookBook. There you will find a great number of examples from companies like Twitter, Netflix, Amazon, Uber, Airbnb, and many other prominent players. Some of them are also available on YouTube. Download the PDF and follow the table of contents to find the resources you need.

source: https://twitter.com/KirkDBorne/status/1153447958895091712?s=20

15. Data Engineering Conferences 2019

source: https://twitter.com/Khanoisseur/status/556881894366003200?s=20

KDnuggets, one of the biggest websites in the Big Data and data science space, provides a complete and updated list of Meetings and Conferences on AI, Analytics, Big Data, Data Mining, Data Science, & Machine Learning, broken down by location in America, Europe, and Asia.

If you come across an even more comprehensive list of conferences, please share it in a comment.

16. Top 10 Data Engineering Mistakes

Other suggestions?

Written by Yan Parker

Helping better understand your data #DataEngineering #DataScience #ETL #datapipelines #ApacheSpark #Scala #MachineLearning #AI #DataDriven | polystat.io
