Data Engineering — Complete Reference Guide From A-Z [2019]
1. Data Engineering — Fast start
‘A scientist can discover a new star, but he cannot make one. He would have to ask an engineer to do it for him.’ — Gordon Lindsay Glegg
Nowadays everybody wants to be a Data Scientist. What about Data Engineers? Do not take it personally, but Data Scientists are only as good as the data they are provided with. Since companies store their data in a variety of formats across databases and text files, the key role of Data Engineers is to build data workflows, pipelines, and ETL processes that prepare and transform data for Data Scientists, making their jobs more effective.
Data Engineering 101 for Dummies like Me
Data engineering: A quick and simple definition
Who Is a Data Engineer & How to Become a Data Engineer?
Data Engineers: The best friends of Data Scientists you forgot to hire.
2. Data Engineering vs. Data Science
‘Data engineers are the plumbers building a data pipeline, while data scientists are the painters and storytellers, giving meaning to an otherwise static entity,’ notes Urthecast’s David Bianco.
These two positions are not interchangeable. However, there is a significant overlap between Data Engineers and Data Scientists when it comes to skills and responsibilities. The main difference lies in their FOCUS on a different aspect of data utilization.
In short, Data Engineers are focused on building infrastructure and architecture for data generation, while Data Scientists are focused on advanced mathematics and statistical analysis on that generated data.
The difference in skillsets translates into differences in the languages, tools, and software that each role uses. Below is a great overview by DataCamp, including both commercial and open-source alternatives and the area where the two roles overlap.
If you need another explanation of the difference between a Data Engineer and a Data Scientist, have a look at the widely shared AI Hierarchy of Needs by Monica Rogati.
Data Engineering covers the first 2–3 stages, while Data Science covers stages 4 and 5.
Data Engineers are just as important as Data Scientists but seem to be less visible because they tend to be further away from the final analysis product. According to Dataquest, the relationship is like that between a race car builder and a race car driver.
Data Scientist vs Data Engineer
Data Engineer VS Data Scientist
3. Data Engineering Skills
Data Engineers are the data professionals who prepare the ‘big data’ infrastructure to be analyzed by Data Scientists. They design, build, and integrate data from various sources, write complex queries on that data, and make sure it is easily accessible and works smoothly. Their goal is to optimize the performance of their company’s big data ecosystem.
That’s why a good Data Engineer, apart from knowing the tools and languages necessary to do this job, should first and foremost be:
✔ An excellent problem solver
✔ Multi-disciplined
✔ Team-oriented and collaborative
✔ Curious and always learning
Now, let’s have a look at Top 8 Skills a Data Engineer needs to acquire:
1. Tools and Components of Data Architecture: knowledge of how to build complex database systems for companies. The term also covers processes that address data at rest, data in motion, data sets, and how they relate to data-dependent applications and processes.
2. Big Data Frameworks/Hadoop-based Technologies: the Hadoop ecosystem contains several tools that cater to different purposes and to professionals from different backgrounds: HDFS (Hadoop Distributed File System), YARN, MapReduce, Pig and Hive, Flume and Sqoop, ZooKeeper, Oozie.
3. Real-time Processing Framework: Apache Spark — a distributed real-time processing framework which can be easily integrated with Hadoop leveraging HDFS.
4. Heavy, In-Depth Database Knowledge — SQL (e.g. MySQL) and NoSQL (e.g. HBase, Cassandra and MongoDB): Structured Query Language is used to structure, manipulate & manage data stored in databases, while NoSQL databases can store large volumes of structured, semi-structured & unstructured data with quick iteration and agile structure as per application requirements.
5. Coding Skills: Python, C/C++, Java, Perl, Golang, or other such languages.
6. ETL/Data Warehousing Solutions (e.g. Talend, Informatica): when managing a huge amount of data from heterogeneous sources, you need to apply ETL (Extract, Transform, Load). Data warehousing helps you aggregate unstructured data from one or more sources so that it can be compared and analyzed to support better business decisions.
7. Machine Learning: Although machine learning is technically something assigned to the Data Scientist, having some level of understanding of how to put the data into use using statistical analysis and data modeling is a huge advantage.
8. Solid Knowledge of Operating Systems: extensive knowledge of operating systems such as UNIX, Linux, Solaris or MS Windows can be very helpful, as most of the tools are based on these systems.
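To make the ETL and SQL skills from the list above more concrete, here is a minimal sketch of an Extract, Transform, Load flow in Python. It is only an illustration under stated assumptions: the sample records, the `orders` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for a real data warehouse. A production pipeline would typically rely on dedicated tools such as those named above.

```python
import json
import sqlite3

# Hypothetical raw input: semi-structured records, one JSON document per line,
# with inconsistent fields, as they might arrive from a log file or document store.
RAW_EVENTS = """\
{"user": "alice", "amount": "19.99", "country": "pl"}
{"user": "bob", "amount": "5", "country": "DE", "coupon": "WELCOME"}
{"user": "carol", "amount": null, "country": "pl"}
"""


def extract(raw_text):
    """Extract: parse each non-empty line into a dict."""
    return [json.loads(line) for line in raw_text.splitlines() if line.strip()]


def transform(records):
    """Transform: drop rows without an amount, cast types, normalize country codes."""
    rows = []
    for record in records:
        if record.get("amount") is None:
            continue  # skip incomplete records
        rows.append(
            (record["user"], float(record["amount"]), record["country"].upper())
        )
    return rows


def load(rows, connection):
    """Load: write the cleaned rows into a relational table (hypothetical schema)."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS orders (customer TEXT, amount REAL, country TEXT)"
    )
    connection.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    connection.commit()


def run_pipeline(raw_text=RAW_EVENTS):
    """Run all three steps against an in-memory SQLite database, then query it."""
    connection = sqlite3.connect(":memory:")
    load(transform(extract(raw_text)), connection)
    totals = connection.execute(
        "SELECT country, SUM(amount) FROM orders GROUP BY country ORDER BY country"
    ).fetchall()
    connection.close()
    return totals


if __name__ == "__main__":
    # Prints the total order amount per country as (country, total) pairs.
    print(run_pipeline())
```

Keeping the three steps in separate functions means each one can be tested or replaced on its own, for example by swapping the SQLite connection for a real warehouse connection.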
A Beginner’s Guide to Data Engineering — part I, II and III
Big Data Engineer Skills: Skills Required To Become A Big Data Engineer
4. When Does My Team Need a Data Engineer?
Tristan Handy @jthandy Founder & CEO @ Fishtown Analytics says that the role of the Data Engineer in a startup data team is changing rapidly which impacts the sequence of Data Engineer hires. Previously we needed Data Engineers first, ‘because data analysts and scientists had nothing to work with if there wasn’t a data platform in place. Today, data analysts and scientists should self-serve and build the first version of their data stack using off-the-shelf tools’.
Handy’s advice is to hire Data Engineers as you start hitting scale points:
5. Building Data Team
Data-driven teams consist of Data Engineers, Data Scientists, and Data Analysts. While these titles may sound similar, each role focuses on a different aspect of data utilization. It is important to realize how these roles complement each other. Having a Data Scientist do the job of a Data Engineer, or vice versa, is a waste of precious resources. Also, finding a unicorn, one person who is skilled at both Data Engineering and Data Science, seems impossible.
Keith McNulty @dr_keithmcnulty shares his experience and describes in 6 steps the process of building an analytics team. These steps advise on how the team should operate and be structured, what skills are required, and what types of profiles and skillsets the team should include.
Michael Kaminsky draws our attention to the fact that the roles and responsibilities of data engineers, analysts, and data scientists are changing.
Data Engineer previously:
Data engineers: traditionally, this has been a ‘plumbing’ job of moving bytes from point A to point B, typically misnomered simply as ‘ETL’. They were concerned with building robust and scalable infrastructure for ingesting and storing data, but generally did not concern themselves with ‘business logic’— once the data were in the warehouse, it wasn’t their problem anymore.
Data Engineer today:
Data engineers: still responsible for data infrastructure and plumbing code, but the team is now generally much smaller than it was in the past. Many companies can get by just using contractors and consultants in the beginning, and they may only need one or two data engineers to ‘fill in the gaps’ of what they can’t purchase from off-the-shelf solutions.
How To Build A Big Data Engineering Team
How to build an analytics team for impact in an organization
6. Are Data Engineers Expected to also be Data Analysts?
7. Data Engineering Job Description
Whether you are an aspiring Data Engineer or looking to hire one, it is always a good habit to work from good examples. According to Toptal, ‘the actual definition of Data Engineer’s role varies, and often mixes with the Data Scientist role’. They share their Big Data Engineer: Job Description and Ad Template, which you can use either to create a job announcement or simply to review the skills commonly required for this position. You will find another template at Glassdoor.
Nevertheless, here’s what Data Engineering roles typically demand:
- Build, test, and maintain optimal data pipeline architecture
- Assemble large, complex data sets to meet both functional and non-functional business demands
- Build the infrastructure necessary for optimal extraction, transformation, and loading of data (from a variety of sources leveraging AWS and SQL technologies)
- Identify, design, implement and enhance internal processes
- Automate manual processes
- Optimize data delivery
- Re-design infrastructure for greater scalability
- Build analytics tools that utilize data pipelines to deliver actionable insights
- Work with all stakeholders across departments
- Assist data scientists in building and optimizing products
Another useful tool when screening candidates for roles within data teams might be Hackerrank’s [Checklist] Screening Data Scientists vs. Analysts vs. Engineers. ‘The nuances between Data Analysts, Data Scientists, Data Engineers may seem minute at first, but each has a distinct role to play in deriving and conveying meaningful insights from data’. The checklist makes a great overview of what to expect from each respective role.
8. Data Engineering Jobs
While the Harvard Business Review may have declared ‘Data Scientist: The Sexiest Job of the 21st Century,’ it is the Data Engineering team that allows them to shine. Without the Data Engineering support, the sexy Data Scientist job will quickly devolve into something about as sexy as a street sweeper. — Bill Schmarzo @schmarzo, CTO Hitachi Vantara.
Because demand for Data Engineers is greater than supply, finding a job in this field is not a problem. There are hundreds of job postings to scroll through on websites like Indeed, LinkedIn, Glassdoor, Datajobs, ZipRecruiter, Angel, Stack Overflow, etc.
A much bigger problem than finding a job in Data Engineering is the one growing startups face in finding candidates. Some of them still need to hire on-site employees, ‘however hiring in big cities like London, New York, Berlin is hard, since for the same limited talent pool there are also fighting corporations having big budgets. Even if you finally manage to hire great techies they will be on the target of other hungry startups crazily looking for devs. And what if you lose your engineers at the peak of product development? Both investors and you would not be happy’. I encourage you to read the whole article by Rob Renner from @JavaShopPoland, Why Fintech Startups Should Build Their Development Teams In Poland?, together with this one: Data engineers are there, can you see them? by Guillaume Payen from Moonshots.
9. Data Engineering Salary
Data Engineers’ salaries depend on variables such as the type of role, skills, experience, and location. According to Glassdoor, the global average salary for a Data Engineer is about $116,591 per year, while Payscale puts the same figure at $91,845 per year.
According to Payscale ‘skills in Apache Spark, Hadoop, AWS and ETL are correlated to pay that is above average. Skills that pay less than market rate include Data Analysis and SQL’.
Below is a short overview of the average Data Engineer’s salary per year in a few locations:
10. Data Engineering Interview Questions
If you are looking for a job related to Data Engineering, you need to prepare for the questions you might face during an interview or a call.
Here are a few self-study pages we recommend visiting to prepare for that conversation.
- Data Engineer Interview Questions — the list of top 10 Data Engineer Interview Questions and Answers Updated for 2019
- Things you should know when traveling via the Big Data Engineering hype-train —really interesting article by Wojtek Pituła @Krever01
- The Interview Study Guide For Data Engineers — with a practical Downloadable Data Engineering Interview Checklist
- Facebook Data Engineer Interview Questions & Amazon Data Engineer Interview Questions — covering real-life examples of Facebook and Amazon Data Engineer Interview Questions
- 30+ Best Data Engineer Interview Questions To Get Hired With — another set of 33 Q&A
- 1001 Data Engineering Interview Questions by Andreas Kretz also available on Github in PDF [from page 111].
11. Data Engineering Certification
If you are looking for the most in-demand certifications in Data Engineering, look at this awesome list by Thor Olavsrud @ThorOlavsrud. He provides a comprehensive overview of the top 14 certifications, including price, hosting organization, and how to prepare.
And here are the Top 14 Data Engineer and Data Architect Certifications:
- Amazon Web Services (AWS) Certified Big Data — Specialty
- Cloudera Certified Associate (CCA) Spark and Hadoop Developer
- Cloudera Certified Professional (CCP): Data Engineer
- Google Professional Data Engineer
- HDP Apache Spark Developer
- HDP Certified Developer Big Data Hadoop
- Hortonworks Certified Associate (HCA)
- IBM Certified Data Architect — Big Data
- IBM Certified Data Engineer — Big Data
- MapR Certified Hadoop Developer 1.0
- MapR Certified Spark Developer 2.1
- Oracle Business Intelligence Foundation Suite 11 Certified Implementation Specialist
- SAS Certified Big Data Professional
- SAS Certified Data Scientist Using SAS 9
source: Top 14 data engineer and data architect certifications
Certificates and Certification in Analytics, Big Data, Data Science, Machine Learning
12. Data Engineering [Online] Courses
There are many online courses offering significant training in this field:
Some sites, such as DataCamp, Memrise and edX, are heavily focused on data science and engineering, while others, such as Udemy and Galvanize, have a wider scope. Your choice of course provider will depend on your needs, level of knowledge and budget.
TIP: Although a course can help you expand your knowledge, it may only offer you a certificate or diploma rather than a certification. Therefore, do not treat a course as a replacement for an actual certification or an accredited diploma.
Learn Data Engineering: My Favorite Free Resources
Want to Become a Data Engineer? Here’s a Comprehensive List of Resources to get Started
13. Data Engineering Podcasts
Constant innovations can overwhelm anyone, but in tech, if you’re not learning, you’re becoming obsolete. Therefore, to stay up to date with what is going on in the data world, try listening to these 5 podcasts:
However, if you prefer watching and listening to only listening, go to Webcasts / Webinars on AI, Analytics, Big Data, Data Science, & Machine Learning by KDnuggets, an up-to-date list of webinars, or watch Thomas Henson’s YouTube channel.
14. Data Engineering Case Studies
For case studies, there is a comprehensive collection created by Andreas Kretz in his Data Engineering CookBook. There you will find a great number of examples from companies like Twitter, Netflix, Amazon, Uber, Airbnb, and many other prominent players. Some of them are also available on YouTube. Download the PDF and follow the table of contents to find the required resources.
15. Data Engineering Conferences 2019
KDnuggets, one of the biggest websites in the Big Data and data science space, provides a complete and updated list of Meetings and Conferences on AI, Analytics, Big Data, Data Mining, Data Science, & Machine Learning, broken down by location in America, Europe, and Asia.
If you find an even more comprehensive list of conferences, share it in a comment.
16. Top 10 Data Engineering Mistakes
Other suggestions?
17. Recommended Resources
Top Active Blogs on AI, Analytics, Big Data, Data Science, Machine Learning — updated
Data-Science — Cheat-Sheet/Data Engineering