
What Skills Are Important for a Data Engineering Role at FAANG?

What you need to know before landing a data engineering role at large companies like FAANG

DATA ENGINEERING

Photo by Danil Sorokin on Unsplash. Modified by the author.

As a data engineer who works for a large FAANG company, I frequently hear people ask what skills are important for landing a data engineering job at a well-known tech company. Many people think they need to be fluent in Spark or know everything about Hadoop systems to get a job with these companies. Although that is true for some data engineering roles, it is not for many others inside large companies (including FAANG companies).

In this article, I cover the most important skills you need to land a data engineering job at a large technology company. Before going further, I want to emphasize that this article is aimed more at data engineers who work with analytics teams than at those who work with infrastructure teams.

SQL

SQL is probably the most important skill you must have to land a data engineering role at any company. If someone tells you that SQL is old and no one cares about it anymore, ignore them. Even in large companies like FAANG, SQL is still the language most widely used by data engineers. Remember, to be a universal data engineer, learn standard/ANSI SQL. Many companies work with several different databases and use query engines like Presto to access and query all of them from a single point.
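To make this concrete, here is a minimal sketch of running an ANSI SQL query against a Presto endpoint from Python, using the open-source PyHive client; the host, schema, and table names are hypothetical placeholders.

```python
# A minimal sketch: querying Presto from Python via PyHive.
# Host, port, and table names below are hypothetical placeholders.
from pyhive import presto

conn = presto.connect(host="presto.example.internal", port=8080)
cursor = conn.cursor()

# The same ANSI SQL works regardless of which underlying store Presto federates.
cursor.execute(
    "SELECT user_id, COUNT(*) AS n_events FROM web.events GROUP BY user_id LIMIT 10"
)
for row in cursor.fetchall():
    print(row)
```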

For those who want to learn SQL seriously, my number one recommendation is Vertabelo Academy (disclosure: no affiliation with the author).

Learn SQL anywhere, anytime. | Online Courses by Vertabelo Academy

One more piece of advice for improving your SQL skills: always think about how you can improve the performance of a SQL query. Many people can finish a task with a complex, non-optimized SQL query, but when it comes to large data, writing optimized queries is critical and separates you from SQL beginners.
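As an illustration, here are two versions of the same hypothetical query (all table and column names are made up). On a large partitioned table, filtering early and projecting only the columns you need can sharply reduce how much data the engine scans and shuffles.

```python
# Non-optimized: selects every column and filters only after the join,
# forcing the engine to move far more data than needed.
naive_query = """
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE '2021-01-01'
"""

# Optimized: project only the needed columns and apply the date filter
# before the join so partition pruning can kick in early.
optimized_query = """
SELECT o.order_id, o.order_date, c.customer_name
FROM (
    SELECT order_id, order_date, customer_id
    FROM orders
    WHERE order_date >= DATE '2021-01-01'   -- prune partitions early
) o
JOIN customers c ON o.customer_id = c.customer_id
"""
```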

Python

Perhaps the most important task of a data engineer is building data/ETL pipelines, and many modern data pipelines are written in Python. Python provides a flexible and extensive set of tools for building complex ETL pipelines, and it is the language of many workflow management platforms (see the next section). Most modern data pipelines are a mix of Python scripts and SQL queries. For most data engineering tasks, you should be able to complete the work with Python’s basic tools, objects, and libraries.
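Here is a minimal sketch of what such a mixed Python/SQL step can look like. It uses the standard-library sqlite3 module purely for illustration; the table names are hypothetical, and in a real pipeline the source would typically be a warehouse engine like Presto or Hive.

```python
# A minimal extract-transform-load sketch mixing SQL and Python.
import sqlite3

EXTRACT_SQL = "SELECT user_id, event_type, ts FROM raw_events WHERE ts >= ?"

def run_daily_load(db_path: str, since: str) -> None:
    conn = sqlite3.connect(db_path)
    try:
        # Extract: pull the day's raw rows with a parameterized query.
        rows = conn.execute(EXTRACT_SQL, (since,)).fetchall()

        # Transform: simple aggregation in plain Python.
        counts = {}
        for user_id, event_type, _ts in rows:
            counts[(user_id, event_type)] = counts.get((user_id, event_type), 0) + 1

        # Load: write the aggregated result to a reporting table.
        conn.executemany(
            "INSERT INTO daily_event_counts (user_id, event_type, n) VALUES (?, ?, ?)",
            [(u, e, n) for (u, e), n in counts.items()],
        )
        conn.commit()
    finally:
        conn.close()
```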

Airflow or Workflow Management Platform

Most pipelines need to run on a regular schedule, and monitoring and maintaining them without a dedicated tool is difficult. That is why many large companies use workflow management platforms like Apache Airflow. Some companies have their own proprietary workflow management platform, while others use open-source options like Airflow. Since Airflow is widely used by many technology companies and is very similar to the proprietary tools inside some large companies, it is worth investing time in learning it. You have many options to learn (and even get certified in) Airflow. Here are some options (disclosure: no affiliation with the author), with a minimal DAG sketch after the links.

The Complete Hands-On Introduction to Apache Airflow


Astronomer | The Enterprise Framework for Apache Airflow
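As promised, here is a minimal sketch of an Airflow DAG: one extract task feeding one load task on a daily schedule. The task bodies and names are hypothetical placeholders, and the imports assume Airflow 2.x.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull rows from the source system")

def load():
    print("write transformed rows to the warehouse")

with DAG(
    dag_id="daily_events_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # The >> operator declares the dependency: load runs only after
    # extract succeeds, and Airflow handles scheduling and retries.
    extract_task >> load_task
```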

Fundamentals of Warehousing Systems

A data warehouse is a structured organization of (ideally) all the data available in a company. Using data warehouses, data scientists and decision-makers can answer important business questions and analyze business performance. Data engineers are the ones who build and maintain data warehouses, and interacting with them is an essential part of the job. Therefore, understanding data warehousing fundamentals and knowing the best practices in this domain is vital for data engineers.
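To give a flavor of what this looks like in practice, here is a sketch of a typical star-schema query, with hypothetical fact and dimension tables: a central fact table holds the measures, and the surrounding dimension tables supply the attributes used for filtering and grouping.

```python
# A hypothetical star-schema query: fact_sales is the fact table,
# dim_date and dim_product are dimensions joined on surrogate keys.
star_schema_query = """
SELECT
    d.calendar_month,
    p.product_category,
    SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_date    d ON f.date_key    = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
WHERE d.calendar_year = 2021
GROUP BY d.calendar_month, p.product_category
"""
```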

"Star Schema The Complete Reference" by Christopher Adamson is the best starting point for those who want to learn about data warehousing and best practices (disclosure: no affiliation with the author).

Star Schema The Complete Reference

If you don’t have time to read this excellent book, or if you want some motivation to read it, please read the book summary that I published on Towards Data Science. It covers all the essential basics in a short article.

Fundamentals of Data Warehouses for Data Scientists

Basics of Hadoop, Spark, and Hive

Due to the size of the data (a simple query can involve petabytes), the data warehouses at most large tech companies are based on Hadoop systems. Although some data engineers are familiar with Hadoop and Spark, many of these companies take advantage of Hive (and Presto) to make the warehouse accessible to all data engineers (who are, above all, familiar with SQL). A data engineer is expected to know the fundamentals of Hadoop, Spark, and Hive.

However, many data engineers (especially DEs on the analytics side) don’t need more than basic knowledge to do their everyday job, since systems like Presto and Hive let them interact with many types of databases (including Hadoop systems) through ANSI SQL. Again, if you work as a DE on upstream teams (teams that build infrastructure systems and maintain data warehouses), you might need deeper Hadoop and Spark skills, but for DE roles on analytics teams, basic knowledge should be enough. Here is a book that can help you familiarize yourself with Hive and Hadoop (disclosure: no affiliation with the author), followed by a short PySpark sketch.

Programming Hive: Data Warehouse and Query Language for Hadoop
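To show the level of knowledge that is usually enough on the analytics side, here is a minimal PySpark sketch: open a Hive-enabled Spark session and run plain SQL against a Hive table. The database and table names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("basic-hive-query")
    .enableHiveSupport()   # lets spark.sql() read tables from the Hive metastore
    .getOrCreate()
)

# Plain SQL against a (hypothetical) Hive table; Spark handles the
# distributed execution under the hood.
df = spark.sql(
    "SELECT event_type, COUNT(*) AS n FROM analytics.events GROUP BY event_type"
)
df.show()

spark.stop()
```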

Summary

You need to know SQL and Python very well to land a data engineering job at a well-known tech company (e.g., a FAANG company). Experience with a workflow management platform like Apache Airflow helps you build ETL pipelines and is a valuable skill to have in your toolbox. In addition, you need to know the fundamentals of Hadoop, Hive, Spark, and data warehousing to pass the interviews and do your daily job. For many companies and roles, deep knowledge of Hadoop, Hive, or Spark is unnecessary, since many DE roles interact with data warehouses through SQL-based tools like Presto or Hive.

