
Data Engineering Books

Readers Digest to Learn Data Engineering Gradually

Photo by Tamas Pap on Unsplash

In this story, I would like to talk about data engineering books and resources that might be of interest to those learning data engineering (DE). I realised that there aren’t many on the market that explain data engineering holistically as a concept. Some are great at showing how to use particular tools and data platform architectures, and some are my favourite bedtime reads: gloriously boring and astonishingly easy to fall asleep to. Some are great for strategic decision-making, and some might seem a bit outdated but are still useful. I hope you’ll find it interesting.

Disclosure: This post may contain affiliate links, meaning I get a commission if you decide to make a purchase through my links, at no cost to you.

1. Data Engineering with Python

Work with Massive Datasets to Design Data Models and Automate Data Pipelines Using Python

by Paul Crickard Released 2020

This is a great book for those who would like to learn open-source Apache tools for data engineering. It covers essential data engineering topics such as data modelling and offers an abundance of examples of the most common data transformations. As the book description suggests, it is about Python and data modelling, so readers will focus on ETL techniques to extract, cleanse and enrich datasets using Python tools. It explains Apache Kafka and Apache Spark in detail but also covers the essentials of working with file formats, data transformation and cleansing. The book also offers some really good views on data pipeline deployments and on working with data environments.
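To give a feel for the extract, cleanse and enrich steps mentioned above, here is a minimal sketch using only the Python standard library. It is not taken from the book: the sample data, the column names and the function names are all hypothetical.

```python
import csv
import io

# Hypothetical raw input with messy whitespace and a missing measurement.
RAW_CSV = """id,city,temp_c
1, London ,12.5
2,paris,
3,Berlin,9.0
"""


def extract(text):
    """Extract: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def cleanse(rows):
    """Cleanse: drop incomplete rows, trim whitespace, cast types."""
    cleaned = []
    for row in rows:
        if not row["temp_c"].strip():
            continue  # skip rows with a missing measurement
        cleaned.append(
            {
                "id": int(row["id"]),
                "city": row["city"].strip().title(),
                "temp_c": float(row["temp_c"]),
            }
        )
    return cleaned


def enrich(rows):
    """Enrich: add a derived Fahrenheit column to every row."""
    return [{**row, "temp_f": row["temp_c"] * 9 / 5 + 32} for row in rows]


result = enrich(cleanse(extract(RAW_CSV)))
for row in result:
    print(row)
```

Keeping each step as a small function makes it easier to test the steps separately and to swap the in-memory CSV for a real source later.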

One of my stories with advanced ETL techniques to complement this book:

Python for Data Engineers

2. Fundamentals of Data Engineering

by Joe Reis, Matt Housley Released June 2022 Publisher: O’Reilly Media, Inc.

Overall it’s a very good book which I believe is the closest match to the book I am working on at the moment. It covers the fundamentals and it is great indeed. However, it doesn’t explain how to become a data engineer. According to this book, there are no shortcuts and no easy ways of getting into this role. A reader will need to invest 2–3 years in studying this particular field.

What I like about this book is that it offers an independent view of technology and architecture.

We won’t see any marketing here. It has a very clean focus on the data engineering lifecycle in Chapter 2 and explains how it works from project requirements gathering and pipeline design to going live covering the best practices in its area.

The book is all about SQL and Python and how to use them to solve real-world data engineering tasks. Chapter 4 introduces a framework for choosing the right DE technology, very similar to the one I wrote about in one of my previous stories:

Data Platform Architecture Types

Overall this is one of my favourites. It covers not only the intricacies of data generation, ETL, aggregation, and cleansing but also has a focus on strategy which might be useful for data engineering managers.

3. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition

by Ralph Kimball, Margy Ross Released 2013 Publisher: Wiley

I remember I bought this one years ago when I started working with Snowflake.

Released in 2013, this book is still valid for many data modelling scenarios.

What I liked about this particular book is the case studies. It offers more than 20 really useful scenarios from different industries, such as retail and marketing. It helped me a lot to understand dimensional modelling and data warehouse design at an advanced level. Basically, it explains everything you need to know about fact and dimension tables and how to run ETL in a data warehouse solution.
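For readers new to fact and dimension tables, here is a minimal sketch of a star schema using Python's built-in sqlite3 module. It is not an example from the book: the table names, column names and sample rows are hypothetical and only illustrate the general idea of joining measurements (facts) to descriptive attributes (dimensions).

```python
import sqlite3

# Hypothetical retail example: one dimension table and one fact table.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        product_name TEXT,
        category TEXT
    );
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        product_key INTEGER REFERENCES dim_product (product_key),
        quantity INTEGER,
        amount REAL
    );
    """
)
conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?, ?)",
    [
        (1, "Espresso beans", "Coffee"),
        (2, "Green tea", "Tea"),
        (3, "Filter beans", "Coffee"),
    ],
)
conn.executemany(
    "INSERT INTO fact_sales (product_key, quantity, amount) VALUES (?, ?, ?)",
    [(1, 2, 20.0), (2, 1, 5.0), (3, 3, 24.0), (1, 1, 10.0)],
)


def sales_by_category(connection):
    """Join facts to the dimension and aggregate by a descriptive attribute."""
    return connection.execute(
        """
        SELECT d.category, SUM(f.quantity) AS units, SUM(f.amount) AS revenue
        FROM fact_sales AS f
        JOIN dim_product AS d ON d.product_key = f.product_key
        GROUP BY d.category
        ORDER BY d.category
        """
    ).fetchall()


report = sales_by_category(conn)
print(report)
```

The fact table stays narrow (keys and numbers), while anything you would want to filter or group by lives in the dimension table.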

It is really interesting to read it even now to witness the evolution of data warehouse platforms.

4. Data Mesh

by Zhamak Dehghani Released 2022 Publisher: O’Reilly Media, Inc.

Nice and fresh overview of Data Mesh principles. Data Mesh and decentralized data management are definitely among the major trends in the DE world.

Data Mesh describes a setup in which different data domains (company departments) have their own teams and share data resources.

I previously wrote about it in one of my stories on modern data engineering.

Modern Data Engineering

This book is a good read for those who want to learn about data mesh design, strategy and architecture. The book explains the data ownership model in a logically coherent manner to move beyond the traditional data warehouse approach towards a decentralized and distributed data platform.

5. Data Pipelines Pocket Reference: Moving and Processing Data for Analytics 1st Edition

by James Densmore Format: Kindle Edition Released Feb 2021 Publisher: O’Reilly Media, Inc.

This is one of my favourite books on data pipelines. Some Python and SQL code snippets were very useful for me at some point in my career. The GitHub repository for this book demonstrates how to extract data from external data sources and transform it into datasets.

The book introduces a "build vs buy" approach, a decision data engineers regularly have to make. Indeed, there are many managed ETL solutions on the market right now, e.g. Stitch and Fivetran. It covers data pipeline design principles and explains how to create robust data processing for successful analytics, including many crucial points of pipeline design from the architecture point of view. It also covers aspects of modern data infrastructure in the cloud, data pipeline monitoring and alerting. I wrote an article on data pipeline design patterns which offers a similar idea and a focus on the strategy for choosing the right tool.

Data pipeline design patterns

6. Architecting Modern Data Platforms: A Guide to Enterprise Hadoop at Scale

by Jan Kunigk, Ian Buss, Paul Wilkinson, Lars George Released 2019 Publisher: O’Reilly Media, Inc.

This book is great at explaining Hadoop technology. Even though the technology is not very popular at the SME level, the book argues that it is still viable for enterprise applications. It is an interesting read focusing on practical use cases for creating Big Data infrastructure both in the cloud and on-premises. I’m sure it will be useful for seasoned data engineers tasked with creating enterprise-level pipelines in the cloud and ensuring a high level of security and availability. This is not a book I read regularly, but it is still useful as it gives an overview of something that was considered long dead. Nice to know that Hadoop is still alive.

7. Spark: The Definitive Guide: Big Data Processing Made Simple 1st Edition

by Bill Chambers, Matei Zaharia Released 2018 Publisher: O’Reilly Media, Inc.

This is one of my favourites when it comes to ETL in big data pipelines for data lakes. Spark is popular for its scalability and cost-effectiveness. It is a wonderful book for beginners and intermediate users who want to learn scalable data processing in the data lake. It covers some essential DE concepts and data lake data processing using Apache Spark. Apache Spark is used in many cloud products, AWS Glue for example, which makes this book a great choice for aspiring data engineers.

8. Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing 1st Edition

by Tyler Akidau, Slava Chernyak, Reuven Lax Released 2018 Publisher: O’Reilly Media, Inc.

Great book on one of the most popular data pipeline design patterns: streaming. It explains streaming data processing pipelines and their core principles. Stream processing lets applications trigger immediate responses to new data events. For data engineers, it is very important to understand the nature of data pipeline design patterns, e.g. batch data processing and streaming ETL, and to apply them correctly.
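To illustrate the difference from batch processing, here is a minimal sketch of a streaming-style aggregation: events are counted in fixed (tumbling) time windows and each result is emitted as soon as its window closes, rather than after the whole dataset has been read. This is not code from the book. The function name and sample events are hypothetical, and the sketch assumes events arrive in timestamp order, which real streaming systems cannot rely on.

```python
def tumbling_window_counts(events, window_seconds):
    """Yield (window_start, count) each time a window closes.

    Assumes events are (timestamp, payload) pairs ordered by timestamp,
    so there is no late or out-of-order data to handle.
    """
    current_window = None
    count = 0
    for timestamp, _payload in events:
        window_start = timestamp - (timestamp % window_seconds)
        if current_window is None:
            current_window = window_start
        if window_start != current_window:
            # A new window has started, so the previous one is complete.
            yield current_window, count
            current_window, count = window_start, 0
        count += 1
    if current_window is not None:
        # Flush the final window once the input ends.
        yield current_window, count


# Hypothetical events: (timestamp in seconds, payload).
events = [(0, "a"), (3, "b"), (12, "c"), (14, "d"), (15, "e"), (31, "f")]
results = list(tumbling_window_counts(events, window_seconds=10))
print(results)
```

Because the function is a generator, a consumer can react to each window result immediately. Windows that receive no events are simply not emitted in this sketch.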

Streaming is a "must-have" solution for enterprise data.

This book helped me a lot in choosing the right way to process data and create close to real-time analytics pipelines. Often streaming is not required and might become a costly solution in the end.

9. Storytelling with Data: A Data Visualization Guide for Business Professionals 1st Edition

by Cole Nussbaumer Knaflic Released 2015 Publisher: Wiley

Great book on data visualization techniques and Business Intelligence (BI). BI is an important part of data engineering (and vice versa), although this book is not a career guide. It explains how data engineering can supplement business intelligence and demonstrates how to communicate data insights in an informative, compelling and engaging manner. It helped me a lot with my dashboard design. Adding this to my bookshelf.

10. Fluent Python: Clear, Concise, and Effective Programming 2nd Edition

by Luciano Ramalho Released 2022 Publisher: O’Reilly Media, Inc.

Another really useful Python book that I keep very close. Python is a big chunk of data engineering, which makes this book extremely useful. It is split into five parts that cover pretty much everything data engineers might want to use in their data pipelines, e.g. context managers, decorators, generators and async.
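As a rough illustration of how three of these features can show up in a pipeline, here is a minimal sketch: a decorator that retries a flaky step, a context manager that times a block, and a generator that parses rows lazily. It is not taken from the book. All names are hypothetical, and the transient failure is simulated.

```python
import time
from contextlib import contextmanager
from functools import wraps


def retry(times):
    """Decorator: re-run a step up to `times` attempts on ValueError."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(times):
                try:
                    return func(*args, **kwargs)
                except ValueError as error:
                    last_error = error
            raise last_error
        return wrapper
    return decorator


@contextmanager
def timed(label, log):
    """Context manager: append (label, elapsed seconds) to `log` on exit."""
    start = time.perf_counter()
    try:
        yield
    finally:
        log.append((label, time.perf_counter() - start))


def read_rows(lines):
    """Generator: parse 'name, value' lines one at a time."""
    for line in lines:
        name, value = line.split(",")
        yield {"name": name.strip(), "value": int(value)}


attempts = {"count": 0}


@retry(times=3)
def flaky_load(rows):
    """Hypothetical load step that fails on its first call only."""
    attempts["count"] += 1
    if attempts["count"] < 2:
        raise ValueError("simulated transient failure")
    return len(rows)


timings = []
raw = ["a, 1", "b, 2", "c, 3"]
with timed("pipeline", timings):
    rows = list(read_rows(raw))
    loaded = flaky_load(rows)

print(loaded, attempts["count"], timings[0][0])
```

The retry decorator is applied to a step that takes a list rather than a generator, because a generator that has already been consumed would be empty on the second attempt.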

Python for Data Engineers

11. 97 Things Every Data Engineer Should Know: Collective Wisdom from the Experts

by Tobias Macey Released 2021 Publisher: O’Reilly Media, Inc.

This is a great book that confirms that data engineers are in high demand now. This book is a collection of data engineers’ experience. Many of them designed data pipelines and ETL processes for companies that achieved notable success in the big data and AI field. It’s great to see people are still willing to share their knowledge and explain how they managed to solve challenging ETL problems. The book consists of 97 use cases that can be used by almost every data engineer for data processing and data pipeline design. I like to read one a day.

Conclusion

If you are a learner or an aspiring data enthusiast willing to acquire new data skills, there are plenty of opportunities to do it for free in the cloud. I would strongly recommend setting up an account with one of the cloud platform vendors to start learning the DE tools available on the market. Many vendors offer free tier services, so exploring the latest data engineering advances shouldn’t cost anything. Just make sure you keep an eye on billing while using free tier tools. The overview of the books given in this article will support your learning curve. The majority of them assume that readers are comfortable working with JSON, SQL and REST APIs and know the basics of Python programming. This corresponds with what I wrote about the data engineering skillset in one of my previous articles. I hope you will find it useful.

How to Become a Data Engineer

Recommended read:

  1. https://medium.com/towards-data-science/python-for-data-engineers-f3d5db59b6dd
  2. https://towardsdatascience.com/data-platform-architecture-types-f255ac6e0b7
  3. https://towardsdatascience.com/modern-data-engineering-e202776fb9a9
  4. https://towardsdatascience.com/data-pipeline-design-patterns-100afa4b93e3
  5. https://towardsdatascience.com/how-to-become-a-data-engineer-c0319cb226c2
