
5 Simple Projects to Start Today: A Learning Roadmap for Data Engineering

Start with 5 practical projects to lay the foundation for your data engineering roadmap.

Tutorials help you understand the basics, and you will definitely learn something. However, the real learning effect comes when you implement small projects yourself and thus combine theory with practice.

You will benefit even more if you explain what you have learned to someone else. You can also use ChatGPT as a learning partner or tutor – explain in your own words what you have learned and get feedback. Use one of the prompts that I have attached after the roadmap.

In this article, I present a 4-month roadmap for learning the most important data engineering concepts as a beginner. You start with the basics and increase the level of difficulty to tackle more complex topics. The only requirements are some Python programming skills, basic knowledge of data manipulation (e.g. simple SQL queries) and motivation 🚀

Why only 4 months?

It is much easier for us to commit to a goal over a shorter period of time: we stay more focused and motivated. Open your favorite app right away and start a project based on the examples, or create a calendar entry to block time for the implementation.

5 projects for your 4-month roadmap

As a data engineer, you ensure that the right data is collected, stored and prepared in such a way that it is accessible and usable for data scientists and analysts.

You are, so to speak, the kitchen manager who organizes the kitchen and ensures that all the ingredients are fresh and within reach. The data scientist is the chef who combines them into creative dishes.

Month 1 – Programming and SQL

Deepen your knowledge of Python basics. CSV and JSON files are common formats for data exchange. Learn how to read and edit CSV and JSON files, and understand how to manipulate data with the Python libraries Pandas and NumPy.
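
As a quick warm-up, here is a minimal sketch of reading both formats; the file names data.csv and data.json are placeholders for your own files.

import json
import pandas as pd

# Read a CSV file into a DataFrame (file name is a placeholder)
df = pd.read_csv('data.csv')
print(df.head())

# Read a JSON file with the standard library and inspect the structure
with open('data.json') as f:
    records = json.load(f)
print(type(records), len(records))

# If the JSON is a list of flat records, Pandas can load it directly
df_json = pd.json_normalize(records)
print(df_json.head())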

A small project to start in month 1: Clean a CSV file with messy data, prepare it for data analysis and save it in a clean format. Use Pandas for data manipulation and basic Python functions for editing; a compact sketch of all four steps follows the list.

  1. Read the file with ‘pd.read_csv()’ and get an overview with ‘df.head()’ and ‘df.info()’.
  2. Remove duplicates with ‘df.drop_duplicates()’ and fill missing numeric values with the average using ‘df.fillna(df.mean())’. Optional: Research what other options are available to handle missing values.
  3. Create a new column with ‘df["new_column"]’ that, for example, contains ‘True’ for all rows above a certain value and ‘False’ for all others.
  4. Save the cleaned data with ‘df.to_csv("new_name.csv", index=False)’ in a new CSV file.
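
Put together, the four steps could look roughly like this. It is only a sketch: the file names and the ‘price’ column are placeholders, and the mean imputation is applied to numeric columns only, which is just one of several possible strategies.

import pandas as pd

# Step 1: Read the raw CSV file and get a first overview
df = pd.read_csv('raw_data.csv')
print(df.head())
df.info()

# Step 2: Remove duplicates and fill missing numeric values with the column mean
df = df.drop_duplicates()
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Step 3: Add a flag column, e.g. True for all rows above a certain value
df['above_threshold'] = df['price'] > 100

# Step 4: Save the cleaned data to a new CSV file
df.to_csv('new_name.csv', index=False)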

What problem does this project solve? Data quality is key. Unfortunately, the data you receive in the business world is not always clean.

Tools and languages: Python (Pandas & NumPy libraries), Jupyter Lab

Understanding SQL. SQL allows you to query and organize data efficiently. Understand how to use the most important commands such as CREATE TABLE, ALTER TABLE, DROP TABLE, SELECT, WHERE, ORDER BY, GROUP BY, HAVING, COUNT, SUM, AVG, MAX & MIN and JOIN.

A small project to deepen your knowledge in month 1: Create a relational data model that maps real business processes. Is there a medium-sized bookstore in your city? That is a good scenario to start with.

  1. Think about what data the bookshop manages: for example, books with title, author and ISBN (unique identification number), and customers with name, e-mail, etc.
  2. Now draw a diagram that shows the relationships between the data. A bookstore has several books, which can be by several authors, and customers buy these books. Think about how this data is connected.
  3. Next, write down which tables you need and which columns each table has, for example the columns ISBN, title, author and price for the book table. Do this step for all the data you identified in step 1.
  4. Optional: Create the tables with ‘CREATE TABLE tablename ();’ in a SQLite database. You can create a table with the following code.
-- Creating a table with the name of the columns and their data types
CREATE TABLE Books (
    BookID INT PRIMARY KEY,
    Title VARCHAR(100),
    Author VARCHAR(50),
    Price DECIMAL(10, 2)
);
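
To make the relationships from step 2 concrete, here is a small sketch that extends the example with Customers and Orders tables and a first JOIN query. It is only an illustration: the table and column names are assumptions, it uses SQLite's column types (INTEGER, TEXT, REAL), and it runs against a throwaway in-memory database via Python's built-in sqlite3 module.

import sqlite3

# Throwaway in-memory database for experimenting with the bookstore model
conn = sqlite3.connect(':memory:')
conn.executescript('''
CREATE TABLE Books (
    BookID INTEGER PRIMARY KEY,
    Title TEXT,
    Author TEXT,
    Price REAL
);

CREATE TABLE Customers (
    CustomerID INTEGER PRIMARY KEY,
    Name TEXT,
    Email TEXT
);

-- Orders connects customers and books (one row per purchased book)
CREATE TABLE Orders (
    OrderID INTEGER PRIMARY KEY,
    CustomerID INTEGER REFERENCES Customers(CustomerID),
    BookID INTEGER REFERENCES Books(BookID),
    Quantity INTEGER
);
''')

# Example query: how many books has each customer ordered?
rows = conn.execute('''
SELECT c.Name, SUM(o.Quantity) AS TotalBooks
FROM Orders AS o
JOIN Customers AS c ON c.CustomerID = o.CustomerID
GROUP BY c.Name
''').fetchall()
print(rows)  # empty until you INSERT some sample rows
conn.close()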

What problem does this project solve? With a well thought-out data model, a company can efficiently set up important processes such as tracking customer purchases or managing inventory.

Tools and languages: SQL, SQLite, MySQL or PostgreSQL

Month 2 – Databases and ETL pipelines

Mastering relational and NoSQL databases. Understand the concepts of tables, relationships, normalization and queries with SQL. Understand what CRUD operations (Create, Read, Update, Delete) are. Learn how to store, organize and query data efficiently, and understand when NoSQL databases offer advantages over relational databases.

Tools and languages: SQLite, MySQL, PostgreSQL for relational databases; MongoDB or Apache Cassandra for NoSQL databases
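
If you want to see the CRUD operations in action before the ETL project, here is a minimal sketch using Python's built-in sqlite3 module; the table and values are made up for illustration.

import sqlite3

conn = sqlite3.connect(':memory:')  # throwaway in-memory database
conn.execute('CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)')

# Create: insert a new row
conn.execute('INSERT INTO customers (name, email) VALUES (?, ?)', ('Ada', 'ada@example.com'))

# Read: query the row back
print(conn.execute('SELECT id, name, email FROM customers').fetchall())

# Update: change a value
conn.execute('UPDATE customers SET email = ? WHERE name = ?', ('ada@newmail.com', 'Ada'))

# Delete: remove the row again
conn.execute('DELETE FROM customers WHERE name = ?', ('Ada',))

conn.commit()
conn.close()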

Understand the ETL basics. Understand how to extract data from CSV, JSON or XML files and from APIs. Learn how to load cleansed data into a relational database.

A small project for month 2: Create a pipeline that extracts data from a CSV file, transforms it and loads it into a SQLite database. Implement a simple ETL logic.

  1. Load a CSV file with ‘pd.read_csv()’ and get an overview of the data again. As in project 1, remove missing values and duplicates. You can find publicly accessible datasets on Kaggle; for example, search for a dataset with products.
  2. Create a SQLite database and define a table according to the data from the CSV. Below you can see an example code for this. SQLite is easier to get started with, as the sqlite3 module is available in Python by default.
  3. Load the cleaned data from the DataFrame into the SQLite database with ‘df.to_sql("tablename", conn, if_exists="replace", index=False)’.
  4. Now execute a simple SQL query, e.g. with SELECT and ORDER BY, and limit the results to 5 rows. Close the connection to the database at the end. A sketch of steps 3 and 4 follows the example code.
import sqlite3

# Create the connection to the SQLite database
conn = sqlite3.connect('produkte.db')

# Create the table
conn.execute('''
CREATE TABLE IF NOT EXISTS Produkte (
    ProduktID INTEGER PRIMARY KEY,
    Name TEXT,
    Kategorie TEXT,
    Preis REAL
)
''')
conn.commit()  # persist the schema change
print("Table created.")

Tools and languages: Python (SQLAlchemy library), SQL

Month 3 – Workflow orchestration and cloud storage

Workflow orchestration. Workflow orchestration means that you automate and coordinate processes (tasks) in a specific order. Learn how to plan and execute simple workflows. You will also gain a basic understanding of the DAG (Directed Acyclic Graph) concept. A DAG is the basic structure in Apache Airflow and describes which tasks are executed in a workflow and in which order.

Tools and languages: Apache Airflow

Cloud storage. Learn how to store data in the cloud. Know at least the names of the major products from the biggest cloud providers: S3, EC2 and Redshift from AWS; BigQuery, Dataflow and Cloud Storage from Google Cloud; and Blob Storage, Synapse Analytics and Data Factory from Azure. The many different products can be overwhelming, so start with something you enjoy.

A small project for month 3: Create a simple workflow orchestration concept with Python (without Apache Airflow, which lowers the barrier to getting started) that sends you automated reminders during your daily routine:

  1. Plan the workflow: Define tasks such as reminders to "Drink water", "Exercise for 3 minutes" or "Get some fresh air".
  2. Create a sequence of the tasks (DAG): Decide the order in which the tasks should be executed. Define if they are dependent on each other. For example, Task A ("Drink water") runs first, followed by Task B ("Exercise for 3 minutes"), and so on.
  3. Implement the tasks in Python: Write a Python function for each reminder (see the code below as an example).
  4. Link the tasks: Arrange the functions so that they execute sequentially (see the main block in the code below).
import time

# Task 1: Send a reminder
def send_reminder():
    print("Reminder: Drink water!")  # Print a reminder message
    time.sleep(1)  # Pause for 1 second before proceeding to the next task

if __name__ == "__main__":
    print("Start workflow...")  # Indicate the workflow has started

    # Execute tasks in sequence
    send_reminder()  # Task 1: Send a reminder to drink water

    # Additional tasks (uncomment and define these functions if needed)
    # reminder_exercise()   # Example: Send the second reminder
    # create_task_list()    # Advanced example: Create a daily task list

    print("Workflow is done!")  # Indicate the workflow has completed

Too easy? Install Apache Airflow and create your first DAG that prints "Hello World", or load your transformed data into an S3 bucket and analyze it locally.
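
For the Airflow variant, a first DAG could look roughly like this. It is a minimal sketch that assumes a recent Airflow 2.x installation and that the file is placed in your dags/ folder.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def say_hello():
    print('Hello World')  # the only task in this first DAG

with DAG(
    dag_id='hello_world',
    start_date=datetime(2024, 1, 1),
    schedule=None,  # run only when triggered manually
    catchup=False,
) as dag:
    hello_task = PythonOperator(
        task_id='say_hello',
        python_callable=say_hello,
    )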

Tools and languages: AWS, Google Cloud, Azure

Implement the 5 projects to learn twice as much as if you only look at the theory.

Month 4 – Introduction to Big Data and Visualization

Big data basics. Understand the basics of Hadoop and Apache Spark. Below you can find a great, super-short video from Simplilearn to introduce you to Hadoop and Apache Spark.

Tools and languages: Hadoop, Apache Spark, PySpark (Python API for Apache Spark), Python
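
If you want to get hands-on right away, a minimal PySpark session could look like this; it assumes you have installed the pyspark package locally and that you have some CSV file with a 'category' column at hand.

from pyspark.sql import SparkSession

# Start a local Spark session using all cores of your machine
spark = SparkSession.builder.appName('spark-basics').master('local[*]').getOrCreate()

# Read a CSV file into a Spark DataFrame (file name is a placeholder)
df = spark.read.csv('data.csv', header=True, inferSchema=True)

# A first look at the data and a simple aggregation
df.printSchema()
df.groupBy('category').count().show()  # 'category' is an assumed column name

spark.stop()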

Data visualization. Understand the basics of data visualization with Python libraries such as Matplotlib and Seaborn.

A small project for month 4: To avoid the need for big data tools like Apache Spark or Hadoop, but still apply the concepts, download a dataset from Kaggle, analyze it with Python and visualize the results:

  1. Download a publicly available, medium-sized dataset from Kaggle (e.g. weather data), read in the dataset with Pandas and get an overview of your data.
  2. Perform a small exploratory data analysis (EDA).
  3. Create, for example, a line chart of average temperatures or a bar chart of rainy and sunny days per month (see the sketch below).
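
A compact sketch of steps 1 to 3 could look like this; the file name weather.csv and the column names date and temperature are assumptions about your chosen dataset.

import pandas as pd
import matplotlib.pyplot as plt

# Step 1: Read the dataset and get an overview (column names are assumptions)
df = pd.read_csv('weather.csv', parse_dates=['date'])
df.info()
print(df.describe())

# Step 2: Small EDA, e.g. average temperature per month
df['month'] = df['date'].dt.month
monthly_avg = df.groupby('month')['temperature'].mean()

# Step 3: Visualize the result as a line chart
monthly_avg.plot(kind='line', marker='o')
plt.title('Average temperature per month')
plt.xlabel('Month')
plt.ylabel('Temperature')
plt.tight_layout()
plt.show()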

Tools and languages: Python (Matplotlib & Seaborn libraries)

2 prompts to use ChatGPT as a learning partner or tutor

When I learn something new, these two prompts help me to reproduce what I have learned and use ChatGPT to check whether I have understood it. Try them out and see if they help you too.

  1. I have just learned about the [topic / project] and want to make sure I have understood it correctly. Here is my explanation: [your explanation]. Give me feedback on my explanation. Add anything that is missing or that I have not explained clearly.
  2. I would like to understand the topic [topic/project] better. Here is what I have learned so far: [your explanation]. Are there any mistakes, gaps or tips on how I can explain this even better? Optional: How could I expand the project? What could I learn next?

What comes next?

  • Deepen the concepts from months 1–4.
  • Learn complex SQL queries such as subqueries and database optimization techniques.
  • Understand the principles of data warehouses, data lakes and data lakehouses. Look at tools such as Snowflake, Amazon Redshift, Google BigQuery or Salesforce Data Cloud.
  • Learn CI/CD practices for data engineers.
  • Learn how to prepare data pipelines for machine learning models.
  • Deepen your knowledge of cloud platforms – especially in the area of serverless computing (e.g. AWS Lambda).
Own visualization – Illustrations from unDraw.co

Final Thoughts

Companies and individuals are generating more and more data – and the growth continues to accelerate. One reason is the growing number of sources such as IoT devices, social media and customer interactions. At the same time, data forms the basis for machine learning models, whose importance in everyday life will presumably continue to grow. The use of cloud services such as AWS, Google Cloud or Azure is also becoming more widespread. Without well-designed data pipelines and scalable infrastructures, this data can neither be processed efficiently nor used effectively. In addition, in areas such as e-commerce or financial technology, it is becoming increasingly important that we can process data in real time.

As data engineers, we create the infrastructure so that the data is available for machine learning models and real-time streaming (zero ETL). With the points from this roadmap, you can build these foundations.

Where can you continue learning?

