Learn the talk of data — an MLE's journey to the dark depths of data

Thushan Ganegedara
Towards Data Science
Jun 2, 2022


Image by ElasticComputeFarm from Pixabay

Let’s face it: if there’s anything data scientists (DSs) or machine learning engineers (MLEs) shy away from, it’s data that isn’t yet a NumPy array or a pandas DataFrame. Without insinuating that data scientists fear databases, let’s try to understand some of these concepts.

Why does it matter as a DS/MLE?

Companies that serve millions of customers collect copious amounts of structured and unstructured data every second of the day. Serving these customers with millisecond latency while enabling analytics teams to do their jobs is challenging, and requires carefully designed data architectures. Most companies have data engineers to do this for us. So, as an MLE, can’t I be blissfully ignorant of this whole facet and just do my job? Most of the time the answer is a resounding no, for a few reasons:

  • Data in organizations (especially data at scale) can quickly get hairy. Knowing the foundational concepts lets you confidently choose the most efficient approach for gathering high-quality data for your ML use case.
  • It’s more efficient for MLEs to own the data sourcing for their ML models; waiting on another team for help can be slow.
  • Even when doing things yourself, you may need to talk to data engineers to solve problems. Your data knowledge will determine how meaningful those conversations are.

Who is this for?

This is an MLE’s interpretation of the data world. Most of this knowledge came from experience and self-study. This is for someone who,

  • is keen to know how data at scale works and how it’s applied in companies,
  • touches data but isn’t aware of the underlying concepts,
  • is not an experienced data engineer who works with complex data pipelines

As a person who’s only scratched the surface of data, I may be unaware of certain nuances, and there may be other interpretations of some aspects I discuss. If you see anything that needs to be added or modified, please let me know in the comments.

With that little disclaimer out of the way, let’s go through some of the basic data concepts you may come across. Next time, you’ll be able to confidently walk into a chat with a data engineer and collaborate with them to solve your problems quicker.

Architecture

Let’s first understand what a typical data architecture looks like for a company with a reasonably large influx of data.

The data flow used to bring heterogeneous data from various sources to the data-warehouse (Image by Author)

The application reads from and writes to a relational (or transactional) database. We’ll call these operational DBs. These databases are optimized for reading and writing individual records (or transactions). However, hitting the DB directly for every request is expensive, so we place a cache layer in between: the application reads from and writes to the cache, and the cache reads from and writes to the operational DB (read up on read-through/write-through caches). The cache could be something like Redis. These databases are dumped frequently and stored in a data lake such as Amazon S3. The database may also be replicated every few minutes to avoid losing data. Additionally, other third-party data may be written to the data lake.
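To make the caching pattern concrete, here is a minimal sketch of a read-through cache using the redis-py client. The fetch_user_from_db helper and the key layout are made up for illustration; a real application would query the operational DB there.

```python
import json

import redis  # redis-py client

r = redis.Redis(host="localhost", port=6379, db=0)


def fetch_user_from_db(user_id: int) -> dict:
    # Placeholder for a real query against the operational DB
    # (e.g. via SQLAlchemy or psycopg2)
    return {"id": user_id, "name": "example user"}


def get_user(user_id: int) -> dict:
    """Read-through: try the cache first, fall back to the operational DB."""
    key = f"user:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)          # cache hit, no DB round-trip
    user = fetch_user_from_db(user_id)     # cache miss, hit the DB
    r.set(key, json.dumps(user), ex=300)   # populate cache with a 5-minute TTL
    return user
```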

Next, this data is loaded and centralized in a data warehouse, where it is transformed and tailored for analytics. Finally, the clean, transformed data in the warehouse can be used directly or by a reporting layer for visualizing and making data-informed decisions.

Data Lake vs Data Warehouse

The data lake and the data warehouse are at the centre of data flows in companies. What are they, and how do they complement each other? A data lake stores data with little to no structure, which makes writes very quick. Reads, however, are expensive, because you need to infer the schema and cleanse the data before you can use it effectively (known as schema on read). Data lakes offer cheap storage for large amounts of data. For example, you may store dumps of your online operational DBs in a lake before ingesting them for offline analytical purposes. Examples of data lakes are Amazon S3 and Google Cloud Storage. You may use space-efficient formats like Parquet to store the data, as shown below.
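For example, a dump job might write an operational table to the lake as Parquet. A minimal sketch with pandas (the file path below is made up; writing to an "s3://..." path additionally needs the s3fs package installed):

```python
import pandas as pd

# Pretend this came from a dump of an operational "orders" table
orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3],
        "user_id": [10, 11, 10],
        "amount": [9.99, 24.50, 3.75],
    }
)

# Column-oriented, compressed and cheap to keep in a lake.
# With s3fs installed, the path could be "s3://my-data-lake/orders/2022-06-02.parquet".
orders.to_parquet("orders_2022-06-02.parquet", engine="pyarrow", index=False)
```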

The data in the lake is then consumed by the data warehouse. Data warehouses typically enforce schema on write and use structured SQL databases. They are optimized for complex computations (e.g. mathematical operations, windowing, etc.), so it is relatively cheap and efficient to do that computation in the data warehouse. Amazon Redshift is one example.

Data lakes are optimized for storage and unstructured data, whereas data warehouses are optimized for compute and structured data

There are also more advanced technologies that combine the best of both worlds, known as data lakehouses. Snowflake is an example. While Snowflake provides data-warehousing capabilities such as SQL tables and complex analytics functions, you can also leverage the VARIANT data type to store semi-structured data (e.g. JSON blobs) and query it effectively. Snowflake also separates storage from compute, enabling you to optimize each independently and reduce cost.

A data lakehouse combines the advantages of data lakes and warehouses into one solution
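As an illustration, here is a rough sketch of querying JSON blobs stored in a VARIANT column with the Snowflake Python connector. The connection details, table and field names are placeholders, not a real setup:

```python
import snowflake.connector  # pip install snowflake-connector-python

# Placeholder credentials/objects for illustration only
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="...",
    warehouse="ANALYTICS_WH",
    database="ANALYTICS",
    schema="RAW",
)

# raw_events.payload is assumed to be a VARIANT column holding JSON blobs;
# the colon-path syntax lets Snowflake query the semi-structured data directly.
cur = conn.cursor()
cur.execute(
    """
    SELECT payload:user_id::string AS user_id,
           COUNT(*)                AS n_events
    FROM raw_events
    GROUP BY 1
    ORDER BY 2 DESC
    """
)
for user_id, n_events in cur.fetchall():
    print(user_id, n_events)
```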

Operational DBs vs Data Warehouse

On the most user-facing front you find the operational DBs (sometimes called OLTP databases), which are optimized for transaction processing. That’s pretty much what they do: lots and lots of create, read, update and delete operations on individual records (or transactions). In other words, you do not want complex analytical capabilities baked into these DBs, so something like MySQL or Postgres works well most of the time.

A key difference is the type of storage used by the two solutions. Operational DBs store data row by row (row-based storage), which makes sense because they operate on individual rows. The data warehouse, however, uses columnar storage, which is better suited for analytical queries that scan a few columns across many rows.
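A quick way to feel the difference: with a columnar format like Parquet, an analytical query can read only the columns it touches instead of whole rows. A small sketch, reusing the file written in the earlier example:

```python
import pyarrow.parquet as pq

# Only the two columns the aggregation needs are read from disk,
# which is exactly the access pattern columnar storage is good at.
table = pq.read_table("orders_2022-06-02.parquet", columns=["user_id", "amount"])
print(table.to_pandas().groupby("user_id")["amount"].sum())
```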

Internals of a Data warehouse

Let’s now turn our attention to the data warehouse. This is where you’ll be spending most of your time as a DS or MLE.

Layers of the data warehouse

A data-warehouse solution is usually complex and has multiple layers. You will commonly find the following layers:

  • Staging area — This is where source data gets dumped. Dumps of operational DBs and third-party data are loaded into the staging area
  • Transform/Storage layer — Data from the staging area is aggregated and transformed into new tables. These tables may serve as datasets for ML models or provide insights about the product/service for data analysts
  • Reporting layer — Going a step further, you may decide to have a reporting layer with more bespoke tables, built on source and transformed data, that can be consumed directly by visualization/reporting tools

Tables vs Views

Another distinction to note is the difference between tables and views. Tables physically live in your database, whereas views are virtual and get populated on demand. For data that is used only occasionally (e.g. to train a model every few days), views are a good choice and will help you save on storage.
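A minimal sketch of the idea (all table and column names below are made up): the view stores only the query, not the data, so it occupies no table storage and is evaluated each time it is read.

```python
# Run this against the warehouse with your SQL client of choice,
# e.g. cursor.execute(CREATE_TRAINING_VIEW)
CREATE_TRAINING_VIEW = """
CREATE OR REPLACE VIEW training_examples AS
SELECT user_id, item_id, event_time
FROM clean_events
WHERE event_time >= DATEADD(day, -30, CURRENT_DATE)
"""
```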

Fact and dim tables

When you collate all of your source data from different microservices in the warehouse, the tables are organized into fact and dim (dimension) tables. They make up the data architecture of your service/product. The easiest way to understand fact and dim tables is through a user story. Let’s imagine an online library system. One of the user stories might be,

As a borrower, I need to be able to browse the library, select a book and check its availability

In this story, we may have the following tables

Organization of fact and dim tables (Star schema). PK refers to Primary Key and FK refers to Foreign Key (Image by Author)
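Here is a hedged sketch of what the tables behind such a star schema might look like; the exact names, columns and types are assumptions for illustration, not read off the diagram.

```python
STAR_SCHEMA_DDL = """
CREATE TABLE dim_borrower (
    borrower_id INT PRIMARY KEY,          -- PK of the dim table
    name        VARCHAR,
    joined_at   DATE
);

CREATE TABLE dim_book (
    book_id      INT PRIMARY KEY,         -- PK of the dim table
    title        VARCHAR,
    is_available BOOLEAN
);

CREATE TABLE fact_borrowing (
    borrowing_id INT PRIMARY KEY,
    borrower_id  INT REFERENCES dim_borrower (borrower_id),  -- FK to dim_borrower
    book_id      INT REFERENCES dim_book (book_id),          -- FK to dim_book
    borrowed_at  TIMESTAMP
);
"""
```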

Star vs Snowflake schemas

The purpose of loading data into the warehouse is to provide a centralized haven for analytical processing. Therefore, it is important to have a schema design that enables easier and faster analytical queries. Star and snowflake schemas are ways to organize the tables to suit analytics.

The schema in the user story we discussed is known as a star schema. You can see how the dim tables fan out from a fact table. This is an overly simplified representation of reality, where the number of dim tables, and thus the number of foreign keys appearing in fact tables, can go from a few tens up to hundreds.

The snowflake schema takes the star schema a step further. It further normalizes the dim tables, allowing dim tables to fan out into other dim tables. For example, if you decide to support searching by author, you may introduce another dim table called dim_author and add its Author ID as a foreign key in dim_book.

Example snowflake schema (Image by Author)

As you can see, this allows further analysis on the data, such as group-bys based on author.
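For instance, counting borrowings per author becomes a two-hop join through dim_book into dim_author. The query below follows the illustrative table names used in the earlier sketch:

```python
BORROWINGS_PER_AUTHOR = """
SELECT a.author_name,
       COUNT(*) AS n_borrowings
FROM fact_borrowing AS f
JOIN dim_book   AS b ON f.book_id   = b.book_id
JOIN dim_author AS a ON b.author_id = a.author_id
GROUP BY a.author_name
ORDER BY n_borrowings DESC
"""
```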

ETL/ELT Pipelines

The ETL pipeline is an integral part of the data architecture. It takes data from messy, heterogeneous sources and pumps it into a homogeneous data warehouse. You can think of the ETL pipeline as the process through which service DB data reaches the data warehouse. ETL stands for Extract, Transform and Load. Let’s understand what happens in each step,

  • Extract — Extract the data from heterogeneous sources. This can happen on an automatic schedule, be triggered manually, etc.
  • Transform — Clean the source data and build the transformed tables
  • Load — Load the clean, transformed data into a target DB in the data warehouse

However, with storage becoming cheaper, it is now typical to extract and load the source data into the data warehouse first and transform it there. This process is known as ELT (as opposed to ETL). Modern data-warehouse solutions typically provide a lot of ELT capability (such as COPY INTO in Snowflake) out of the box.
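A rough sketch of the “EL” part in Snowflake. Here @lake_stage is assumed to be an external stage pointing at the data lake, and raw_orders a landing table with a single VARIANT column; both are made-up names.

```python
COPY_RAW_ORDERS = """
COPY INTO raw_orders
FROM @lake_stage/orders/
FILE_FORMAT = (TYPE = 'PARQUET')
"""
# The "T" step then happens inside the warehouse, e.g. CREATE TABLE ... AS SELECT
# statements (or a tool like dbt) that build clean, typed tables from raw_orders.
```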

If you’re using a cloud platform such as AWS, you can do most of the extract, load and transform work using its services. For example, Amazon DMS (Database Migration Service) provides easy ways to replicate data from operational DBs to a lake like S3. AWS Glue can then help you with most of the ETL work, such as cleaning and transforming data as well as building data catalogs.
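For example, kicking off a Glue job from Python with boto3 might look like the sketch below; the job name is made up, and the actual transform logic would live in the Glue job’s own script.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Start a (hypothetical) Glue job that cleans and transforms last night's dump
run = glue.start_job_run(JobName="nightly-orders-transform")

# Check its status; a real pipeline would usually let a scheduler handle this
status = glue.get_job_run(JobName="nightly-orders-transform", RunId=run["JobRunId"])
print(status["JobRun"]["JobRunState"])  # e.g. RUNNING, SUCCEEDED, FAILED
```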

Transforming tables is not an easy task when working with several hundred tables. We need to be able to track changes and version-control the transformation logic. A fast-evolving company such as a startup highlights this need even more. Tools like dbt help you do this, and you can easily integrate your dbt workflows with warehouse solutions like Snowflake. A few advantages of dbt are,

  • The ability to version-control your transformation logic just like a code repository
  • Knowing the lineage of data transformations through dbt’s referencing mechanism (a small sketch follows this list)
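As a taste of that referencing mechanism, here is what the contents of a dbt model file (e.g. models/fct_borrowings.sql) might look like, shown as a Python string purely for illustration; the model and column names are made up. The {{ ref(...) }} calls are how dbt resolves dependencies between models and builds the lineage graph.

```python
ILLUSTRATIVE_DBT_MODEL = """
SELECT
    b.borrowing_id,
    b.borrower_id,
    bk.title,
    b.borrowed_at
FROM {{ ref('stg_borrowings') }} AS b
JOIN {{ ref('stg_books') }}      AS bk
  ON b.book_id = bk.book_id
"""
```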

Conclusion

We discussed many different topics here. With this knowledge, you’ll be able to make a lot of decisions more easily when it comes to where to go looking for data. Additionally, it’ll help you speak the language of data engineers, making your conversations with them effective.

First we discussed the typical high-level data flow of a company that operates at scale. Next we compared and contrasted various data stores: operational DBs, the data warehouse and the data lake. Then we dived into the details of the data warehouse, starting with its different layers and moving on to how data is organized inside it. Finally, we discussed what ETL/ELT pipelines are and how they fit into the data flow.
