Photo by Vlad Chețan from Pexels

A Guide To The Data Lake — Modern Batch Data Warehousing

Redefining the batch data extraction patterns and the data lake using “Functional Data Engineering”

Daniel Mateus Pires
10 min read · Jan 8, 2020


The last few decades have been greatly transformative for the Data Analytics landscape. With the decrease in storage costs and the adoption of cloud computing, the restrictions that once guided the design of data tools became obsolete, and data engineering tools and techniques had to evolve accordingly.

I suspect many data teams are taking on complex projects to modernize their data engineering stack and use the new technologies at their disposal. Many others are designing new data ecosystems from scratch in companies pursuing new business opportunities made possible by advances in Machine Learning and Cloud computing.

A pre-Hadoop batch data infrastructure was typically made of a Data Warehouse (DW) appliance tightly coupled with its storage (e.g. Oracle or Teradata DW), an Extract Transform Load (ETL) tool (e.g. SSIS or Informatica) and a Business Intelligence (BI) tool (e.g. Looker or MicroStrategy). The philosophy and design principles of the Data organization, in this case, were driven by well-established methodologies as outlined in books such as Ralph Kimball’s The Data Warehouse Toolkit (1996) or Bill Inmon’s Building The Data Warehouse (1992).

I contrast this approach with its modern version, born of Cloud technology innovations and reduced storage costs. In a modern stack, the roles that were handled by the Data Warehouse appliance are now handled by specialized components: file formats (e.g. Parquet, Avro, Hudi), cheap cloud storage (e.g. AWS S3, GS), metadata engines (e.g. Hive metastore), query/compute engines (e.g. Hive, Presto, Impala, Spark) and, optionally, cloud-based DWs (e.g. Snowflake, Redshift). Drag-and-drop ETL tools are less common; instead, a scheduler/orchestrator (e.g. Airflow, Luigi) and “ad-hoc” software logic take on this role. The “ad-hoc” ETL software is sometimes found in separate applications and sometimes within the scheduler framework, which is extensible by design (Operator in Airflow, Task in Luigi). It often relies on external compute systems such as Spark clusters or DWs for heavy Transformation. The BI side also saw the rise of an open source…
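As a rough illustration of the scheduler-driven ETL pattern described above, here is a minimal sketch of an Airflow DAG that delegates the heavy Transformation step to an external Spark cluster. It assumes Airflow 2.x with the apache-airflow-providers-apache-spark package installed; the DAG name, connection ID, script location and arguments are hypothetical placeholders, not part of the original article.

```python
# Minimal sketch: the scheduler orchestrates, an external Spark cluster does the work.
# Assumes Airflow 2.x + the apache-airflow-providers-apache-spark provider.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="daily_events_batch",        # hypothetical pipeline name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # The "ad-hoc" ETL logic lives in a Spark job; the Operator only submits
    # it and tracks the run, keeping heavy compute off the scheduler itself.
    transform = SparkSubmitOperator(
        task_id="transform_events",
        application="s3://my-bucket/jobs/transform.py",  # hypothetical job script
        conn_id="spark_default",
        application_args=["--ds", "{{ ds }}"],           # pass the execution date as a partition key
    )
```

The design point this sketch tries to capture is the separation of concerns: the orchestrator owns scheduling, retries and dependencies, while storage, metadata and compute are handled by the specialized components listed above.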
