
What is a Data Lake?

Terms and definitions, and how you can profit from them

Photo by Pietro De Grandi on Unsplash

Data Lakes and Data Warehouses are both established terms when it comes to storing Big Data, but the two terms are not synonymous. A data lake is a large pool of raw data whose use has not yet been determined. A data warehouse, on the other hand, is a repository for structured, filtered data that has already been processed for a specific purpose [1].

Features of a Data Lake

In a Data Lake, data is ingested into a storage layer with minimal transformation, preserving its input format, structure and granularity. The lake holds both structured and unstructured data. This results in several features, such as:

  • Collection of multiple data sources, such as bulk data, external data, real-time data and many more.
  • Control of ingested data, with a focus on documenting the data structure.
  • Generally useful for analytical reports and Data Science.
  • Can also include an integrated Data Warehouse to provide classic management reports and dashboards.
  • A data storage pattern that prioritizes availability above everything else: across the enterprise, across all departments, and for all users of the data.
  • Easy integration of new data sources.
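To make the idea of ingestion with minimal transformation more concrete, here is a minimal sketch in Python. It assumes a local folder stands in for the storage layer. The function name `ingest_raw`, the `raw/<source>/` folder layout and the `catalog.jsonl` file are hypothetical choices for illustration, not part of any particular product.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def ingest_raw(lake_root: Path, source: str, filename: str, payload: bytes) -> Path:
    """Store a payload byte-for-byte and record a small catalog entry."""
    # Hypothetical layout: <lake_root>/raw/<source>/<filename>
    target_dir = lake_root / "raw" / source
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)  # no parsing, no cleaning, no reshaping

    # Document what was ingested without touching the data itself.
    entry = {
        "source": source,
        "path": target.relative_to(lake_root).as_posix(),
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    with (lake_root / "catalog.jsonl").open("a", encoding="utf-8") as catalog:
        catalog.write(json.dumps(entry) + "\n")
    return target


# A temporary directory plays the role of the lake in this sketch.
lake = Path(tempfile.mkdtemp())
csv_path = ingest_raw(lake, "crm_export", "customers.csv", b"id;name\n1;Ada\n")
json_path = ingest_raw(lake, "web_events", "clicks.json", b'{"user": 1, "page": "/home"}\n')

catalog_text = (lake / "catalog.jsonl").read_text(encoding="utf-8")
catalog_entries = [json.loads(line) for line in catalog_text.splitlines()]
```

Both files keep their original format (a semicolon-separated CSV and a JSON line), and the catalog only describes them. In a real setup the storage layer would more likely be an object store or a distributed file system.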

Differences between a Data Lake and Data Warehouse

While data warehouses use the classic ETL (extract, transform, load) process in combination with structured data in a relational database, a data lake uses paradigms such as ELT (extract, load, transform) and schema-on-read, and often holds unstructured data [2].
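Schema-on-read can be illustrated with a small Python sketch: the raw records are stored as they arrived, and a schema is applied only when a consumer reads them. The field names, the `ORDER_SCHEMA` mapping and the `read_with_schema` helper are assumptions made up for this example.

```python
import json

# Raw events exactly as they arrived; the records do not share one shape.
raw_lines = [
    '{"user_id": "17", "amount": "19.90", "currency": "EUR"}',
    '{"user_id": 42, "amount": 5, "note": "gift card"}',
    '{"user_id": "8"}',
]

# Hypothetical schema chosen by one consumer at read time: field -> (type, default).
ORDER_SCHEMA = {
    "user_id": (int, None),
    "amount": (float, 0.0),
    "currency": (str, "EUR"),
}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and project them onto the schema while reading."""
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, (cast, default) in schema.items():
            value = record.get(field, default)
            # Cast to the schema's type; keep missing values as None.
            row[field] = cast(value) if value is not None else None
        yield row


orders = list(read_with_schema(raw_lines, ORDER_SCHEMA))
```

The stored data is never rewritten. A different consumer could read the same lines with a different schema, for example one that keeps the `note` field, which is the flexibility that schema-on-write in a warehouse gives up in exchange for stricter guarantees.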

Differences Data Warehouse vs. Lake - Image by Author

The figure above shows the main differences. The technologies involved are also quite different: for a data warehouse you will use SQL and relational databases, while for a data lake you will probably use NoSQL or a mixture of both.

In my own experience, a Data Lake can often be realized much faster. Once all data is available, Data Warehouses can still be built on top of it as a hybrid solution.

Hybrid Data Lake Concept - Image from Author

This hybrid approach makes rigid, classically planned data warehouses a thing of the past. It greatly accelerates the provision of dashboards and analyses and is a good step towards a data-driven culture. Implementing it with new SaaS services from the cloud and approaches such as ELT instead of ETL also accelerates development.
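The ELT idea, loading first and transforming later inside the store, can be sketched with Python's built-in `sqlite3` module. An in-memory SQLite database stands in for the lake and warehouse layers here purely for illustration. The table names `raw_sales` and `sales_by_country` and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: land the source rows untouched, every column as text.
conn.execute("CREATE TABLE raw_sales (order_id TEXT, amount TEXT, country TEXT)")
raw_rows = [
    ("1001", "19.90", "de"),
    ("1002", "5.00", "DE"),
    ("1003", "12.50", "fr"),
]
conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", raw_rows)

# Transform: build a warehouse-style table later, inside the store, with SQL.
conn.execute(
    """
    CREATE TABLE sales_by_country AS
    SELECT UPPER(country) AS country,
           COUNT(*) AS orders,
           ROUND(SUM(CAST(amount AS REAL)), 2) AS revenue
    FROM raw_sales
    GROUP BY UPPER(country)
    """
)

report = conn.execute(
    "SELECT country, orders, revenue FROM sales_by_country ORDER BY country"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_sales").fetchone()[0]
```

The raw table stays available after the transformation, so further reporting tables can be derived from it later without ingesting the data again. In an ETL setup, the cleaning and casting would instead happen before the data reaches the database.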

Conclusion

This article briefly explains what a data lake is and how it gives your company the flexibility to capture every aspect of business operations in data form while keeping the traditional data warehouse alive. The advantage over the classic data warehouse is that different data and data formats, whether structured or unstructured, can be stored in the data lake. Distributed data silos are thus avoided. Use cases from data science as well as classic data warehouse approaches can be served. Data Scientists can retrieve, prepare, and analyze data faster and with greater accuracy. So in the end, Data Lakes will not replace Data Warehouses. Rather, the two options complement one another.

Sources and Further Readings

[1] Talend, Data Lake vs. Data Warehouse

[2] IBM, Charting the data lake: Using the data models with schema-on-read and schema-on-write (2017)

