3 Types of Data Architecture Components You Should Know About: Applications, Warehouses and Lakes

What’s the difference, and why does it matter?

Charlotte Tu
Towards Data Science

Photo by Pietro De Grandi on Unsplash

When I was starting out in the world of analytics, data architecture was pretty intimidating. People would throw around alien terms like ‘ETL’ and ‘data lake’, and I’d nod along with only a vague sense of what they were talking about, then file it all in the ‘too difficult’ box.

Today, I want to describe three parts of data architecture in simple terms: applications, data warehouses, and data lakes. I’ll focus on what you need to know as a person within the business (rather than analytics). To keep it practical, I’ll use an example of something most of us probably do at least weekly: buying something on a credit card.

Application database

Applications are where the data gets generated.

Say I go to Tesco each week to do my grocery shop. When I hand over my card to pay, the card application registers that I paid Tesco £45.50 on 19th April at 5.05 pm (there are a few intermediary steps, but they aren’t important for this article). But the card application only generates and keeps data about cards. It holds no information on whether I also hold a mortgage, my address, my marketing preferences and so on. Now say we wanted to understand the total number of transactions a customer makes per month, split by whether the customer also has a mortgage. This wouldn’t be possible using the card application data alone, and this is why we need a data warehouse.
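To make this concrete, here is a minimal sketch of what the card application might hold, using Python’s built-in sqlite3 module. The table and column names (card_transactions, customer_id, merchant, amount_gbp, paid_at) are my own illustrative assumptions rather than a real card system’s schema, and the year in the timestamp is invented.

```python
import sqlite3

# Hypothetical card application database: it only knows about card payments.
card_app = sqlite3.connect(":memory:")
card_app.execute(
    """
    CREATE TABLE card_transactions (
        customer_id INTEGER,
        merchant    TEXT,
        amount_gbp  REAL,
        paid_at     TEXT  -- ISO timestamp; the year is an assumption
    )
    """
)
card_app.execute(
    "INSERT INTO card_transactions VALUES (?, ?, ?, ?)",
    (1, "Tesco", 45.50, "2023-04-19 17:05:00"),
)

# Counting transactions per customer per month works fine on this data...
monthly_counts = card_app.execute(
    """
    SELECT customer_id, strftime('%Y-%m', paid_at) AS month, COUNT(*)
    FROM card_transactions
    GROUP BY customer_id, month
    """
).fetchall()

# ...but no column here says whether the customer has a mortgage.
columns = [row[1] for row in card_app.execute("PRAGMA table_info(card_transactions)")]
print(monthly_counts)
print(columns)
```

The monthly count is answerable, but the mortgage split is not, because that information lives in a different application.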

Data warehouse

The data warehouse is the single version of the truth. Different pieces of application data are brought together and reformatted so that the data from different applications is defined and structured in the same way.

Sticking with my Tesco example, the transaction data from the card application is transferred to the data warehouse. Data from the mortgage application is also transferred to the data warehouse, so we can see which customers hold mortgages and their outstanding balances. We also get data from third parties: every month we receive a report from credit agencies with the credit score for each customer. The frequency at which the data moves depends on the type of data. The credit score for the customer is only available monthly, whereas credit card transactions happen minute by minute. This is where the data architects come in: they define the best frequency for moving each type of data.

Image by Author
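As a rough illustration of the point about frequency, the sketch below gives each source its own refresh interval and checks which sources are due for a load. The source names and the intervals are assumptions for the example (only the monthly credit score comes from the description above), and a real scheduler would be considerably more involved.

```python
from datetime import datetime, timedelta

# Assumed refresh intervals per source; only "monthly credit scores" comes
# from the article, the other two values are placeholders.
LOAD_INTERVALS = {
    "card_transactions": timedelta(minutes=15),
    "mortgage_balances": timedelta(days=1),
    "credit_scores": timedelta(days=30),
}


def sources_due(last_loaded, now):
    """Return the sources whose refresh interval has elapsed since the last load."""
    return sorted(
        source
        for source, interval in LOAD_INTERVALS.items()
        if now - last_loaded[source] >= interval
    )


# Every source was last loaded at 17:00; twenty minutes later we check again.
last_loaded = {source: datetime(2023, 4, 19, 17, 0) for source in LOAD_INTERVALS}
due = sources_due(last_loaded, now=datetime(2023, 4, 19, 17, 20))
print(due)
```

Twenty minutes on, only the fast-moving card data needs refreshing; the slower sources wait for their own cycle.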

Now, with the different sources brought together, we can run our analysis on the number of transactions a customer makes per month, split by whether the customer holds a mortgage.
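Here is a small sketch of that analysis, again with sqlite3 standing in for the warehouse. The table names, column names and sample rows are all made up for illustration, and I’m assuming at most one mortgage row per customer so that the join doesn’t double count transactions.

```python
import sqlite3

# Hypothetical warehouse holding data from both applications side by side.
warehouse = sqlite3.connect(":memory:")
warehouse.executescript(
    """
    CREATE TABLE card_transactions (customer_id INTEGER, amount_gbp REAL, paid_at TEXT);
    CREATE TABLE mortgages (customer_id INTEGER, outstanding_balance_gbp REAL);

    INSERT INTO card_transactions VALUES
        (1, 45.50, '2023-04-19 17:05:00'),
        (1, 12.00, '2023-04-22 09:30:00'),
        (2, 30.00, '2023-04-20 12:00:00');
    INSERT INTO mortgages VALUES (1, 150000.00);
    """
)

# Count transactions per month, split by whether the customer has a mortgage.
# LEFT JOIN keeps customers without a mortgage (their mortgage columns are NULL).
rows = warehouse.execute(
    """
    SELECT
        strftime('%Y-%m', t.paid_at) AS month,
        CASE WHEN m.customer_id IS NULL THEN 'no mortgage' ELSE 'mortgage' END AS segment,
        COUNT(*) AS transactions
    FROM card_transactions AS t
    LEFT JOIN mortgages AS m ON m.customer_id = t.customer_id
    GROUP BY month, segment
    ORDER BY month, segment
    """
).fetchall()

for month, segment, transactions in rows:
    print(month, segment, transactions)
```

The query is only possible because both sources sit in one place and share the same customer_id definition, which is exactly what the warehouse provides.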

The data warehouse only holds historic data up to a point (ours typically keeps 2–3 years). This is where the data lake comes in.

Data lake

The data lake is basically a giant parking lot. Data lakes are mainly used for aged data where the chance of needing to access that data has gone down.

Going back to my example, you can imagine the number of card transactions that happen in the UK daily (multiple millions!). It isn’t practical to keep these in the data warehouse for a long time, as it slows things down and keeping that data can be expensive. After a fixed time, the card transaction data will be shifted from the data warehouse to the data lake. It is still accessible, but not in the tables that analysts will use day to day.
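The move from warehouse to lake can be sketched as a simple archive job. This is only an illustration under stated assumptions: the three-year retention window, the table names and the archive_old_transactions helper are all hypothetical, and I’m using a second sqlite3 database to stand in for the lake, whereas a real data lake is more often file or object storage.

```python
import sqlite3
from datetime import datetime, timedelta

RETENTION_DAYS = 3 * 365  # assumed retention window for the warehouse


def archive_old_transactions(warehouse, lake, today):
    """Copy transactions older than the retention window to the lake, then
    delete them from the warehouse. Returns the number of rows moved."""
    cutoff = (today - timedelta(days=RETENTION_DAYS)).strftime("%Y-%m-%d %H:%M:%S")
    old_rows = warehouse.execute(
        "SELECT customer_id, amount_gbp, paid_at FROM card_transactions WHERE paid_at < ?",
        (cutoff,),
    ).fetchall()
    # Write to the lake first, so nothing is deleted before it is safely stored.
    lake.executemany("INSERT INTO card_transactions_archive VALUES (?, ?, ?)", old_rows)
    lake.commit()
    warehouse.execute("DELETE FROM card_transactions WHERE paid_at < ?", (cutoff,))
    warehouse.commit()
    return len(old_rows)


warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE card_transactions (customer_id INTEGER, amount_gbp REAL, paid_at TEXT)"
)
warehouse.executemany(
    "INSERT INTO card_transactions VALUES (?, ?, ?)",
    [(1, 45.50, "2019-04-19 17:05:00"), (1, 12.00, "2023-04-22 09:30:00")],
)
lake = sqlite3.connect(":memory:")
lake.execute(
    "CREATE TABLE card_transactions_archive (customer_id INTEGER, amount_gbp REAL, paid_at TEXT)"
)

moved = archive_old_transactions(warehouse, lake, today=datetime(2023, 5, 1))
print(moved)
```

The old transaction is still there if someone needs it, just no longer in the table analysts query day to day.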

Image by author

As a business person or product owner, how much of this do I need to understand?

The reality is: not much. But there are a few ‘watch out’ areas that can affect whether the work is feasible or how long a request takes to complete.

If the data we need isn’t in the data warehouse or data lake, the data and analytics team will need to engage with the application owners to set up the process to transfer the data to the data warehouse. This can take time and might involve data-sharing agreements.

If the data is quite old and sits in the data lake, it is still available, but an analyst might need more time to extract it, so you may want to allow for that in your project plan.

In this article, we haven’t considered whether the data is stored on a physical server or in the cloud, but that’s for another day!

Thank you for reading, and do let me know if you have any feedback.

References:

Inmon, W.H. (2019) Data architecture: a primer for the data scientist. 2nd edition. San Diego, CA: Elsevier.

_________

I like to write about data science for business users and I’m passionate about using data to deliver tangible business benefits.

You can connect with me on LinkedIn and follow me on Medium to stay up to date with my latest articles.

Analytics @ HSBC, ex PwC. Passionate about making technical content easy to understand, visualisation and delivering commercial outcomes. All views are my own