
How Data Scientists Can Reduce Data Wrangling Time with a Data Mart

What a data mart is and why data scientists should use one

Photo by Dima Valkov from Pexels

As a data scientist, you can spend up to 80% of your time cleaning and transforming data to generate actionable insights and build machine learning models that create business impact. Now imagine a world where you can spend more time on analysis and model development instead of cleaning data. This can become a reality with a data mart: a subset of the data in a data warehouse built for a specific group of users or business unit.


Introduction

When I started as a data scientist, there was just raw data in the data warehouse with no ETL pipelines in place to create a single centralized table I could use to query customer information. Every time I needed customer data, I had to join multiple tables together and apply the proper business logic. This was tedious to rerun for every analysis. Eventually, I put these frequent queries into ETL pipelines and created an analytics data mart that helped reduce my data cleaning and preparation time by more than 50%. Now that you know the benefits of having a data mart, let me review the process I used to build one and how you can apply it in your company.
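To make the contrast concrete, below is a minimal sketch of the before-and-after, assuming a Postgres warehouse; the table names (raw.users, raw.logins, product_mart.users), columns, and connection string are hypothetical, not the exact tables I used.

```python
# Before the data mart: rejoin raw tables and reapply business logic for every analysis.
import pandas as pd
import sqlalchemy

# Hypothetical warehouse connection for illustration.
engine = sqlalchemy.create_engine("postgresql://user:password@warehouse:5432/analytics")

raw_customer_query = """
SELECT u.user_id,
       TRIM(u.email)     AS email,
       MAX(l.login_date) AS latest_login_date
FROM   raw.users u
LEFT JOIN raw.logins l ON l.user_id = u.user_id
WHERE  u.is_test_account = FALSE   -- business logic repeated in every query
GROUP  BY u.user_id, u.email
"""

# After the data mart: the ETL pipeline applies that logic once,
# so every analysis reads from a single centralized table.
mart_customer_query = "SELECT * FROM product_mart.users"

customers = pd.read_sql(mart_customer_query, engine)
```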

1. Determine the business unit and users for the data mart

The intended users will use the data mart to answer questions from stakeholders in the business unit. For example, you can build a data mart to answer questions from product managers about user behavior and engagement. The users of the data mart can be data scientists or data analysts working with product stakeholders.

2. Create a list of questions the data mart will be used to answer

This will determine the type of data you’ll have in the data mart. For example, the product data mart needs to answer questions about the number of daily signups, the number of weekly active users, and product A/B test results. I recommend starting with a common list of questions to create the initial version of the data mart and adding tables later as needed.

3. Document schema for data mart tables

Include as much information as possible in the schema document so it can serve as a reference for anyone with questions about the data in the future, instead of them having to ask you. Add any business logic that needs to be applied when reading in the data, such as filters and transformation logic, and note the time frame of data needed and the update frequency. Following the product data mart example from step 2, we'll need data sources related to signups, product behavior, and user experiments.

Below is an example of the user table schema, where I specified that the table should be updated daily. This is an important detail because it lets data engineers know how often to schedule the ETL job and lets users querying the data know how often it is refreshed.

I listed 5 fields with the field name, field type, and any business logic to apply, such as removing spaces from the email address and deriving the latest login date by taking the max of the login_date field from the logins table. Note that the last field is a reference field called update_date, which should be set to the last time the ETL ran for this table so the user knows when the data was last updated. ETL jobs occasionally fail, and this field helps you troubleshoot whether the table was refreshed for the day.

User table schema example created by the author
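For illustration, here is a hedged Python sketch of the ETL transformation behind that user table. The input DataFrames and the user_id join key are assumptions; only the email cleanup, latest_login_date derivation, and update_date logic come from the schema above.

```python
from datetime import date
import pandas as pd

def build_user_table(users_raw: pd.DataFrame, logins_raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the documented business logic to build the data mart's user table."""
    users = users_raw.copy()

    # Business logic: remove spaces from the email address.
    users["email"] = users["email"].str.replace(" ", "", regex=False)

    # Derive latest_login_date as the max of login_date per user from the logins table.
    latest_login = (
        logins_raw.groupby("user_id")["login_date"]
        .max()
        .reset_index()
        .rename(columns={"login_date": "latest_login_date"})
    )
    users = users.merge(latest_login, on="user_id", how="left")

    # Reference field: record when this ETL run refreshed the table.
    users["update_date"] = date.today()
    return users
```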

Another possible table for the data mart is a logins table to report weekly active users. However, instead of just creating a weekly active users table, it is more flexible to have a daily user login table, as I've shown below, and use it to build an aggregate table with the weekly active user (WAU) count. Notice the business logic for WAU is the distinct count of users whose login date falls between current date-7 and current date-1. We use current date-1 because the most recent data is typically from yesterday, and counting back from yesterday through the prior six days gives us the 7 days needed to calculate WAU.

When deciding on tables for the data mart, the more granular the time period, the better, because it gives you more flexibility to answer questions about any time window.

Logins and WAU table schema example created by the author
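As a sketch of that aggregation, the WAU calculation against the daily logins table could look like the query below; the product_mart.logins table name and the connection string are assumptions for illustration.

```python
import pandas as pd
import sqlalchemy

# Hypothetical warehouse connection for illustration.
engine = sqlalchemy.create_engine("postgresql://user:password@warehouse:5432/analytics")

# Distinct users who logged in during the 7 days ending yesterday
# (current date-7 through current date-1).
wau_query = """
SELECT CURRENT_DATE - 1        AS wau_date,
       COUNT(DISTINCT user_id) AS wau
FROM   product_mart.logins
WHERE  login_date BETWEEN CURRENT_DATE - 7 AND CURRENT_DATE - 1
"""

wau = pd.read_sql(wau_query, engine)
```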

4. Create sample tables according to the schema document

After the table schemas are documented, it's time to write the code to create sample tables. These sample tables can be created by you or by a data engineer. If a data engineer creates them, ask them to source production data so you can validate the contents; I've had times when data engineers used test data and all I could do was validate the table schema. After the sample tables pass your QA checks, work with the data engineer to backfill any history if needed and then have them put the ETL code into production.
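To give a sense of what those QA checks might look like, here is a hedged sketch against the user table from step 3; treating user_id as the primary key is an assumption, and the other checks follow the documented business logic.

```python
from datetime import date
import pandas as pd

def qa_user_table(users: pd.DataFrame) -> None:
    """Sanity-check a sample user table before the ETL goes to production."""
    # Fields from the schema doc that must be present (partial list).
    assert {"email", "latest_login_date", "update_date"}.issubset(users.columns), \
        "missing fields from the schema doc"

    # Assumed primary key should be non-null and unique.
    assert users["user_id"].notna().all(), "null user_id values"
    assert users["user_id"].is_unique, "duplicate user_id values"

    # Business logic check: spaces should already be removed from emails.
    assert not users["email"].str.contains(" ", regex=False, na=False).any(), \
        "emails still contain spaces"

    # The reference field should reflect the most recent ETL run (here: today).
    assert (pd.to_datetime(users["update_date"]).dt.date == date.today()).all(), \
        "table was not refreshed today"
```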

Final Thoughts

As a data scientist, having a data mart dramatically boosted my productivity because I could spend less time cleaning and transforming data and more time on data analysis and developing machine learning models to drive business impact. Building a data mart may sound intimidating, but it will be worth the effort in the long run, helping you and your stakeholders get more insights in less time.

