From raw Salesforce case data to organized dimension and fact tables for your data warehouse.

Alright, so about 3 months ago, a stakeholder wanted to understand why customers have been churning from product XYZ. Churn is the name we give for situations where a customer cancels a subscription to a service or product we provide. It is pretty fair for this stakeholder to be curious about this straight forward question and integral component of the business. Of course a company of Shopify’s size and importance will have beautifully modelled data for this already. Surely the data will be in a ready-to-go, self-serve report so we can focus on more important things like machine learning algorithms for solving world hunger, right?
Nope. At least not for product XYZ.
So today, I am writing about my experience in building out a dimensional model for product churn and not my machine learning algorithm for world hunger, that will have to wait.
By the end of this you will have a good understanding of how and why we did things the way we did and hopefully may have an idea of how to apply this to your business!
Square One.
Like a lot of businesses these days, Shopify manages a lot of their customer profiles, relationships and interactions in Salesforce. Shopify uses ‘cases’ in Salesforce to record, review and approve churn requests for product XYZ.
Given the above, our starting point for understanding customer churn sits in a raw unclean salesforce cases dataset and looks something like this;
At first glance, this doesn’t look like too bad of a dataset we have a lot of the important keys like customer_id and date which we can use to get more information on these churn cases. However, when we try and visualize this we will run into some issues:
- There is varying case styles (feature X != Feature X)
- We have alike reasons under different labels (Pricing, Low Perceived Value). We should probably create a map to fit alike churn reasons into overarching categories.
- We should send customer 105 my new hot fire playlist 🔥
Given the above, and because customer churn is an integral part of the business that we will want to understand long term, it makes sense that we make the upfront investment and create a clean dimensional model for customer churn.
So, what is a dimensional model you may ask? Google will tell you;
Dimensional Modelling (DM) is a data structure technique optimized for data storage in a Data warehouse. The purpose of dimensional modelling is to optimize the database for faster retrieval of data. The concept of Dimensional Modelling was developed by Ralph Kimball and consists of "fact" and "dimension" tables.
Fast and easy retrieval of data is important to making great decisions quickly, a key cultural value at Shopify. So we need data that is easily understood by an analyst who may be new to the subject of product XYZ or customer churn and creates a mostly frictionless path to answer questions like:
- How does churn trend over time?
- Which customers churned due to pricing?
- What is the most prominent churn reason?
- How long does it take for a customer to churn?
So how do we do this? Let me take you through my process.
Understand the business process.
The starting point for me was getting a good understanding of the business process and how that relates to our dataset. In our case, the business process was the submission of a churn request by the account management team. The fields that they used to submit the case would appear in our cases dataset. I mapped out this business process for us below.

Given our input dataset, and the flow above, we know the following:
- Churn requests can be cancelled
- Churn requests require approval before becoming completed
- A single churn request can be associated with multiple reasons
- Churn reasons need to be formatted and grouped
So now that we think we understand the business process, I always go check with the team who owns the process and my senior data scientists so they can tell me where I am wrong. After a few cycles of iterating on their feedback and getting everyone aligned we can move onto the next step.
Design your dimensional model.
At this point, we know the questions that we want to answer, we know the business process and it’s now time to start scheming up our data warehouse design.
To be able to answer the questions that we outlined above, we know that we will probably need the following columns in our model.
Timestamps:
- Request Opened At
- Request Closed At
Descriptive Data:
- Churn Reason (Feature X, Feature Y, Pricing, Hated Mike’s playlist)
- Churn Reason Category (Product Limitations, Low Perceived Value, Support Experience)
- Customer details
Current state data:
- Was the request successfully completed, cancelled or pending?
On top of the above we will leverage already existing customer, employee, date, and region dimensions to get other insights that would be helpful to understand churn, like customer age or region.
Knowing that we have a mix of transactional data, (a churn request can be thought of as a transaction), and descriptive data that goes along with these transactions (Churn Reason, Churn Reason Category), this is a pretty good indication that we will need both a fact and dimension.
In a few words, we can think of a fact table as a place where you can store measurements and a dimension where you store descriptive attributes associated with the facts measurement. In this exercise, we will have a dimension for the churn reason, which is an attribute of the churn fact, but as with most facts, we would also be able to join on customer dimensions, date dimension and country dimensions for example.
The caveat with our dataset (there always is one) is that unfortunately there is a one-to-many relationship between the business process (a churn request) and the attribute we would like to measure; churn reason. Lucky for us, Ralph Kimball thought of this and gave us the gift of Bridge Dimensions. We can use Bridge Dimensions to parse out the multi-value row into individual rows so we can cleanly join a cases churn reasons on a key to the churn reason dimension. Our end relationship will look like this:

Here is how our example data will flow through our data model.

In this relationship we can connect the Churn request fact to the churn reason dimension via the bridge dimension and the case_id / churn_reason_key.
Before we build however, how does our model look in practice?
Build SQL prototypes.
Now that we know the questions that our model should answer, and we understand the columns and tables that our model will provide us with. Let’s check to make sure that the questions which we would like to answer get answered!
- Churns over time?
- List of completed customer churns and the associated reasons?
There will be lots of ways to cut and splice this data, but our core questions get addressed by this design – which is what we need to be sure of.
Check it & Ship it!
Great, so to recap;
- We understand the business process
- We understand the questions that we want to answer
- We have a design which makes it possible to answer those questions
As a final sanity check, I looped in the stakeholders to get feedback on the following
- Are there any core questions that this model does not include that need to be addressed?
- Is this model flexible and able to answer questions down to as small as grain as necessary?
- What might be some future uses of this data? If it may end up getting rolled up into a customer dimension for example – does this design make that easy/possible?
Once everyone is aligned, it’s time to start building! Go get it.
Comment and let me know if you would be interested in seeing what the build looked like for this project in a follow up post!