Advance DAX Tutorial: Basket Analysis 2.0

Explore the potential of products through customer purchase behaviour in Power BI

Davis Zhang
Towards Data Science

--

This article aims to use DAX to analyze customer purchase behavior in Power BI and to gain insight into product potential.

Marco Russo and Alberto Ferrari have published a blog a few years ago called “Basket Analysis”, this interesting article describes in detail how to use DAX to calculate very useful measures such as the number of orders and the number of customers under any product portfolio. This article can be seen as an extension of “Basket Analysis”, which takes into account the chronological order in which customers purchase different products.

Compared with “Basket Analysis”

Assuming that A and B represent two different products, then “Basket Analysis” calculates P(AB), while this paper calculates P(A|B) and P(B|A), as you can compare the two figures shown below:

Image by author

The figure above is the measure of “Customers with Both Products” in “Basket Analysis”, which shows that 72 customers have purchase records for both “Bottles and Cages” and “Bike Racks”. However, the data shown in the figure below takes into account the chronological order in which customers purchase products. You can find that 8 of the customers who bought Bike Racks first and then purchased Bottles and Cages later, and 14 of the customers who bought Bottles and Cages first but bought Bike Racks later.(Note: We temporarily ignore the situation of buying A and B at the same time)

Image by author

Why this analysis meaningful?

The customer’s order record reflects some very useful facts that give direction to the correlation between products. In other words, “basket analysis” is very useful when analyzing supermarket data because customers often choose multiple products at the time of shopping and then go to the cashier to place an order together. In this case, all products are treated as simultaneous orders. But in fact, you can’t trace the record of different products selected by customers during the shopping process in the supermarket. But if it is in other scenarios, such as customers ordering on the e-commerce platform or the official website, if you as the store manager, you may want to know that A and B are the best-selling models, which one can bring more return customers, which one is easier to lose customers. Therefore, we need to know the repurchase % of each product. For example, all customers who purchase products A first, how many people will come back to buy products again in the future, further analysis, in these people, the purchase is still product A or other products? What is the proportion of each, this is a question worth studying.

The calculation process.

Following the calculation process, we will finally achieve the calculation results shown in the figure below (Note: I use the same data set as “basket analysis”):

Image by author

As mentioned earlier, it shows which customers who purchased the product A first and have subsequent purchase records, how much of them purchased the product B, or product C, etc.

Therefore, in order to achieve this calculation result, there are five steps here:

1. First, classify all orders for the sales table, in all orders of the customer, one or more orders with the earliest order date are classified as the first order, the rest are “non-first”:

IsFirstOrder = 
VAR
E_Date = 'Sales'[OrderDateKey]
VAR
CUST = 'Sales'[CustomerKey]
RETURN
IF(
SUMX(
FILTER('Sales',
CUST = 'Sales'[CustomerKey]&&
E_Date > 'Sales'[OrderDateKey]),
COUNTROWS('Sales'))>0,FALSE,TRUE)

2. Filter the order data of all products A in Sales, and then further filter which orders are marked as the customer’s first order, and we extract the customer list in this filtered table, and add a virtual column named “ROWS” to it, as shown in the following code — virtual table “VT1”.

3. Use Sales as the main table and use NATURALLEFTOUTERJOIN() to associate with the virtual table “VT1”, then use filter() to exclude those rows whose [ROWS] value is not equal to 1 so that the rest of the data (VT2) is all orders for all customers returned by “VT1”. Finally, the data is further filtered for all orders except “first-order”, and the result is named “CustDistinctValue”:

CustDistinctValue = 
VAR
FIRSTORDERPROD =
IF(HASONEVALUE('Product'[Subcategory]),
VALUES('Product'[Subcategory]),0)
VAR
VT1 =
SUMMARIZE(
FILTER(Sales,
AND(related('Product'[Subcategory]) = FIRSTORDERPROD,
'Sales'[IsFirstOrder]=TRUE)),
'Sales'[CustomerKey],
"ROWS",
DISTINCTCOUNT(Sales[CustomerKey]))
VAR
VT2 =
FILTER(
NATURALLEFTOUTERJOIN(ALL(Sales),VT1),
[ROWS] = 1)
RETURN
CALCULATE(
DISTINCTCOUNT('Sales'[CustomerKey]),
FILTER(VT2,'Sales'[IsFirstOrder] = FALSE)
)

4. After that, we need to make sure this data can be filtered by the product (in this case we just use sub-category). Here is basically the same as Macro’s calculation method, using the copy of the product table (Filter Product) and the main table to establish a non- Active relationship, then create a measure so that its context ignores all fields of the product table, and accepts the context from its copy (Filter Product).

CustPurchaseOthersSubcategoryAfter = 
VAR CustPurchaseOthersSubcategoryAfter =
CALCULATE (
'Sales'[CustDistinctValue],
CALCULATETABLE (
SUMMARIZE ( Sales, Sales[CustomerKey] ),
'Sales'[IsFirstOrder] = FALSE,
ALLSELECTED ('Product'),
USERELATIONSHIP ( Sales[ProductCode],
'Filter Product'[Filter ProductCode] )
)
)
RETURN
IF(NOT([SameSubCategorySelection]),
CustPurchaseOthersSubcategoryAfter)

Note: “SameSubCategorySelection” is used to exclude the data of choosing the same Sub-Category. This formula also uses Macro’s method to complete:

SameSubCategorySelection = 
IF (
HASONEVALUE ( 'Product'[Subcategory] )
&& HASONEVALUE ( 'Filter Product'[Filter Subcategory] ),
IF (
VALUES ( 'Product'[Subcategory])
= VALUES ( 'Filter Product'[Filter Subcategory] ),
TRUE
)
)

5. Now, we have figured out how many of the customers who purchased Product A first purchased each of the other products, and now we need to calculate the proportion of these customers who accounted for the total number of customers who purchased Product A first and then had a purchase record. The following is the code for calculating the denominator of this proportion.

AsFirstOrderCust = 
VAR
FIRSTORDERPROD =
IF(
HASONEVALUE('Product'[Subcategory]),
VALUES('Product'[Subcategory]),0)
VAR
VT1 =
SUMMARIZE(
FILTER(Sales,
AND(
RELATED('Product'[Subcategory]) = FIRSTORDERPROD,
'Sales'[IsFirstOrder]=TRUE)),
'Sales'[CustomerKey]
)
return
CALCULATE(
DISTINCTCOUNT('Sales'[CustomerKey]),
VT1)
-------------------------------------------------------------------------------
IsLastOrder =
VAR
E_Date = 'Sales'[OrderDateKey]
VAR
CUST = 'Sales'[CustomerKey]
RETURN
IF(
SUMX(
FILTER('Sales',
CUST = 'Sales'[CustomerKey]&&
E_Date < 'Sales'[OrderDateKey]),
COUNTROWS('Sales'))>0,"F","T")
-------------------------------------------------------------------------------
AsFirstOrderCustRepurchase =
CALCULATE(
'Sales'[AsFirstOrderCust],
'Sales'[IsLastOrder] = "F")

Now we get the final result: CustPurchaseOthersSubCategoryAfter %, the name of this measure is very long, because its logic is complex, just like the calculation process above.

CustPurchaseOthersSubCategoryAfter % = 
DIVIDE ( 'Sales'[CustPurchaseOthersSubcategoryAfter],
'Sales'[AsFirstOrderCustRepurchase])

The final results.

In the end, we will succeed in getting the final result as below and choose to visualize it using a custom-visuals called “CHORD”.

Image by author
Image by author

As you can see, the customers who bought road bikes first, 1853 of them bought mountain bikes later, while the interesting thing is only 200 customers bought road bikes after they have bought the mountain bikes.

I am very grateful to Gerhard’s advice before, this time I have attached the PBIX file in the article, you can download it here if you are interested in it.

End~

--

--