Building and Optimizing Multiple Linear Regression in PowerBI using DAX

Davis Zhang
Towards Data Science
7 min read · Jun 4, 2020


This article introduces how to use DAX to build a multiple linear regression model in Power BI for time series analysis, and then brings the ideas of "ridge regression" and "lasso regression" into model optimization.

Formula Derivation of Multiple Linear Regression

The multiple linear regression formula is as follows:

Image by Wikipedia

DAX cannot perform matrix operations, so the regression formula refers to Cramer's rule. Taking binary regression (two independent variables) as an example, the principle is to obtain the optimal solutions of β0, β1, and β2 from the data set. The formulas can be derived as follows:

Image by Wikipedia

We can complete the regression simply by creating new measures in DAX according to these formulas.
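
Before writing the DAX, the closed-form expressions above can be sanity-checked outside Power BI. The following Python sketch (synthetic data; NumPy assumed available) implements the same centered-sum formulas and cross-checks them against NumPy's least-squares solver:

```python
import numpy as np

def binary_ols(x1, x2, y):
    """Closed-form two-regressor OLS, using the centered-sum formulas."""
    x1, x2, y = map(np.asarray, (x1, x2, y))
    dx1, dx2, dy = x1 - x1.mean(), x2 - x2.mean(), y - y.mean()
    s11, s22 = (dx1 ** 2).sum(), (dx2 ** 2).sum()   # Sum_X1_2, Sum_X2_2
    s12 = (dx1 * dx2).sum()                          # X12
    s1y, s2y = (dx1 * dy).sum(), (dx2 * dy).sum()    # Sum_X1Y, Sum_X2Y
    den = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / den
    b2 = (s2y * s11 - s1y * s12) / den
    b0 = y.mean() - b1 * x1.mean() - b2 * x2.mean()
    return b0, b1, b2

# Synthetic data with known coefficients: y = 3 + 2*x1 - 0.5*x2
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 50)
x2 = rng.uniform(0, 10, 50)
y = 3 + 2 * x1 - 0.5 * x2

b0, b1, b2 = binary_ols(x1, x2, y)

# Cross-check against NumPy's general least-squares solver
X = np.column_stack([np.ones_like(x1), x1, x2])
ref = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose([b0, b1, b2], ref))  # True
```

The variable names (`s11`, `s12`, `s1y`, …) mirror the measures we are about to create in DAX.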

About the Data Set of this Article

The data set in this article comes from the Tableau forum (Superstore.xls).

Image by author

Here, “Order Date” will be the first independent variable and “Sales” the dependent variable (i.e., the prediction target).

Building the Multiple Linear Regression Model

As the figure below shows, the sales data follows a clear yearly pattern, so we can use “Month” as the second independent variable.

Image by author

Based on the above linear regression formula, the code is as follows:

Binary Linear Regression = 
VAR VT =
    FILTER (
        SELECTCOLUMNS (
            SUMMARIZE ( ALLSELECTED ( 'Date' ), 'Date'[Date], 'Date'[Month Number] ),
            "VT[X1]", 'Date'[Date],
            "VT[X2]", 'Date'[Month Number],
            "VT[Y]", 'Orders'[Qty]
        ),
        AND (
            NOT ( ISBLANK ( VT[X1] ) ),
            AND (
                NOT ( ISBLANK ( VT[X2] ) ),
                NOT ( ISBLANK ( VT[Y] ) )
            )
        )
    )
VAR Average_X1 =
    AVERAGEX ( VT, VT[X1] )
VAR Average_X2 =
    AVERAGEX ( VT, VT[X2] )
VAR Average_Y =
    AVERAGEX ( VT, VT[Y] )
VAR Sum_X1_2 =
    SUMX ( VT, ( VT[X1] - Average_X1 ) ^ 2 )
VAR Sum_X2_2 =
    SUMX ( VT, ( VT[X2] - Average_X2 ) ^ 2 )
VAR Sum_X1Y =
    SUMX ( VT, ( VT[X1] - Average_X1 ) * ( VT[Y] - Average_Y ) )
VAR Sum_X2Y =
    SUMX ( VT, ( VT[X2] - Average_X2 ) * ( VT[Y] - Average_Y ) )
VAR X12 =
    SUMX ( VT, ( VT[X1] - Average_X1 ) * ( VT[X2] - Average_X2 ) )
VAR Beta1 =
    DIVIDE (
        Sum_X1Y * Sum_X2_2 - Sum_X2Y * X12,
        Sum_X1_2 * Sum_X2_2 - X12 ^ 2
    )
VAR Beta2 =
    DIVIDE (
        Sum_X2Y * Sum_X1_2 - Sum_X1Y * X12,
        Sum_X1_2 * Sum_X2_2 - X12 ^ 2
    )
VAR Intercept =
    Average_Y - Beta1 * Average_X1 - Beta2 * Average_X2
VAR Result =
    SUMX (
        SUMMARIZE ( 'Date', 'Date'[Date], 'Date'[Month Number] ),
        Intercept + Beta1 * 'Date'[Date] + Beta2 * 'Date'[Month Number]
    )
RETURN
    Result

The red curve in the figure below is the binary linear regression model we just built.

Evaluating the Model

Now that the model has been built, we need to evaluate it.

I introduce two indicators:

1. RMSE (root mean square error) — the smaller, the better

Image by Wikipedia

2. R² (goodness of fit) — usually, the closer to 1, the better

Image by Wikipedia

According to the formulas above, the code can be completed as follows.

RMSE = 
VAR VT =
    SUMMARIZE ( ALLSELECTED ( 'Date' ), 'Date'[Date], 'Date'[Month Number] )
RETURN
    SQRT (
        DIVIDE (
            SUMX ( VT, ( 'Orders'[Binary Linear Regression] - 'Orders'[Qty] ) ^ 2 ),
            COUNTROWS ( VT )
        )
    )
--------------------------------------------------------------------------------------------------
R^2 = 
VAR VT =
    SUMMARIZE ( ALLSELECTED ( 'Date' ), 'Date'[Date], 'Date'[Month Number] )
VAR ESS =
    SUMX ( VT, POWER ( 'Orders'[Binary Linear Regression] - AVERAGEX ( VT, [Qty] ), 2 ) )
VAR TSS =
    SUMX ( VT, POWER ( [Qty] - AVERAGEX ( VT, [Qty] ), 2 ) )
RETURN
    DIVIDE ( ESS, TSS )
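
For reference, the two measures can be expressed in plain Python (a sketch with made-up numbers: `y_hat` stands in for the 'Binary Linear Regression' measure and `y` for 'Orders'[Qty]):

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean square error: sqrt of the mean squared residual."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.sqrt(np.mean((y_hat - y) ** 2))

def r_squared(y, y_hat):
    """R² as ESS / TSS, matching the DAX measure above."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    ess = np.sum((y_hat - y.mean()) ** 2)   # explained sum of squares
    tss = np.sum((y - y.mean()) ** 2)       # total sum of squares
    return ess / tss

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])
print(round(rmse(y, y_hat), 4))       # 0.1581
print(round(r_squared(y, y_hat), 4))  # 0.9
```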

After running the code, we can see that the model's fit is not encouraging.

Image by author

Optimizing the Model

There are two main methods for optimizing linear regression: lasso regression and ridge regression, which add L1 and L2 regularization to the model, respectively. We will not go deep into the statistics; we only care about how to use DAX to solve the model-optimization problem efficiently.
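
For intuition only (this is not part of the DAX solution), here is a minimal NumPy sketch of what ridge's L2 penalty does: solving (XᵀX + λI)β = Xᵀy on centered data shrinks the coefficients toward zero as λ grows.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge regression on centered data:
    beta = (X'X + lam*I)^(-1) X'y, with an unpenalized intercept."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y.mean() - X.mean(axis=0) @ beta
    return intercept, beta

# Synthetic data: y = 3 + 2*x1 - 0.5*x2 + noise
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 3 + 2 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

for lam in (0.0, 10.0, 1000.0):
    _, beta = ridge(X, y, lam)
    print(lam, np.round(beta, 3))
# Coefficients shrink toward zero as the penalty grows
```

Lasso has no closed form (its L1 penalty requires iterative solvers), which is one more reason the DAX approach below sidesteps the penalty term entirely.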

In fact, both methods shrink certain coefficients of the model by adding a penalty term; lasso can compress unimportant coefficients all the way to 0. The algorithm is therefore essentially a selective compression of each β coefficient in the model. In addition, DAX is not a language designed for algorithms, so, considering query performance, we can simply multiply each coefficient of the model by a variable to achieve a similar effect without performing complex derivative calculations. By creating parameters in Power BI as the values of these variables, we can adjust them in the visual interface and optimize the model manually. We therefore modify the code of the original regression model:

Manual Binary Regression = 
VAR R = '_Slope'[Regular factor Value]
VAR A = 'α'[α Value]
VAR B = 'β'[β Value]
VAR VT =
    FILTER (
        SELECTCOLUMNS (
            SUMMARIZE ( ALLSELECTED ( 'Date' ), 'Date'[Date], 'Date'[Month Number] ),
            "VT[X1]", 'Date'[Date],
            "VT[X2]", 'Date'[Month Number],
            "VT[Y]", 'Orders'[Qty]
        ),
        AND (
            NOT ( ISBLANK ( VT[X1] ) ),
            AND (
                NOT ( ISBLANK ( VT[X2] ) ),
                NOT ( ISBLANK ( VT[Y] ) )
            )
        )
    )
VAR Average_X1 =
    AVERAGEX ( VT, VT[X1] )
VAR Average_X2 =
    AVERAGEX ( VT, VT[X2] )
VAR Average_Y =
    AVERAGEX ( VT, VT[Y] )
VAR Sum_X1_2 =
    SUMX ( VT, ( VT[X1] - Average_X1 ) ^ 2 )
VAR Sum_X2_2 =
    SUMX ( VT, ( VT[X2] - Average_X2 ) ^ 2 )
VAR Sum_X1Y =
    SUMX ( VT, ( VT[X1] - Average_X1 ) * ( VT[Y] - Average_Y ) )
VAR Sum_X2Y =
    SUMX ( VT, ( VT[X2] - Average_X2 ) * ( VT[Y] - Average_Y ) )
VAR X12 =
    SUMX ( VT, ( VT[X1] - Average_X1 ) * ( VT[X2] - Average_X2 ) )
VAR Beta1 =
    DIVIDE (
        Sum_X1Y * Sum_X2_2 - Sum_X2Y * X12,
        Sum_X1_2 * Sum_X2_2 - X12 ^ 2
    ) * A
VAR Beta2 =
    DIVIDE (
        Sum_X2Y * Sum_X1_2 - Sum_X1Y * X12,
        Sum_X1_2 * Sum_X2_2 - X12 ^ 2
    ) * B
VAR Intercept =
    Average_Y - Beta1 * Average_X1 - Beta2 * Average_X2
VAR Result =
    SUMX (
        SUMMARIZE ( 'Date', 'Date'[Date], 'Date'[Month Number] ),
        Intercept + Beta1 * 'Date'[Date] + Beta2 * 'Date'[Month Number]
    )
RETURN
    Result * ( 1 - 1 / R )

The code runs as follows (note: the adjusted model is shown as the dark blue curve, and RMSE and R² now show the evaluation values for this model):

Image by author

(Note: “α” is used to adjust the first coefficient; “β” is used to adjust the second coefficient.)

So far, we have completed the manual optimization of the binary linear regression model: we can adjust the parameters and then evaluate the results by observing RMSE and R².

The article could have ended here, but manual parameter tuning feels a bit blind; we need an intuitive, scientific reference to make the adjustment efficient. So I made a model-optimization reference graph. Taking the first independent variable (order date) as an example, we plot its scaling parameter on the X-axis and R² and RMSE on the Y-axis, which gives an intuitive reference for model optimization:

Image by author

(Note: Since the value of RMSE is much larger than the value of R², the RMSE values in the figure above are compressed to one-tenth of their original size so that RMSE and R² are easier to observe together. You could also consider a dual-axis chart.)
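
The reference-graph idea can be prototyped in Python: fit once, sweep a scaling factor α over the first coefficient, and record RMSE for each candidate value (synthetic data; a sketch of the approach, not the DAX itself):

```python
import numpy as np

# Synthetic data: y = 5 + 1.5*x1 + 0.8*x2 + noise
rng = np.random.default_rng(2)
x1 = rng.uniform(0, 10, 80)
x2 = rng.uniform(0, 10, 80)
y = 5 + 1.5 * x1 + 0.8 * x2 + rng.normal(scale=0.5, size=80)

# Ordinary least-squares fit
X = np.column_stack([np.ones_like(x1), x1, x2])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

# Sweep the scaling factor alpha applied to the first coefficient,
# recording RMSE for each candidate value (the 'reference graph')
sweep = {}
for alpha in np.linspace(0.5, 1.5, 11):
    y_hat = b0 + alpha * b1 * x1 + b2 * x2
    sweep[round(alpha, 2)] = np.sqrt(np.mean((y_hat - y) ** 2))

best = min(sweep, key=sweep.get)
print(best)  # 1.0 — with the other coefficients fixed, the unscaled fit minimizes RMSE
```

On synthetic data the unscaled OLS fit wins, as expected; on real data with a changing trend, the sweep can reveal a better scaling, which is exactly what the reference graph is for.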

Summary

So far, we know how to use DAX for multiple linear regression and model optimization. Although the code is long, it is not difficult to understand. You may wonder why we don't use the R language for linear regression. Yes, it is easier to run a regression with R in Power BI. But the Power BI service currently supports only part of the R libraries, and if you use the Report Server version, as of April 2019, Power BI Report Server does not support any R libraries (including custom visuals developed in R). Unless you only need the desktop version, your only option is to complete all the modeling work in DAX. Besides, this is a good opportunity to deepen your understanding of DAX and linear regression.

If you have a better way to implement time series prediction in Power BI, I hope you will share it.

Appendix: ternary (three-variable) linear regression code

Ternary Linear Regression = 
VAR VT =
    FILTER (
        SELECTCOLUMNS (
            SUMMARIZE (
                ALLSELECTED ( 'Date' ),
                'Date'[Date],
                'Date'[Month Number],
                DateFactorTable[DateFactor]
            ),
            "VT[X1]", 'Date'[Date],
            "VT[X2]", 'Date'[Month Number],
            "VT[X3]", DateFactorTable[DateFactor],
            "VT[Y]", 'Orders'[Qty]
        ),
        AND (
            NOT ( ISBLANK ( VT[X1] ) ),
            AND (
                NOT ( ISBLANK ( VT[X2] ) ),
                AND (
                    NOT ( ISBLANK ( VT[X3] ) ),
                    NOT ( ISBLANK ( VT[Y] ) )
                )
            )
        )
    )
VAR N =
    COUNTROWS ( VT )
// Centered sums: L_ij = Σ(x_i * x_j) - Σ(x_i) * Σ(x_j) / N
VAR L11 =
    SUMX ( VT, VT[X1] ^ 2 ) - DIVIDE ( SUMX ( VT, VT[X1] ) ^ 2, N )
VAR L22 =
    SUMX ( VT, VT[X2] ^ 2 ) - DIVIDE ( SUMX ( VT, VT[X2] ) ^ 2, N )
VAR L33 =
    SUMX ( VT, VT[X3] ^ 2 ) - DIVIDE ( SUMX ( VT, VT[X3] ) ^ 2, N )
VAR L12 =
    SUMX ( VT, VT[X1] * VT[X2] ) - DIVIDE ( SUMX ( VT, VT[X1] ) * SUMX ( VT, VT[X2] ), N )
VAR L13 =
    SUMX ( VT, VT[X1] * VT[X3] ) - DIVIDE ( SUMX ( VT, VT[X1] ) * SUMX ( VT, VT[X3] ), N )
VAR L32 =
    SUMX ( VT, VT[X3] * VT[X2] ) - DIVIDE ( SUMX ( VT, VT[X3] ) * SUMX ( VT, VT[X2] ), N )
VAR L10 =
    SUMX ( VT, VT[X1] * VT[Y] ) - DIVIDE ( SUMX ( VT, VT[X1] ) * SUMX ( VT, VT[Y] ), N )
VAR L20 =
    SUMX ( VT, VT[X2] * VT[Y] ) - DIVIDE ( SUMX ( VT, VT[X2] ) * SUMX ( VT, VT[Y] ), N )
VAR L30 =
    SUMX ( VT, VT[X3] * VT[Y] ) - DIVIDE ( SUMX ( VT, VT[X3] ) * SUMX ( VT, VT[Y] ), N )
VAR D4 =
    L30 * L12 * L11 * L22 - L30 * L12 ^ 3 - L11 * L22 * L13 * L20 + L13 * L20 * L12 ^ 2
        - L11 * L20 * L30 * L22 + L11 * L20 ^ 2 * L13 + L12 * L10 * L30 * L22 - L12 * L10 * L13 * L20
VAR D3 =
    L32 * L13 * L11 * L22 - L32 * L13 * L12 ^ 2 - L12 * L33 * L11 * L22 + L12 ^ 3 * L33
        - L30 * L22 * L12 * L13 + L13 ^ 2 * L20 + L11 * L32 * L20 * L22 - L11 * L32 * L13 * L20
VAR BETA3 =
    DIVIDE ( D4, D3 )
VAR D2 =
    ( L12 * L13 - L11 * L32 ) * BETA3 + L11 * L20 - L12 * L10
VAR D1 =
    L11 * L22 - L12 ^ 2
VAR BETA2 =
    DIVIDE ( D2, D1 )
VAR D0 =
    L30 - BETA2 * L32 - BETA3 * L33
VAR BETA1 =
    DIVIDE ( D0, L13 )
VAR Intercept =
    AVERAGEX ( VT, VT[Y] )
        - BETA1 * AVERAGEX ( VT, VT[X1] )
        - BETA2 * AVERAGEX ( VT, VT[X2] )
        - BETA3 * AVERAGEX ( VT, VT[X3] )
VAR Result =
    SUMX (
        SUMMARIZE (
            'Date',
            'Date'[Date],
            'Date'[Month Number],
            DateFactorTable[DateFactor]
        ),
        Intercept
            + BETA1 * 'Date'[Date]
            + BETA2 * 'Date'[Month Number]
            + BETA3 * DateFactorTable[DateFactor]
    )
RETURN
    Result
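
The elimination above becomes much shorter once matrix operations are available. As a cross-check of the L_ij construction (L_ij = Σx_i·x_j − Σx_i · Σx_j / N, i.e., sums of centered cross-products), here is a Python sketch solving the same normal equations with synthetic data:

```python
import numpy as np

def ternary_ols(x1, x2, x3, y):
    """Three-regressor OLS via the centered normal equations
    (the L_ij sums in the DAX measure), solved with linalg.solve."""
    X = np.column_stack([x1, x2, x3]).astype(float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)      # centered regressors
    yc = y - y.mean()
    L = Xc.T @ Xc                # matrix of L_ij cross-product sums
    Ly = Xc.T @ yc               # the L_i0 terms (cross-products with y)
    beta = np.linalg.solve(L, Ly)
    intercept = y.mean() - X.mean(axis=0) @ beta
    return intercept, beta

# Synthetic data with known coefficients: y = 1 + 0.5*x1 + 2*x2 - 1.5*x3
rng = np.random.default_rng(3)
x1, x2, x3 = rng.uniform(0, 10, (3, 60))
y = 1 + 0.5 * x1 + 2 * x2 - 1.5 * x3

b0, beta = ternary_ols(x1, x2, x3, y)
print(round(b0, 3), np.round(beta, 3))
```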

End~
