Modeling Marketing Mix Using Smoothing Splines

Capturing non-linear advertising saturation and diminishing returns without explicitly transforming media variables

Slava Kisilevich
Towards Data Science



The established approach among marketers for modeling the marketing mix is to apply linear regression models, which assume that the relationship between marketing activities, such as advertisement spend, and the response variable (sales, revenue) is linear. Prior to modeling, media spend variables must undergo two transformations to properly capture the carryover and saturation effects of advertisement spend. It is known that advertisement spend does not relate linearly to the response variable and follows the law of diminishing returns. However, the functional form of the saturation curve is not known in advance, so the modeler must first hypothesize about the transformation functions to apply to each media activity channel to match the true spend-to-response relationship. In this article, I show an alternative approach to modeling the marketing mix using smoothing splines, a way to model non-linear relationships between dependent and independent variables within the framework of a linear model. With this approach, the model establishes the non-linear relationship between media activity variables and the response variable without the need to transform the independent variables.

The intuition behind the transformation of media activity variables

Let’s understand by example why the transformation of media activity variables is a necessary step in the framework of a linear model.

I generate a media variable with advertisement spend whose saturation effect is described by the Hill function with two parameters: γ, which controls the inflection point, and α, which controls the shape of the curve.

import numpy as np

#simulate 1000 advertisement spend values
spend = np.random.rand(1000)
noise = np.random.normal(size = 1000, scale = 0.2)
#hill transformation
alpha = 2
gamma = 0.1
spend_transformed = spend ** alpha / (spend ** alpha + gamma ** alpha)
#response as a linear function of the transformed spend plus noise
response = 1 + 3 * spend_transformed + noise

For an overview of different transformation functions that can be used to model saturation effects and diminishing returns, you may check the following article:

Let’s plot the spend-to-response relationship:
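A minimal plotting sketch (I use plotnine here, as later in the article; the data frame construction is mine):

import pandas as pd
from plotnine import ggplot, aes, geom_point

df = pd.DataFrame({"spend": spend, "response": response})
ggplot(df, aes("spend", "response")) + geom_point(alpha = 0.3)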

Image by Author

Now, what happens when we fit a linear regression model without first transforming our independent variable?

import statsmodels.api as sm
spend_with_intercept = sm.add_constant(spend)
ols_model = sm.OLS(response, spend_with_intercept)
ols_results = ols_model.fit()
ols_results.params
Image by Author

As expected, the linear regression is not able to capture the non-linear relationship. In a real-life situation, this is the moment when we have to decide on a transformation function and its parameters. Let's transform our spend variable using the Hill function with parameters slightly different from those used to simulate the response variable, and fit the linear regression model again.

alpha = 2
gamma = 0.3
#hill transformation
spend_transformed = spend**alpha / (spend ** alpha + gamma ** alpha)
X_hill_transformed = sm.add_constant(spend_transformed)
ols_model_hill_transformed = sm.OLS(response, X_hill_transformed)
ols_results_hill_transformed = ols_model_hill_transformed.fit()
ols_results_hill_transformed.params
Image by Author

The model now captures the non-linearity, but in the region of lower spend there is an obvious mismatch. Let's try again with different parameter values.

Image by Author

As we move the gamma value closer to the value used to generate the response, we get a better and better fit. Of course, the modeler doesn't try different parameters manually; this process is automated by hyperparameter optimization frameworks. However, the transformation step still requires effort to agree upon the proper transformation function, and computing time to find approximate transformation parameters that fit the model to the data.
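For illustration, here is a minimal sketch of how such an automated search could look with Optuna (my choice of framework here; any hyperparameter optimizer works), tuning the Hill parameters by minimizing the residual sum of squares:

import optuna
import statsmodels.api as sm

def objective(trial):
    #search ranges for the Hill parameters are assumptions
    alpha = trial.suggest_float("alpha", 0.5, 3.0)
    gamma = trial.suggest_float("gamma", 0.01, 1.0)
    transformed = spend ** alpha / (spend ** alpha + gamma ** alpha)
    results = sm.OLS(response, sm.add_constant(transformed)).fit()
    #residual sum of squares of the fitted linear model
    return ((response - results.fittedvalues) ** 2).sum()

study = optuna.create_study(direction = "minimize")
study.optimize(objective, n_trials = 100)
study.best_params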

Is there a way to omit the transformation step and let the model find the non-linear relationships? Yes. In one of my previous articles, I described a machine-learning approach to achieving this:

Tree-based machine learning algorithms improve accuracy over linear regression approaches and deal with non-linearities, but they have an important disadvantage: they do not maintain the interpretability of linear models. Additional techniques such as SHAP have to be applied to explain media performance, which may not be intuitive to marketers. The second approach is to extend linear models in a way that allows us to:

  • Maintain the interpretability
  • Model the non-linear effects

Such an extension is called Generalized Additive Models, where non-linear effects are modeled using so-called smoothing splines.

An overview of Generalized Additive Models (GAM) and Smoothing Splines

The standard (additive) linear model takes the following form

y = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ + ε

where the βs are the coefficients and ε is the error term.

The generalized additive model extends linear regression with the following form:

y = β₀ + f₁(x₁) + f₂(x₂) + … + fₚ(xₚ) + ε

where each fⱼ is a smooth function. Essentially, a GAM adds a sum of smooth terms to the sum of linear terms. There are quite a few approaches to constructing the smooth functions, of which penalized smoothing splines are recommended for their computational efficiency.

The penalized smoothing splines work as follows:

First, the range of the variable is divided into distinct regions at K boundary points called knots. Second, within each region, a low-degree polynomial function is fitted to the data. These low-degree polynomials are called basis functions. The spline is a sum of weighted basis functions, evaluated at the values of the data. To control the model's smoothness, a penalty is added to the least squares loss function: the smoothing parameter lambda controls the trade-off between smoothness and wiggliness of the estimated smoothing function f. Low values of lambda result in an essentially un-penalized spline, whereas high values of lambda shrink the estimate toward a straight line.
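Written out, the penalized least squares criterion being minimized is the standard one (notation mine; f is the smooth function being estimated):

\min_{f} \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2 + \lambda \int f''(t)^2 \, dt

The first term rewards fidelity to the data, the second penalizes curvature, and lambda steers the trade-off between the two.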

Generalized Additive Models in Python

Let's return to the previous example and fit a penalized smoothing spline to the spend-response data.

In Python, pyGAM is a package for building Generalized Additive Models:

pip install pygam

I model the linear GAM using the LinearGAM function. The spline term is defined using the s function, which expects the index of the variable. n_splines controls how many basis functions are fitted. I set lambda to a very low value of 0.1, which means I almost don't penalize the wiggliness of the spline:

from pygam import LinearGAM, s
gam_hill = LinearGAM(s(0, n_splines=10), lam = 0.1).fit(spend, response)

Let’s understand the output of the model:

gam_hill.summary()
Image by Author

The important parts worth mentioning are:

  • Rank — number of splines or basis functions
  • Lambda — the smoothing parameter
  • EDoF — effective degrees of freedom. A higher EDoF indicates a more complex spline; an EDoF close to 1 suggests that the underlying term is linear. The documentation of the mgcv package in R (pyGAM's R counterpart) suggests the following rule of thumb for selecting the number of splines (a scripted version of this check follows the list):
    As with all model assumptions, it is useful to be able to check the choice of n_splines informally. If the effective degrees of freedom for a model term is estimated to be much less than n_splines-1, then this is unlikely to be very worthwhile, but as the EDoF approaches n_splines-1, checking can be important: increase the n_splines and refit the original model. If there are no statistically important changes as a result of doing this, then n_splines was large enough. Changes in the smoothness selection criterion and/or the effective degrees of freedom when n_splines is increased provide the obvious numerical measures for whether the fit has changed substantially.
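This check is easy to script in pyGAM (a sketch; I rely on pyGAM's statistics_ dictionary of fitted statistics):

#refit with more basis functions and compare the effective degrees of freedom
gam_check = LinearGAM(s(0, n_splines = 20), lam = 0.1).fit(spend, response)
print(gam_hill.statistics_["edof"], gam_check.statistics_["edof"])

If the EDoF barely changes, the original n_splines was large enough.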

Let’s extract the spend component:

XX = gam_hill.generate_X_grid(term=0, n = len(response))
YY = gam_hill.partial_dependence(term=0, X=XX)

The expected value of the response is the sum of all individual terms including the intercept:

#in pyGAM, the intercept is the last coefficient
intercept_ = gam_hill.coef_[-1]
response_prediction = intercept_ + YY

Let’s plot the final spend-to-response relationship:
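A sketch of the plot, overlaying the fitted spline on the simulated data (the data frame names are mine):

import pandas as pd
from plotnine import ggplot, aes, geom_point, geom_line

df_raw = pd.DataFrame({"spend": spend, "response": response})
df_fit = pd.DataFrame({"spend": XX[:, 0], "response": response_prediction})
(ggplot(df_raw, aes("spend", "response"))
 + geom_point(alpha = 0.3)
 + geom_line(data = df_fit, color = "blue", size = 1))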

Image by Author

The model captured the non-linear spend-to-response relationship without any transformation of the independent variable.

For completeness, let’s take a look at the basis functions of the spline:

import pandas as pd
from plotnine import ggplot, aes, geom_line, ggtitle

basis_functions = pd.DataFrame(gam_hill._modelmat(spend).toarray())
basis_functions["spend"] = spend
#plot
(ggplot(pd.melt(basis_functions, id_vars="spend"),
        aes("spend", "value", group="variable", color="variable"))
 + geom_line()
 + ggtitle(f"{len(basis_functions.columns) - 2} Basis Functions plus Intercept"))
Image by Author

The resulting spline is the sum of the basis functions weighted by their corresponding coefficients:

basis = gam_hill._modelmat(spend).toarray()
coefficients = gam_hill.coef_
#omit the intercept (the last column of pyGAM's model matrix)
spline = basis[:, :-1] @ coefficients[:-1]
Image by Author

I omitted the addition of the intercept, so the plot shows a decomposed smoothing effect of the spend variable.

Modeling Marketing Mix with Generalized Additive Models

Now that we understand how GAM works, I will show how it can be incorporated into the MMM framework. I will reuse the framework I introduced in the following article:

The changes are minimal. Instead of Random Forest, I use GAM. The SHAP component, which was necessary to explain the black-box model, is no longer needed. Because of the additivity of components in GAM, we can analyze the effect of each smoothed media variable almost exactly as in a linear regression model.

Data

As in my previous articles on MMM, I continue using the dataset provided by Robyn under the MIT License for benchmarking, and I follow the same data preparation steps, applying Prophet to decompose the series into trend, seasonality, and holiday effects.

The dataset consists of 208 weeks of revenue (from 2015–11–23 to 2019–11–11) and includes:

  • 5 media spend channels: tv_S, ooh_S, print_S, facebook_S, search_S
  • 2 media channels that also have exposure information (impressions, clicks): facebook_I, search_clicks_P (not used in this article)
  • Organic media without spend: newsletter
  • Control variables: events, holidays, competitor sales (competitor_sales_B)

The analysis window is 92 weeks from 2016–11–21 to 2018–08–20.

Modeling

The modeling steps are exactly the same as I described in the previous article:

I will mention the important changes with respect to GAM.

pyGAM expects a rather non-standard definition of input features. I've written a helper function that takes two types of features: those that should be modeled as linear terms and those that should be modeled as smoothing splines. In addition, the number of splines is provided as a parameter:

from pygam import s, l

def build_gam_formula(data,
                      base_features,
                      smooth_features,
                      n_splines = 20):

    #set the first spline term
    channel_index = data.columns.get_loc(smooth_features[0])
    formula = s(channel_index,
                n_splines,
                constraints = 'monotonic_inc')

    #remaining smooth terms
    for smooth_channel in smooth_features[1:]:
        channel_index = data.columns.get_loc(smooth_channel)
        formula = formula + s(channel_index,
                              n_splines,
                              constraints = 'monotonic_inc')

    #linear terms
    for base_feature in base_features:
        feature_index = data.columns.get_loc(base_feature)
        formula = formula + l(feature_index)

    return formula
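A hypothetical call could look like this (the feature names follow the Robyn dataset; the exact column names and order in the prepared data are assumptions):

base_features = ["trend", "season", "holiday", "events", "competitor_sales_B"]
media_channels = ["tv_S", "ooh_S", "print_S", "facebook_S", "search_S", "newsletter"]
gam_formula = build_gam_formula(data, base_features, media_channels)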

The resulting formula applied to 6 media channels and 5 control variables looks like this:

s(5) + s(6) + s(7) + s(8) + s(9) + s(10) + l(0) + l(1) + l(2) + l(3) + l(4)

s(5), for example, means applying smoothing splines to the variable whose column index in the dataset is 5.

Pay attention to the constraints parameter with the value monotonic_inc. In the context of linear regression, we apply agreed-upon business logic related to the saturation effect and diminishing returns by constraining the linear coefficients to be positive, so that an increase in advertisement spend cannot cause a decrease in response.

Check this article for more details:

In the context of GAM, I put a constraint on smoothing splines to be monotonically increasing.

Parameter Optimization

Along with the 5 adstock parameters, I optimize the smoothing parameter lambda in the range between 0.1 and 1000 for all variables. Note: it is not quite clear to me why pyGAM requires smoothing parameters for variables that are explicitly defined as linear.
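For the smoothing part in isolation, pyGAM's built-in gridsearch accepts a 2-D array in which each row is a candidate vector of per-term lambdas, so a random search over the 0.1 to 1000 range can be sketched as follows (X_train and y_train are hypothetical names; in the full framework, lambda is tuned jointly with the adstock parameters):

import numpy as np

#100 random lambda vectors for the 11 terms, log-uniform in [0.1, 1000]
lams = 10 ** (np.random.rand(100, 11) * 4 - 1)
gam = LinearGAM(gam_formula).gridsearch(X_train.values, y_train, lam = lams)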

Final Model

I built the final model using the window of 92 weeks. The summary of the model is shown below:

Image by Author

The summary shows that 3 out of 6 media channels got a high smoothing lambda, and their corresponding EDoF near 1 suggests an almost linear relationship.

Comparing error metrics across the evaluations, GAM achieved higher accuracy than the Bayesian approach and slightly lower accuracy than Random Forest:

Image by Author

Diminishing return / Saturation effect

Let's plot the resulting spend-to-response relationships. The vertical line marks the average spend in the channel, the x-axis is the media spend, and the y-axis is the effect of that spend on the response variable.

Image by Author
Image by Author
Image by Author
Image by Author
Image by Author

As expected after reviewing the final model, several media channels (search_S, ooh_S) have an almost linear effect.

Taking facebook_S, we observe a strong S-shape of diminishing returns, suggesting that advertisement spend beyond roughly 200K won't bring any additional increase in response.

Taking print_S, its response rate declines at about 125K spend.

Observing search_S, its response increases almost linearly, but at a very slow rate.

Conclusion

In this article, I showed yet another approach to modeling the marketing mix: applying penalized smoothing splines to media channels. This approach does not require transforming media spend variables to account for the non-linearities of diminishing returns. Moreover, the smoothing splines resulted in a better model fit than the linear regression models, which increases confidence in the decomposed effects.

The complete code can be downloaded from my GitHub repo.

Thanks for reading!
