Modeling Marketing Mix using PyMC3

Experimenting with priors, data normalization, and comparing Bayesian modeling with Robyn, Facebook’s open-source MMM package

Slava Kisilevich
Towards Data Science


In this article, I apply a Bayesian approach to the marketing problem of estimating the impact of advertising spend in different media channels on revenue. I cover several aspects of Bayesian modeling that are important to the MMM practitioner:

  • Normalization of dependent and independent variables and selection of priors
  • The influence of normalization of the response variable on the estimated effects

In addition, I compare the results with Robyn, the open-source MMM package that uses Ridge regression and a gradient-free optimization framework for hyperparameter optimization, following Robyn's methodology for data preparation.

Previous publications on Marketing Mix Modeling and Bayesian inference

Quite a few articles have been published on modeling Marketing Mix using Bayesian programming, covering different aspects of the modeling.

The structure of the article is as follows:

  • Marketing Mix Modeling — I give a short introduction into the theory behind MMM.
  • Adstock / Carryover Effect — I cover differences between geometric adstock functions and provide a Theano implementation of the geometric function used in the Robyn framework
  • Diminishing Returns / Saturation Effect — I cover various functions that can be used for modeling diminishing returns
  • Modeling — the main part of this article in which I explore the influence of data normalization on the results. I use the demo data provided by Robyn’s team and follow Robyn’s methodology for data processing. Finally, I compare the results to Robyn’s modeling.

Marketing Mix Modeling

Half the money I spend on advertising is wasted; the trouble is I don’t know which half (John Wanamaker)

The goal of MMM is to understand the drivers of sales by measuring the impact of all factors that may influence sales. These factors fall into two main groups: factors having only an indirect influence on sales (also called the baseline), such as the economic situation, holidays, weather, and competition, and factors that have a direct influence on sales (also called the marketing contribution), such as advertising spend (ad spend) on different media channels like TV, radio, and online platforms, as well as price and promotions.

The goal of the modeler is to define the relevant components that may influence sales, prepare the historical time-series data of marketing activity and other (control) variables, and build a statistical model that estimates the impact of each component on sales. This is usually done using multiple linear regression.

The whole process is complicated by two marketing effects: the carryover effect and diminishing returns. Advertising may have a carryover effect, meaning that sales are influenced by the advertising for some days or weeks after it runs. Moreover, the advertising effect may not be immediate due to delayed consumer response; in this case, we speak of a lagged carryover effect.

Diminishing returns means that beyond a certain level of media spend the relationship between spend and sales is no longer linear and reaches a point of saturation at which an additional euro invested in advertising will not lead to an increase in sales.

The challenge in MMM is to model the carryover and saturation effects for each media channel along with the baseline and media components, while applying constraints that make sense from a marketing perspective. One such constraint is that media spend should have a positive influence on sales, which requires the linear model to estimate positive coefficients for media channels.

Solving this challenge requires a modeling framework capable of optimizing various parameters subject to different constraints and prior knowledge. This can be achieved either with a general hyperparameter optimization framework or with Bayesian programming. In this article, I use PyMC3, a Bayesian framework, to model the marketing mix.

Adstock / Carryover Effects

There are several adstock transformation functions we can use to model the carryover effect. The most commonly used is the so-called adstock with geometric decay, which comes in two flavors. The first is described in this paper and has three parameters to be estimated:

  • decay rate of the ad effect: 0 < α < 1; for example, if we invest 100 euro today and α is 0.5, the expected effect tomorrow will be 50 euro
  • the maximum duration of the effect L (in days or weeks) assumed for a media channel
  • delay of the peak effect θ (0 ≤ θ ≤ L-1), modeling a possible lag in the effect of ad spend that does not start immediately

The implementation of this version of the transformation can be found here. Other implementations based on this version but with a slightly different calculation of the delay weights can be found here and here.
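To make the idea concrete, here is a minimal NumPy sketch of this three-parameter version. The function name, the conversion of x to a NumPy array, and the normalization by the total weight are my own choices, not the referenced implementation:

import numpy as np

def adstock_geometric_delayed(x, alpha: float, theta: int, L: int):
    """Sketch of geometric adstock with a delayed peak:
    alpha = decay rate, theta = delay of the peak effect,
    L = maximum duration of the carryover effect."""
    x = np.asarray(x, dtype = float)
    # weight of a spend that happened l periods ago peaks at l = theta
    weights = np.array([alpha ** ((l - theta) ** 2) for l in range(L)])
    x_decayed = np.zeros(len(x))
    for t in range(len(x)):
        window = x[max(0, t - L + 1): t + 1][::-1]  # x[t], x[t-1], ..., x[t-L+1]
        x_decayed[t] = np.sum(window * weights[:len(window)]) / np.sum(weights)
    return x_decayed

For example, with α = 0.5, θ = 1, and L = 4, a single spend of 100 at day 0 produces its largest adstock value one day later, reflecting the delayed peak.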

The second version is much easier to implement and has only one parameter, the decay rate α. The Robyn team uses this version in their framework:

y(t) = x(t) + α · y(t-1)

where α is the decay rate, x(t) is the current ad spend, y(t-1) is the cumulative adstock effect at time t-1, and y(t) is the final adstock at time t.

The implementation in Python takes only a few lines of code:

import numpy as np

def adstock_geometric(x, alpha: float):
    x_decayed = np.zeros_like(x)
    x_decayed[0] = x[0]
    for xi in range(1, len(x_decayed)):
        x_decayed[xi] = x[xi] + alpha * x_decayed[xi - 1]
    return x_decayed

Example:

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0 , 6.0])
adstock_geometric(x, 0.5)
#output: 1.0, 2.5, 4.25, 6.125, 8.062, 10.031
  • Adstock at day 2 is 2.5: 2.0 + 1.0 * 0.5
  • Adstock at day 3 is 4.25: 3.0 + 2.5 * 0.5
  • Adstock at day 4 is 6.125: 4 + 4.25 * 0.5
  • ….

When applied to a time series, the effect of the adstock transformation can be seen in the graphs below:

Image by the author

Since we are going to use PyMC3, we have to rewrite the geometric decay function in Theano:

import theano
import theano.tensor as tt

def adstock_geometric_theano_pymc3(x, theta):
    x = tt.as_tensor_variable(x)

    def adstock_geometric_recurrence_theano(index, input_x, decay_x, theta):
        return tt.set_subtensor(decay_x[index],
                                tt.sum(input_x + theta * decay_x[index - 1]))

    len_observed = x.shape[0]

    x_decayed = tt.zeros_like(x)
    x_decayed = tt.set_subtensor(x_decayed[0], x[0])

    output, _ = theano.scan(
        fn = adstock_geometric_recurrence_theano,
        sequences = [tt.arange(1, len_observed), x[1:len_observed]],
        outputs_info = x_decayed,
        non_sequences = theta,
        n_steps = len_observed - 1
    )

    return output[-1]

Diminishing Returns / Saturation Effect

There are various functions that can be used to model the non-linear relationship between ad spend and sales, such as the power function, the negative exponential, or the logistic function. The Robyn team uses the Hill function described in this paper.

saturation(x) = x^α / (x^α + γ^α)

where α controls the shape of the curve and γ controls the inflection point. The larger α is, the more S-shaped the diminishing-returns curve becomes; the smaller α, the more C-shaped. The following plots demonstrate different saturation curves as a function of α and γ.

Image by the author

Since the saturation Hill function is applied after the adstock transformation, it requires no special treatment in Theano: the input x is already a Theano tensor.

def saturation_hill_pymc3(x, alpha, gamma):
    x_s_hill = x ** alpha / (x ** alpha + gamma ** alpha)
    return x_s_hill

Modeling

Data

I use the demo dataset provided by Robyn and follow the same methodological steps in data preparation to have the same baseline for comparison. To obtain Robyn's solutions for the comparison, I ran the demo.R file that comes with the Robyn package, written in R, without changing any settings.

The dataset consists of 208 weeks of revenue and includes:

  • 5 media spend channels: tv_S, ooh_S, print_S, facebook_S, search_S
  • 2 media channels that also have exposure information (impressions, clicks): facebook_I, search_clicks_P
  • Organic media without spend: newsletter
  • Control variables: events, holidays, competitor sales (competitor_sales_B)

The modeling window is 92 weeks, from 2016-11-21 to 2018-08-20. To keep this generic, I define two index variables to refer to the time window:

START_ANALYSIS_INDEX = 52
END_ANALYSIS_INDEX = 144

Data Preparation

We have to apply two preparation steps as described in Robyn:

Robyn leverages Prophet, Facebook’s open source ML library for time series forecasting. We use Prophet to automatically decompose the trend, seasonality and holidays effects directly from the response as input variables for further modeling. This capability often increases model fit and reduces autoregressive patterns in residuals.

In addition, we can convert categorical variables such as events into numeric ones using Prophet's decomposition.

The second preparation step:

When using exposure variables (impressions, clicks, GRPs, etc.) instead of spend, Robyn fits a nonlinear model with the Michaelis-Menten function between exposure and spend to establish the spend-exposure relationship

The second step will allow us to convert exposure into spend for the final share of spend vs. share of effect comparison.

Let’s load the data first:

import pandas as pd

data = pd.read_csv("./data/data_raw_2015-11-23__2019-11-11.csv", parse_dates = ["DATE"])
data.columns = [c.lower() if c in ["DATE"] else c for c in data.columns]
data
Image by the author

The holidays data is a separate file originally shipped with Prophet. It has daily granularity, so it has to be aggregated to a weekly level first. Robyn uses German holidays in its demo.

holidays = pd.read_csv("./data/prophet_holidays_daily.csv", parse_dates = ["ds"])
holidays["begin_week"] = holidays["ds"].dt.to_period('W-SUN').dt.start_time

#combine holidays within the same week into one holiday
holidays_weekly = (holidays.groupby(["begin_week", "country", "year"], as_index = False)
                           .agg({'holiday': '#'.join, 'country': 'first', 'year': 'first'})
                           .rename(columns = {'begin_week': 'ds'}))
holidays_weekly_de = holidays_weekly.query("(country == 'DE')").copy()
holidays_weekly_de
Image by the author

Prophet Decomposition

Now we are ready to fit Prophet on our data, including holidays and the categorical events variable, and using yearly seasonality. It is important to note that we apply the decomposition on all available data, not only on the modeling window.

from prophet import Prophet  #or: from fbprophet import Prophet, depending on the installed version

prophet_data = data.rename(columns = {'revenue': 'y', 'date': 'ds'})

#add the categorical events variable into Prophet as dummy regressors
prophet_data = pd.concat([prophet_data,
                          pd.get_dummies(prophet_data["events"], drop_first = True, prefix = "events")],
                         axis = 1)

prophet = Prophet(yearly_seasonality = True, holidays = holidays_weekly_de)
prophet.add_regressor(name = "events_event2")
prophet.add_regressor(name = "events_na")
prophet.fit(prophet_data[["ds", "y", "events_event2", "events_na"]])
prophet_predict = prophet.predict(prophet_data[["ds", "y", "events_event2", "events_na"]])
Image by the author

Let’s extract the seasonality, trend, holidays, and events components:

prophet_columns = [col for col in prophet_predict.columns
                   if not col.endswith("upper") and not col.endswith("lower")]
events_numeric = prophet_predict[prophet_columns].filter(like = "events_").sum(axis = 1)

final_data = data.copy()
final_data["trend"] = prophet_predict["trend"]
final_data["season"] = prophet_predict["yearly"]
final_data["holiday"] = prophet_predict["holidays"]
final_data["events"] = (events_numeric - np.min(events_numeric)).values

Spend-Exposure estimation

In this step, we estimate the non-linear spend-exposure relationship using the Michaelis-Menten function.

The spend-to-exposure function is defined as follows:

exposure = V_max · spend / (K_m + spend)

where V_max and K_m are the two parameters we have to estimate. These two parameters will later be used to derive the reverse exposure-to-spend relationship. The parameters are estimated on the modeling window.
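Solving this equation for spend gives the inverse mapping that will later convert exposure back into spend; it is the same expression as the exposure_to_spend_menten_func defined further below:

spend = K_m · exposure / (V_max - exposure)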

#define the Michaelis-Menten function
from scipy import optimize

spend_to_exposure_menten_func = lambda spend, V_max, K_m: V_max * spend / (K_m + spend)

media_exposures = ["facebook_I", "search_clicks_P"]
media_spends = ["facebook_S", "search_S"]

media_spend_exposure_df = pd.DataFrame()
for (media_exposure, media_spend) in zip(media_exposures, media_spends):
    #initial guesses for the curve fit
    V_max = final_data[media_exposure].values[START_ANALYSIS_INDEX : END_ANALYSIS_INDEX].max()
    K_m = V_max / 2
    spend = final_data[media_spend].values[START_ANALYSIS_INDEX : END_ANALYSIS_INDEX]
    exposure = final_data[media_exposure].values[START_ANALYSIS_INDEX : END_ANALYSIS_INDEX]
    best_values, _ = optimize.curve_fit(f = spend_to_exposure_menten_func, xdata = spend,
                                        ydata = exposure, p0 = [V_max, K_m])
    media_spend_exposure_df = pd.concat([media_spend_exposure_df,
                                         pd.DataFrame({'spend': [media_spend],
                                                       'exposure': [media_exposure],
                                                       'V_max': [best_values[0]],
                                                       'K_m': [best_values[1]]})]).reset_index(drop = True)

media_spend_exposure_df
Image by the author

Bayesian Modeling

Now we are ready for modeling. We model our response variable (revenue) as an additive linear regression, which can be described by the following equation:

revenue(t) = b_0 + Σ_m b_m · saturation_m(adstock_m(x_m(t))) + Σ_c b_c · z_c(t) + ϵ

where b_0 corresponds to the baseline revenue (intercept), the b_m coefficients correspond to the media variables transformed through the adstock and saturation functions, the b_c coefficients correspond to the control variables, and ϵ is noise. All of these coefficients, the noise, and the parameters of the adstock and saturation functions have to be estimated by the model.

The first decision we have to make prior to modeling is how to normalize the dependent and independent variables. By normalizing the independent variables, we generalize our modeling to other data sources, because we can use the same priors for most of them; moreover, it would be very difficult to set priors on non-normalized data. Therefore, I apply 0-1 normalization to the independent variables and experiment with two different normalizations of the response variable:

  • Scaling by 100K
  • 0–1 Normalization

Scaling by 100K

The original revenue ranges from 672,250 to 3,827,520, so by scaling the revenue by 100K I get the range 6.72 to 38.27, which makes it easier to experiment with priors.

First, I define my input variables:

data = final_data
transform_variables = ["trend", "season", "holiday", "competitor_sales_B", "events", "tv_S", "ooh_S", "print_S", "facebook_I", "search_clicks_P", "newsletter"]
delay_channels = ["tv_S", "ooh_S", "print_S", "facebook_I", "search_clicks_P", "newsletter"]
media_channels = ["tv_S", "ooh_S", "print_S", "facebook_I", "search_clicks_P"]
control_variables = ["trend", "season", "holiday", "competitor_sales_B", "events"]
target = "revenue"

Then I normalize the independent variables using MinMaxScaler and scale the dependent variable by 100K:

from sklearn.preprocessing import MinMaxScaler

data_transformed = data.copy()
numerical_encoder_dict = {}

for feature in transform_variables:
    scaler = MinMaxScaler()
    original = data[feature].values.reshape(-1, 1)
    transformed = scaler.fit_transform(original)
    data_transformed[feature] = transformed
    numerical_encoder_dict[feature] = scaler

dependent_transformation = None
original = data[target].values
data_transformed[target] = original / 100_000

The modeling part consists of several steps:

  • I first iterate through the media channels (delay channels) and define the priors for the adstock and saturation transformations. I experimented with different priors but finally use those described in the Bayesian Methods paper
  • I apply the adstock transformation on all available data so that the carryover effect can build up from the historical data
  • I apply the saturation transformation on the modeling window defined by START_ANALYSIS_INDEX and END_ANALYSIS_INDEX
  • I force the regression coefficients of the delay channels to be positive by using a HalfNormal distribution
  • Next, I iterate through the remaining variables without any constraint on the sign of their coefficients
  • I define the prior for the intercept as a normal distribution centered at the average revenue, following Robyn's methodology of keeping the intercept positive.
import pymc3 as pm

response_mean = []
with pm.Model() as model_2:
    for channel_name in delay_channels:
        print(f"Delay Channels: Adding {channel_name}")

        x = data_transformed[channel_name].values

        adstock_param = pm.Beta(f"{channel_name}_adstock", 3, 3)
        saturation_gamma = pm.Beta(f"{channel_name}_gamma", 2, 2)
        saturation_alpha = pm.Gamma(f"{channel_name}_alpha", 3, 1)

        x_new = adstock_geometric_theano_pymc3(x, adstock_param)
        x_new_sliced = x_new[START_ANALYSIS_INDEX:END_ANALYSIS_INDEX]
        saturation_tensor = saturation_hill_pymc3(x_new_sliced, saturation_alpha, saturation_gamma)

        channel_b = pm.HalfNormal(f"{channel_name}_media_coef", sd = 3)
        response_mean.append(saturation_tensor * channel_b)

    for control_var in control_variables:
        print(f"Control Variables: Adding {control_var}")

        x = data_transformed[control_var].values[START_ANALYSIS_INDEX:END_ANALYSIS_INDEX]

        control_beta = pm.Normal(f"{control_var}_control_coef", sd = 3)
        control_x = control_beta * x
        response_mean.append(control_x)

    intercept = pm.Normal("intercept", np.mean(data_transformed[target].values), sd = 3)
    #intercept = pm.HalfNormal("intercept", sd = 3)

    sigma = pm.HalfNormal("sigma", 4)

    likelihood = pm.Normal("outcome",
                           mu = intercept + sum(response_mean),
                           sd = sigma,
                           observed = data_transformed[target].values[START_ANALYSIS_INDEX:END_ANALYSIS_INDEX])

I generate samples from the prior distribution to check whether my choice of priors is reasonable:

Image by the author

And plot prior distributions:

Image by the author
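The snippet for these checks is not shown above; here is a minimal sketch of how they might be done with PyMC3's prior predictive sampling (the plotting calls and the inspected parameter are my own choices):

with model_2:
    prior_pred = pm.sample_prior_predictive(samples = 1000, random_seed = 42)

# the simulated outcomes should roughly cover the observed scaled revenue range (about 6.7-38.3)
az.plot_dist(prior_pred["outcome"].flatten())

# prior of an individual parameter, e.g. the TV media coefficient
az.plot_dist(prior_pred["tv_S_media_coef"])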

Now we can fit the model:

import arviz as az

with model_2:
    trace = pm.sample(1000, tune = 1000, step = None, target_accept = 0.95, return_inferencedata = True)
    trace_summary = az.summary(trace)

I increased the target_accept parameter to 0.95 because I got some convergence warnings with default values.

When the fitting is finished, I sample from the posterior predictive distribution:

with model_2:
    ppc_all = pm.sample_posterior_predictive(
        trace, var_names = ["outcome"] + list(trace_summary.index), random_seed = 42
    )

az.plot_ppc(az.from_pymc3(posterior_predictive = ppc_all, model = model_2), var_names = ["outcome"])
Image by the author

The distribution of the observed data is not normal and is strictly positive, while the likelihood I defined is normal, which is why we see a mismatch at the lowest levels of revenue. This could also be the reason for some of the convergence issues, but switching to StandardScaler or PowerTransformer didn't make things better, so I decided to stick with the more intuitive normalization.

Having the posterior samples, we can now measure the goodness of fit, draw auxiliary plots such as residual plots, decompose revenue, and measure the channel contributions.

Goodness of fit

One of the metrics Robyn uses to measure the prediction error is the normalized root-mean-square error (NRMSE):

def nrmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2)) / (np.max(y_true) - np.min(y_true))

The predicted revenue is the average of the posterior samples multiplied by 100K:

y_true = data[target].values[START_ANALYSIS_INDEX:END_ANALYSIS_INDEX]

#restore the original scale of revenue by multiplying back by 100K
y_pred = ppc_all["outcome"].mean(axis = 0) * 100_000

print(f"RMSE: {np.sqrt(np.mean((y_true - y_pred)**2))}")
print(f"MAPE: {np.mean(np.abs((y_true - y_pred) / y_true))}")
print(f"NRMSE: {nrmse(y_true, y_pred)}")

Decomposition

In order to decompose revenue by channel, we apply the adstock and saturation transformations to each media channel using the parameters estimated by the model and then multiply by the corresponding estimated coefficient; a sketch of this computation follows the parameter summary below.

The summary of parameters and coefficients estimated by the model:

Image by the author
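The decomposition code itself is not reproduced in the article; below is a hedged sketch of how it might look using the posterior means from trace_summary. The data_transformed_decomposed frame and the use of posterior means rather than full posterior draws are my assumptions:

#decompose contributions using posterior means of the estimated parameters
data_transformed_decomposed = data_transformed.copy()

for channel_name in delay_channels:
    adstock_mean = trace_summary.loc[f"{channel_name}_adstock", "mean"]
    alpha_mean = trace_summary.loc[f"{channel_name}_alpha", "mean"]
    gamma_mean = trace_summary.loc[f"{channel_name}_gamma", "mean"]
    coef_mean = trace_summary.loc[f"{channel_name}_media_coef", "mean"]

    x = data_transformed[channel_name].values
    #adstock over the full history, saturation over the modeling window only
    x_adstocked = adstock_geometric(x, adstock_mean)
    x_window = x_adstocked[START_ANALYSIS_INDEX:END_ANALYSIS_INDEX]
    x_saturated = x_window ** alpha_mean / (x_window ** alpha_mean + gamma_mean ** alpha_mean)

    contribution = np.zeros(len(x))
    contribution[START_ANALYSIS_INDEX:END_ANALYSIS_INDEX] = coef_mean * x_saturated
    data_transformed_decomposed[channel_name] = contribution

for control_var in control_variables:
    coef_mean = trace_summary.loc[f"{control_var}_control_coef", "mean"]
    x_window = data_transformed[control_var].values[START_ANALYSIS_INDEX:END_ANALYSIS_INDEX]

    contribution = np.zeros(len(data_transformed))
    contribution[START_ANALYSIS_INDEX:END_ANALYSIS_INDEX] = coef_mean * x_window
    data_transformed_decomposed[control_var] = contribution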

I plot the actual revenue and the predicted posterior revenue along with the revenue obtained by summing up the contributions of each component, to check whether the decomposed revenue matches the posterior prediction.
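A possible sketch of this comparison plot; summing the decomposed contributions and the intercept from data_transformed_decomposed is my assumption about how the decomposed revenue is reconstructed:

import matplotlib.pyplot as plt

dates = data["date"].iloc[START_ANALYSIS_INDEX:END_ANALYSIS_INDEX]

#sum of all component contributions plus the intercept, rescaled back to revenue
decomposed_total = (data_transformed_decomposed[delay_channels + control_variables]
                    .iloc[START_ANALYSIS_INDEX:END_ANALYSIS_INDEX]
                    .sum(axis = 1)
                    + trace_summary.loc["intercept", "mean"]) * 100_000

plt.figure(figsize = (14, 5))
plt.plot(dates, y_true, label = "actual revenue")
plt.plot(dates, y_pred, label = "posterior predictive mean")
plt.plot(dates, decomposed_total, linestyle = "--", label = "decomposed revenue")
plt.legend()
plt.show()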

Image by the author

NRMSE of the decomposed revenue: 0.058, MAPE: 0.067

The last step is to calculate the share of media channel spend and compare it with the share of revenue (the effect share).

I calculate the spend share using the original data. For the variables that were modeled using their exposure information (facebook_I, search_clicks_P), I recover spend from exposure using the inverse exposure-to-spend relationship:

exposure_to_spend_menten_func = lambda exposure, V_max, K_m: exposure * K_m / (V_max - exposure)

spend_df = pd.DataFrame()
for media_channel in media_channels:
    temp_series = data[media_channel].iloc[START_ANALYSIS_INDEX:END_ANALYSIS_INDEX].values

    #channels modeled via exposure are converted back to spend
    if len(media_spend_exposure_df[media_spend_exposure_df.exposure == media_channel]) > 0:
        vmax = media_spend_exposure_df[media_spend_exposure_df.exposure == media_channel]["V_max"].iloc[0]
        km = media_spend_exposure_df[media_spend_exposure_df.exposure == media_channel]["K_m"].iloc[0]
        spends = exposure_to_spend_menten_func(temp_series, V_max = vmax, K_m = km)
        spends_total = spends.sum()
    else:
        spends_total = temp_series.sum()

    spend_df = pd.concat([spend_df,
                          pd.DataFrame({'media': [media_channel],
                                        'total_spend': [spends_total]})]).reset_index(drop = True)

spend_df["spend_share"] = spend_df["total_spend"] / spend_df["total_spend"].sum()
spend_df

Then, I find the share of effect for these variables using the decomposed information.

response_df = pd.DataFrame()
for media_channel in media_channels:
    response = data_transformed_decomposed[media_channel].iloc[START_ANALYSIS_INDEX:END_ANALYSIS_INDEX].values
    response_total = response.sum()

    response_df = pd.concat([response_df,
                             pd.DataFrame({'media': [media_channel],
                                           'total_effect': [response_total]})]).reset_index(drop = True)

response_df["effect_share"] = response_df["total_effect"] / response_df["total_effect"].sum()
response_df

And finally, I plot the share of spend vs. the share of effect:
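The plotting code is not shown in the article; here is a minimal matplotlib sketch of such a comparison, assuming the spend_df and response_df frames from the previous steps:

import matplotlib.pyplot as plt

share_df = spend_df.merge(response_df, on = "media")

x_pos = np.arange(len(share_df))
width = 0.35

fig, ax = plt.subplots(figsize = (10, 5))
ax.bar(x_pos - width / 2, share_df["spend_share"], width, label = "share of spend")
ax.bar(x_pos + width / 2, share_df["effect_share"], width, label = "share of effect")
ax.set_xticks(x_pos)
ax.set_xticklabels(share_df["media"])
ax.legend()
plt.show()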

Image by the author

0–1 Normalization

Let's run the same experiment, but this time normalizing the response variable between 0 and 1:

dependent_transformation = MinMaxScaler()
original = data[target].values.reshape(-1, 1)
transformed = dependent_transformation.fit_transform(original)
data_transformed[target] = transformed

The priors for the intercept and the coefficients now have to be adjusted to the new response range:

response_mean = []
with pm.Model() as model_3:
    for channel_name in delay_channels:
        print(f"Delay Channels: Adding {channel_name}")

        x = data_transformed[channel_name].values

        adstock_param = pm.Beta(f"{channel_name}_adstock", 3, 3)
        saturation_gamma = pm.Beta(f"{channel_name}_gamma", 2, 2)
        saturation_alpha = pm.Gamma(f"{channel_name}_alpha", 3, 1)

        x_new = adstock_geometric_theano_pymc3(x, adstock_param)
        x_new_sliced = x_new[START_ANALYSIS_INDEX:END_ANALYSIS_INDEX]
        saturation_tensor = saturation_hill_pymc3(x_new_sliced, saturation_alpha, saturation_gamma)

        channel_b = pm.HalfNormal(f"{channel_name}_media_coef", sd = 0.1)
        response_mean.append(saturation_tensor * channel_b)

    for control_var in control_variables:
        print(f"Control Variables: Adding {control_var}")

        x = data_transformed[control_var].values[START_ANALYSIS_INDEX:END_ANALYSIS_INDEX]

        control_beta = pm.Normal(f"{control_var}_control_coef", 0.1, sd = 0.1)
        control_x = control_beta * x
        response_mean.append(control_x)

    intercept = pm.HalfNormal("intercept", 0.1)

    sigma = pm.HalfNormal("sigma", 0.15)

    likelihood = pm.Normal("outcome",
                           mu = intercept + sum(response_mean),
                           sd = sigma,
                           observed = data_transformed[target].values[START_ANALYSIS_INDEX:END_ANALYSIS_INDEX])

The prior distributions look reasonable:

Image by the author

The modeling and posterior predictive checks are the same as before, so let's look at the goodness of fit and plot the share of spend vs. the share of effect:

NRMSE: 0.058, MAPE: 0.065

Now let's compare both models:

Image by the author

There are small differences in the estimated effect shares, but I can conclude that both models are consistent in estimating the relative magnitude of the share of effect vs. the share of spend.

Comparison with Robyn

Robyn generates a set of models and lets the modeler choose the one most relevant to the business. For several of the best solutions, it generates share of spend vs. share of effect graphs and saves them to files.

Among the best models, I subjectively picked one that is more or less comparable to the models generated with PyMC3:

Image generated by Robyn

I compare Robyn with two PyMC3 models:

Image by the author

The comparison shows that the two models generated in PyMC3 are more similar to each other than to Robyn. One possible reason is that a solution similar to the one generated by PyMC3 was among the candidates but not among the top solutions. Another reason may be related to the way Robyn selects its top candidates: Robyn performs multi-objective optimization using two measures, NRMSE, which captures the model's prediction error, and RSSD (decomposition root-sum-square distance):

The distance accounts for a relationship between spend share and a channel’s coefficient decomposition share. If the distance is too far, its result can be too unrealistic — e.g. media activity with the smallest spending gets the largest effect
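As a rough illustration, this distance can be computed from the spend and effect shares built earlier; this is my reading of the description above, not Robyn's implementation:

shares = spend_df.merge(response_df, on = "media")
rssd = np.sqrt(np.sum((shares["effect_share"].values - shares["spend_share"].values) ** 2))
print(f"RSSD: {rssd:.3f}")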

Conclusion

In this article, I experimented with the normalization of independent and dependent variables, with priors, and with posterior distributions. I used the data preparation methodology proposed by Robyn, which makes the code in this article reusable in a real-life scenario. I also compared the results of Bayesian modeling with Robyn. Since the final model selection is business-relevant, there is no easy way to decide whether the models generated by Robyn or by PyMC3 were better in these experiments without additional business context and calibration. Comparing NRMSE alone, the selected top model generated by Robyn had a lower NRMSE and thus a better fit. Since Robyn additionally optimizes a business-relevant metric, it is less likely to generate a model that is statistically accurate but unrealistic from the marketing perspective. I believe the proposed Bayesian MMM solution can be improved further. Questions that remain open for me are how to improve the prior parameterization to resolve the convergence warnings and which likelihood distribution other than the Normal could be used for a non-normally distributed, strictly positive response variable.

The complete code can be downloaded from my Github repo.

Thanks for reading!
