Python/STAN Implementation of Multiplicative Marketing Mix Model

With Deep Dive into Adstock, Diminishing Return, ROAS, and mROAS

Sibyl He
Towards Data Science

--

Full code and simulated dataset are posted on my Github repo:
https://github.com/sibylhe/mmm_stan

The methodology of this project is based on this paper by Google, but is applied to a more complicated, real-world setting, where 1) there are 13 media channels and 46 control variables; 2) models are built in a stacked way.

1. Introduction

Marketing Mix Model, or Media Mix Model (MMM) is used by advertisers to measure how their media spending contributes to sales, so as to optimize future budget allocation. ROAS (return on ad spend) and mROAS (marginal ROAS) are the key metrics to look at. High ROAS indicates the channel is efficient, high mROAS means increasing spend in the channel will yield a high return based on the current spending level.

Procedures

1. Fit a regression model with priors on coefficients, using media channels’ impressions (or spending) and control variables to predict sales;

2. Decompose sales to each media channel’s contribution. Channel contribution is calculated by comparing original sales and predicted sales upon removal of the channel;

3. Compute ROAS and mROAS using channel contribution and spending.

Intuition of MMM

  • Offline channel’s influence is hard to track. E.g., a customer saw a TV ad, and made a purchase at store.
  • Media channels’ influences are intertwined.

Actual Customer Journey: Multiple Touchpoints
A customer saw a product on TV > clicked on a display ad > clicked on a paid search ad > made a purchase of $30. In this case, 3 touchpoints contributed to the conversion, and they should all get credits for this conversion.

(Image by Author, Icons by icons8.com)

​What’s trackable: Last Digital Touchpoint
Usually, only the last digital touchpoint can be tracked. In this case, SEM, and it will get all credits for this conversion.

(Image by Author, Icons by icons8.com)

So, a good attribution model should take into account all the relevant variables leading to conversion.​

1.1 Multiplicative MMM

Since media channels work interactively, a multiplicative model structure is adopted:

Take log of both sides, we get the linear form (log-log model):

Constraints on Coefficients

1. Media coefficients are positive.

2. Control variables like discount, macroeconomy, event/retail holiday are expected to have positive impact on sales, their coefficients should also be positive.

1.2 Adstock

Media effect on sales may lag behind the original exposure and extend several weeks. The carry-over effect is modeled by Adstock:

L: length of the media effect
P: peak/delay of the media effect, how many weeks it’s lagging behind first exposure
D: decay/retention rate of the media channel, concentration of the effect
The media effect of the current weeks is a weighted average of the current week and previous (L− 1) weeks.

Adstock Example

Adstock with Varying Decay
The higher the decay, the more scattered the effect.

(Image by Author)

Adstock with Varying Length
The impact of length is relatively minor. In model training, the length could be fixed to 8 weeks or a period long enough for the media effect to finish.

(Image by Author)
import numpy as np
import pandas as pd
def apply_adstock(x, L, P, D):
‘’’
params:
x: original media variable, array
L: length
P: peak, delay in effect
D: decay, retain rate
returns:
array, adstocked media variable
‘’’
x = np.append(np.zeros(L-1), x)

weights = np.zeros(L)
for l in range(L):
weight = D**((l-P)**2)
weights[L-1-l] = weight

adstocked_x = []
for i in range(L-1, len(x)):
x_array = x[i-L+1:i+1]
xi = sum(x_array * weights)/sum(weights)
adstocked_x.append(xi)
adstocked_x = np.array(adstocked_x)
return adstocked_x

1.3 Diminishing Return

After a certain saturation point, increasing spend will yield diminishing marginal return, the channel will be losing efficiency as you keep overspending on it. The diminishing return is modeled by Hill function:

K: half-saturation point
S: slope

Hill function with varying K and S

(Image by Author)
def hill_transform(x, ec, slope):
return 1 / (1 + (x / ec)**(-slope))

2. Model Specification & Implementation

Dataset
Four years’ (209 weeks) records of sales, media impression, and media spending at a weekly level.

1. Media Variables
- Media Impression (prefix=’mdip_’): impressions of 13 media channels: direct mail, insert, newspaper, digital audio, radio, TV, digital video, social media, online display, email, SMS, affiliates, SEM.
- Media Spending (prefix=’mdsp_’): spending of media channels.

2. Control Variables
- Macro Economy (prefix=’me_’): CPI, gas price.
- Markdown (prefix=’mrkdn_’): markdown/discount.
- Store Count (‘st_ct’)
- Retail Holidays (prefix=’hldy_’): one-hot encoded.
- Seasonality (prefix=’seas_’): month, with Nov and Dec broke into weeks. One-hot encoded.

3. Sales Variable (‘sales’)

df = pd.read_csv(‘data.csv’)# 1. media variables
# media impression
mdip_cols=[col for col in df.columns if ‘mdip_’ in col]
# media spending
mdsp_cols=[col for col in df.columns if ‘mdsp_’ in col]
# 2. control variables
# macro economics variables
me_cols = [col for col in df.columns if ‘me_’ in col]
# store count variables
st_cols = [‘st_ct’]
# markdown/discount variables
mrkdn_cols = [col for col in df.columns if ‘mrkdn_’ in col]
# holiday variables
hldy_cols = [col for col in df.columns if ‘hldy_’ in col]
# seasonality variables
seas_cols = [col for col in df.columns if ‘seas_’ in col]
base_vars = me_cols+st_cols+mrkdn_cols+va_cols+hldy_cols+seas_cols
# 3. sales variables
sales_cols =[‘sales’]

Model Architecture

The model is built in a stacked way. Three models are trained:
- Control Model
- Marketing Mix Model
- Diminishing Return Model

Model Architecture (Image by Author)

2.1 Control Model / Base Sales Model

Goal: predict base sales (X_ctrl) as an input variable to MMM, this represents the baseline sales trend without any marketing activities.

X1: control variables positively related to sales, including macro economy, store count, markdown, holiday.
X2: control variables that may have either positive or negtive impact on sales: seasonality.
Target variable: ln(sales).
The variables are centralized by mean.

Priors

import pystan
import os
os.environ[‘CC’] = ‘gcc-10’
os.environ[‘CXX’] = ‘g++-10’
# helper functions
def apply_mean_center(x):
mu = np.mean(x)
xm = x/mu
return xm, mu
def mean_center_trandform(df, cols):
df_new = pd.DataFrame()
sc = {}
for col in cols:
x = df[col].values
df_new[col], mu = apply_mean_center(x)
sc[col] = mu
return df_new, sc
def mean_log1p_trandform(df, cols):
df_new = pd.DataFrame()
sc = {}
for col in cols:
x = df[col].values
xm, mu = apply_mean_center(x)
sc[col] = mu
df_new[col] = np.log1p(xm)
return df_new, sc
# mean-centralize: sales, numeric base_vars
df_ctrl, sc_ctrl = mean_center_trandform(df, [‘sales’]+me_cols+st_cols+mrkdn_cols)
df_ctrl = pd.concat([df_ctrl, df[hldy_cols+seas_cols]], axis=1)
# variables positively related to sales: macro economy, store count, markdown, holiday
pos_vars = [col for col in base_vars if col not in seas_cols]
X1 = df_ctrl[pos_vars].values
# variables may have either positive or negtive impact on sales: seasonality
pn_vars = seas_cols
X2 = df_ctrl[pn_vars].values
ctrl_data = {
’N’: len(df_ctrl),
‘K1’: len(pos_vars),
‘K2’: len(pn_vars),
‘X1’: X1,
‘X2’: X2,
‘y’: df_ctrl[‘sales’].values,
‘max_intercept’: min(df_ctrl[‘sales’])
}
ctrl_code1 = ‘’’
data {
int N; // number of observations
int K1; // number of positive predictors
int K2; // number of positive/negative predictors
real max_intercept; // restrict the intercept to be less than the minimum y
matrix[N, K1] X1;
matrix[N, K2] X2;
vector[N] y;
}
parameters {
vector<lower=0>[K1] beta1; // regression coefficients for X1 (positive)
vector[K2] beta2; // regression coefficients for X2
real<lower=0, upper=max_intercept> alpha; // intercept
real<lower=0> noise_var; // residual variance
}
model {
// Define the priors
beta1 ~ normal(0, 1);
beta2 ~ normal(0, 1);
noise_var ~ inv_gamma(0.05, 0.05 * 0.01);
// The likelihood
y ~ normal(X1*beta1 + X2*beta2 + alpha, sqrt(noise_var));
}
‘’’
sm1 = pystan.StanModel(model_code=ctrl_code1, verbose=True)
fit1 = sm1.sampling(data=ctrl_data, iter=2000, chains=4)
fit1_result = fit1.extract()

MAPE of control model: 8.63%

Extract control model parameters from the fit object and predict base sales -> df[‘base_sales’]

2.2 Marketing Mix Model

Goal:

- Find appropriate adstock parameters for media channels;
- Decompose sales to media channels’ contribution (and non-marketing contribution).

L: length of media impact
P: peak of media impact
D: decay of media impact
X: adstocked media impression variables and base sales
Target variable: ln(sales)
Variables are centralized by means.

Priors

df_mmm, sc_mmm = mean_log1p_trandform(df, [‘sales’, ‘base_sales’])
mu_mdip = df[mdip_cols].apply(np.mean, axis=0).values
max_lag = 8
num_media = len(mdip_cols)
X_media = np.concatenate((np.zeros((max_lag-1, num_media)), df[mdip_cols].values), axis=0) # padding zero * (max_lag-1) rows
X_ctrl = df_mmm[‘base_sales’].values.reshape(len(df),1)
model_data2 = {
’N’: len(df),
‘max_lag’: max_lag,
‘num_media’: num_media,
‘X_media’: X_media,
‘mu_mdip’: mu_mdip,
‘num_ctrl’: X_ctrl.shape[1],
‘X_ctrl’: X_ctrl,
‘y’: df_mmm[‘sales’].values
}
model_code2 = ‘’’
functions {
// the adstock transformation with a vector of weights
real Adstock(vector t, row_vector weights) {
return dot_product(t, weights) / sum(weights);
}
}
data {
// the total number of observations
int<lower=1> N;
// the vector of sales
real y[N];
// the maximum duration of lag effect, in weeks
int<lower=1> max_lag;
// the number of media channels
int<lower=1> num_media;
// matrix of media variables
matrix[N+max_lag-1, num_media] X_media;
// vector of media variables’ mean
real mu_mdip[num_media];
// the number of other control variables
int<lower=1> num_ctrl;
// a matrix of control variables
matrix[N, num_ctrl] X_ctrl;
}
parameters {
// residual variance
real<lower=0> noise_var;
// the intercept
real tau;
// the coefficients for media variables and base sales
vector<lower=0>[num_media+num_ctrl] beta;
// the decay and peak parameter for the adstock transformation of
// each media
vector<lower=0,upper=1>[num_media] decay;
vector<lower=0,upper=ceil(max_lag/2)>[num_media] peak;
}
transformed parameters {
// the cumulative media effect after adstock
real cum_effect;
// matrix of media variables after adstock
matrix[N, num_media] X_media_adstocked;
// matrix of all predictors
matrix[N, num_media+num_ctrl] X;

// adstock, mean-center, log1p transformation
row_vector[max_lag] lag_weights;
for (nn in 1:N) {
for (media in 1 : num_media) {
for (lag in 1 : max_lag) {
lag_weights[max_lag-lag+1] <- pow(decay[media], (lag — 1 — peak[media]) ^ 2);
}
cum_effect <- Adstock(sub_col(X_media, nn, media, max_lag), lag_weights);
X_media_adstocked[nn, media] <- log1p(cum_effect/mu_mdip[media]);
}
X <- append_col(X_media_adstocked, X_ctrl);
}
}
model {
decay ~ beta(3,3);
peak ~ uniform(0, ceil(max_lag/2));
tau ~ normal(0, 5);
for (i in 1 : num_media+num_ctrl) {
beta[i] ~ normal(0, 1);
}
noise_var ~ inv_gamma(0.05, 0.05 * 0.01);
y ~ normal(tau + X * beta, sqrt(noise_var));
}
‘’’
sm2 = pystan.StanModel(model_code=model_code2, verbose=True)
fit2 = sm2.sampling(data=model_data2, iter=1000, chains=3)
fit2_result = fit2.extract()

RMSE (log-log model): 0.04977
MAPE (multiplicative model): 15.71%

Distribution of Media Coefficients

Media Coefficients Distribution (red line: mean, green line: median. Image by Author)

Adstock Parameters

{‘dm’: {‘L’: 8, ‘P’: 0.8147, ‘D’: 0.5048},
‘inst’: {‘L’: 8, ‘P’: 0.6339, ‘D’: 0.4053},
‘nsp’: {‘L’: 8, ‘P’: 1.1077, ‘D’: 0.4613},
‘auddig’: {‘L’: 8, ‘P’: 1.883, ‘D’: 0.5118},
‘audtr’: {‘L’: 8, ‘P’: 1.9893, ‘D’: 0.5046},
‘vidtr’: {‘L’: 8, ‘P’: 0.0552, ‘D’: 0.0846},
‘viddig’: {‘L’: 8, ‘P’: 1.8626, ‘D’: 0.5075},
‘so’: {‘L’: 8, ‘P’: 1.7027, ‘D’: 0.5046},
‘on’: {‘L’: 8, ‘P’: 1.4170, ‘D’: 0.4907},
‘em’: {‘L’: 8, ‘P’: 1.0590, ‘D’: 0.4442},
‘sms’: {‘L’: 8, ‘P’: 1.8488, ‘D’: 0.5090},
‘aff’: {‘L’: 8, ‘P’: 0.6019, ‘D’: 0.3989},
‘sem’: {‘L’: 8, ‘P’: 1.3495, ‘D’: 0.4788}}

Notes:
- For SEM, P=1.3, D=0.48 does not make a lot of sense to me, because SEM is expected to have an immediate and concentrated impact (P=0, low decay). Same with online display.
- Try more specific priors in future models.​

Decompose sales to media channels’ contribution

Each media channel’s contribution = total sales — sales upon removal of the channel
In the previous model fitting step, parameters of the log-log model have been found:

Plug them into the multiplicative model:

2.3 Diminishing Return Model

Goal: for each channel, find the relationship (fit a Hill function) between spending and contribution, so that ROAS and marginal ROAS can be calculated.

x: adstocked media channel spending
K: half-saturation point
S: shape
Target variable: the media channel’s contribution
Variables are centralized by means.

Priors

def create_hill_model_data(df, mc_df, adstock_params, media):
y = mc_df[‘mdip_’+media].values
L, P, D = adstock_params[media][‘L’], adstock_params[media][‘P’], adstock_params[media][‘D’]
x = df[‘mdsp_’+media].values
x_adstocked = apply_adstock(x, L, P, D)
mu_x, mu_y = x_adstocked.mean(), y.mean()
sc = {‘x’: mu_x, ‘y’: mu_y}
x = x_adstocked/mu_x
y = y/mu_y

model_data = {
’N’: len(y),
‘y’: y,
‘X’: x
}
return model_data, sc
model_code3 = ‘’’
functions {
// the Hill function
real Hill(real t, real ec, real slope) {
return 1 / (1 + (t / ec)^(-slope));
}
}
data {
// the total number of observations
int<lower=1> N;
// y: vector of media contribution
vector[N] y;
// X: vector of adstocked media spending
vector[N] X;
}
parameters {
// residual variance
real<lower=0> noise_var;
// regression coefficient
real<lower=0> beta_hill;
// ec50 and slope for Hill function of the media
real<lower=0,upper=1> ec;
real<lower=0> slope;
}
transformed parameters {
// a vector of the mean response
vector[N] mu;
for (i in 1:N) {
mu[i] <- beta_hill * Hill(X[i], ec, slope);
}
}
model {
slope ~ gamma(3, 1);
ec ~ beta(2, 2);
beta_hill ~ normal(0, 1);
noise_var ~ inv_gamma(0.05, 0.05 * 0.01);
y ~ normal(mu, sqrt(noise_var));
}
‘’’
# train hill models for all media channels
sm3 = pystan.StanModel(model_code=model_code3, verbose=True)
hill_models = {}
to_train = [‘dm’, ‘inst’, ‘nsp’, ‘auddig’, ‘audtr’, ‘vidtr’, ‘viddig’, ‘so’, ‘on’, ‘sem’]
for media in to_train:
print(‘training for media: ‘, media)
hill_model = train_hill_model(df, mc_df, adstock_params, media, sm3)
hill_models[media] = hill_model

Diminishing Return Model

Diminishing Return Model (blue dots: real data points, red curve: fitted Hill function. Image by Author)

Calculate overall ROAS and weekly ROAS
- Overall ROAS = total media contribution / total media spending
- Weekly ROAS = weekly media contribution / weekly media spending

Distribution of Weekly ROAS (Recent 1 Year)

Weekly ROAS Distribution (red line: mean, green line: median. Image by Author)

Calculate mROAS

Marginal ROAS represents the return of incremental spending based on current spending. For example, I’ve spent $100 on SEM, how much sales will the next $1 bring.

mROAS is calculated by increasing the current spending level by 1%, the incremental channel contribution over incremental channel spending.
1. Current spending level cur_sp is a list of weekly spending in a given period.
Next spending level next_sp is increasing cur_sp by 1%.
2. Plug cur_sp and next_sp into the Hill function:
Current media contribution cur_mc = beta * Hill(cur_sp, K, S)
Next-level media contribution next_mc = beta * Hill(next_sp, K, S)
3. mROAS = (sum(next_mc) - sum(cur_mc)) / sum(0.01 * cur_sp)

3. Results & Marketing Budget Optimization

Media Channel Contribution
About 80% sales are contributed by non-marketing factors, media channels contributed 20% sales.

Media Channel Contribution (Image by Author)

Top media contributors:

  • TV (31.18%)
  • Affiliates (19.49%)
  • Insert (6.99%)
  • SEM (6.56%)
Media Contribution Percentage (Image by Author)

ROAS
High ROAS: TV (11.04), insert (5.85), online display (4.83)
ROAS shows the historic channel efficiency, how much sales $1 channel spend has brought.

(y: contribution percentage, x: ROAS, size: contribution percentage)

Media Channel Contribution vs ROAS (Image by Author)
  • TV: yields both high contribution and ROAS. Followed by insert.
  • Online display: its contribution volume (5.24%) is not large, but its ROAS is high (4.83).
  • SEM: low ROAS (2.06), the channel is a big spender but not quite efficient.

mROAS
High mROAS: TV (16.40), insert (9.35), radio (7.00), online display (6.58)
Marginal ROAS indicates the future potential of the channel, how much sales will the next $1 spend on this channel bring. The idea of budget optimization is moving money from low mROAS channels to high mROAS channels.

ROAS & mROAS (Image by Author)

Since TV, insert, radio, insert yield high mROAS, SEM has a low mROAS, part of SEM budget will be reallocated to TV, insert, radio, insert.

Run simulations with different optimation combinations, for example:

  • SEM: -10%, -20%, -30%
  • TV: +10%, +20%, +30%
  • Insert: +10%, +20%, +30%
  • Radio: +10%, +20%, +30%
  • Online display: +10%, +20%, +30%

Plug them into the MMM, see which option leads to the highest total contribution.

Note: trivial channels: newspaper, digital audio, digital video, social (spending/impression too small to be qualified, so that their results are not trustworthy).

Q&A

Please check this running list of FAQ. If you have questions, comments, suggestions, and practical problems (when applying this script to your datasets) that are unaddressed in this list, feel free to open a discussion or response to this article.

References

[1] Bayesian Methods for Media Mix Modeling with Carryover and Shape Effects. https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46001.pdf
[2] STAN tutorials:
Prior Choice Recommendations. https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations
Pystan Workflow. https://mc-stan.org/users/documentation/case-studies/pystan_workflow.html
A quick-start introduction to Stan for economists. https://nbviewer.jupyter.org/github/QuantEcon/QuantEcon.notebooks/blob/master/IntroToStan_basics_workflow.ipynb
HMC Sampling and STAN. https://education.illinois.edu/docs/default-source/carolyn-anderson/edpsy590ca/lectures/9-hmc-and-stan/hmc_n_stan_post.pdf

Pystan Installation Tips (mac, anaconda3)

I previously installed pystan directly using “pip install pystan”, but got “CompileError: command ‘gcc’ failed with exit status 1” when compiling the model. After many tries, the following method works for me.

1. In bash (terminal): 
(create a stan environment, install pystan, current version is 2.19)
conda create -n stan_env python=3.7 -c conda-forge
conda activate stan_env
conda install pystan -c conda-forge
(install gcc5, pystan 2.19 requires gcc4.9.3 and above)
brew install gcc@5
(look for ‘gcc-10’, ‘g++-10’)
ls /usr/local/bin | grep gcc
ls /usr/local/bin | grep g++
2. Open Anaconda Navigator > Home > Applications on: select stan_env as environment, launch Notebook
3. In python:
import os
os.environ[‘CC’] = ‘gcc-10’
os.environ[‘CXX’] = ‘g++-10’

Thanks for reading! If you like this project, please leave a star on my Github for motivation:)

--

--