Carryover and Shape Effects in Media Mix Modeling: Paper Review

In the following post, I’ll review and implement the main portions of “Bayesian Methods for Media Mix Modeling with Carryover and Shape Effects”.

Chris Barton
Towards Data Science


The full paper, "Bayesian Methods for Media Mix Modeling with Carryover and Shape Effects" (Jin et al., Google, 2017), is available from Google Research.


To begin, media mix models (MMM) aim to uncover the causal effect of paid media on a metric of interest, typically sales. Historically, the problem has largely been modeled via linear regression and the causal impact has been derived using Rubin’s potential outcomes framework.

In simple (data science) terms, this translates to

  1. Training a regression model that predicts sales using media spend and control variables.
  2. Deriving causal impact by comparing sales when media spend is at observed amount and when it is set to zero.
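As a rough sketch of this two-step recipe, consider the following toy example (the data and column names are hypothetical, purely for illustration):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# toy data: hypothetical media spend and a control variable
rng = np.random.default_rng(0)
df = pd.DataFrame({'tv_spend': rng.uniform(0, 100, 200),
                   'search_spend': rng.uniform(0, 50, 200),
                   'price': rng.uniform(8, 12, 200)})
df['sales'] = 50 + .3*df.tv_spend + .5*df.search_spend - 2*df.price + rng.normal(0, 5, 200)

# 1. train a regression model on media spend plus controls
X = df[['tv_spend', 'search_spend', 'price']]
model = LinearRegression().fit(X, df['sales'])

# 2. compare predicted sales at observed spend vs. spend set to zero
X_zero = X.copy()
X_zero[['tv_spend', 'search_spend']] = 0
causal_impact = (model.predict(X) - model.predict(X_zero)).sum()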

Estimating causal impact from observational data has a number of well-known issues, "correlation doesn't equal causation" for starters, and media mix models bring a host of unique issues of their own. An excellent review of these issues can be found here: Challenges And Opportunities In Media Mix Modeling

This paper focuses on two specific issues:

  • Carryover Effects i.e. lagged effects
  • Shape Effects i.e. diminishing returns

It also provides a Bayesian model specification, ROAS calculations, and budget optimization methods.

Carryover Effects

Carryover effects, often called lagged effects, occur when media spend affects sales across a number of days. For example, if we spend $100 on display advertising today, we may not see the effects of this spend for several days. The adstock function attempts to parameterize this phenomenon, and the paper takes two approaches to adstock modeling:

Geometric

  • This is a weighted average going back L days, where L can vary by media channel.
  • The effect has the largest impact on the day of spend and decays thereafter.
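Concretely, the paper's geometric adstock of spend x at time t is a normalized weighted average with weights w(l) = \alpha^l:

\text{adstock}(x_t; \alpha, L) = \frac{\sum_{l=0}^{L-1} \alpha^{l} x_{t-l}}{\sum_{l=0}^{L-1} \alpha^{l}}, \qquad 0 < \alpha < 1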

Delayed Adstock

  • The effect of media spend peaks θ (theta) days after the spend occurs.
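Concretely, the delayed adstock weights from the paper are

w(l; \alpha, \theta) = \alpha^{(l-\theta)^2}, \qquad l = 0, \dots, L-1

so the weight, and hence the effect, is largest at lag l = θ.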

Implementation

import numpy as np

def geoDecay(alpha, L):
    '''
    Weighted average with geometric decay.

    weight_t = alpha ^ (t-1), for t = 1 ... L

    returns: weights of length L to calculate weighted averages with.
    '''
    return alpha**(np.ones(L).cumsum()-1)

def delayed_adstock(alpha, theta, L):
    '''
    Weighted average with delayed adstock function.

    weight_t = alpha ^ ((t-1-theta)^2), for t = 1 ... L

    returns: weights of length L to calculate weighted averages with.
    '''
    return alpha**(((np.ones(L).cumsum()-1)-theta)**2)

def carryover(x, alpha, L, theta=None, func='geo'):
    '''
    Applies the adstock transformation to a spend series.

    1. x is the full vector of media spend over time.

    2. weights is a vector of length L describing how much spend from each of
       the previous L time periods contributes to the current period.

    3. L is the maximum length of the lag.

    returns: transformed vector of spend
    '''
    transformed_x = []
    if func == 'geo':
        weights = geoDecay(alpha, L)
    elif func == 'delayed':
        weights = delayed_adstock(alpha, theta, L)

    for t in range(x.shape[0]):
        upper_window = t + 1
        lower_window = max(0, upper_window - L)
        current_window_x = x[lower_window:upper_window]
        t_in_window = len(current_window_x)
        # flip the weights so that lag 0 aligns with the most recent spend
        window_weights = np.flip(weights[:t_in_window], axis=0)
        new_x = (current_window_x * window_weights).sum()
        transformed_x.append(new_x / window_weights.sum())

    return np.array(transformed_x)
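As a quick sanity check (an illustrative call, not from the article), we can push a single pulse of spend through both transforms:

# a single $100 pulse of spend in the middle of the series
spend = np.zeros(12)
spend[5] = 100.

print(carryover(spend, alpha=0.5, L=4, func='geo').round(1))
# effect is largest on the day of spend (index 5) and decays afterward

print(carryover(spend, alpha=0.5, L=4, theta=2, func='delayed').round(1))
# effect peaks theta=2 days after the spend, at index 7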

Above, we can see the shape of the geometric and delayed adstock functions with the given parameters. As expected, the geometric function has its strongest impact on the day of spend and steadily declines thereafter, whereas the delayed adstock function peaks 3 days (theta = 3) after spend and then rapidly declines.

Shape Effects and Diminishing Returns

Next, the paper addresses the phenomenon of diminishing returns, referred to as shape effects. These arise when a media channel starts to lose its effectiveness, i.e. the difference in sales between spending $0 and $50 is much larger than the difference between $200 and $250.

The Hill function, which has its roots in biochemistry, is used to model this:

\text{Hill}(x; K, S) = \frac{1}{1 + (x/K)^{-S}}, \qquad \beta \cdot \text{Hill}(x; K, S) = \beta - \frac{K^S \beta}{x^S + K^S}

where:

  • S = Slope; greater than 0
  • K = Half Saturation Point; greater than 0
  • beta = Coefficient for channel; greater than 0
  • x = media spend
def beta_hill(x, S, K, beta):
    return beta - (K**S*beta)/(x**S+K**S)

The equation above is a modified version of the original Hill function. In the original, as x approaches infinity the function approaches 1, so we multiply by a beta coefficient to account for the varying strengths of marketing channels. If we excluded beta, every channel would have the same maximum effectiveness.
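To make the diminishing-returns intuition concrete, here is the function evaluated at a few spend levels (the parameter values are hypothetical, chosen only for illustration):

# hypothetical parameters: K=100 (half-saturation), S=1, beta=1
for x in [0, 50, 200, 250]:
    print(x, round(beta_hill(x, S=1, K=100, beta=1), 3))
# 0    0.0
# 50   0.333
# 200  0.667
# 250  0.714
# going from $0 to $50 adds ~0.33, while $200 to $250 adds only ~0.05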

The paper does point out that the function has poor identifiability, for example when K (the half-saturation point) lies outside the range of observed media spend.

  • We can see this in the blue line in the graph below. Notice that we never observe the half-saturation point, i.e. the blue line hasn't started to flatten due to diminishing returns, so it is impossible for the model to identify the correct value of K. As a result, the model could incorrectly project high returns as media spend continues to increase.

The graph above illustrates the various shapes the Hill function can take on.

Combining: Carryover and Shape / Lag and Diminishing Returns

Next, we need a way to combine the carryover and shape effects. The paper suggests two possible approaches:

  1. Apply the adstock function first and then the shape function.
  2. Apply the shape function first and then the adstock function.

The paper recommends option 1 when the spend in any given time period is small, which makes sense: small per-period spend is unlikely to reach the nonlinear region of the shape curve. Conversely, for large, sustained spend it makes sense to first apply the shape (diminishing-returns) transform and then the adstock function.
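Here is a minimal helper composing the two transforms in option 1's order (my own convenience function, not from the paper):

def media_response(x, alpha, L, theta, S, K, beta, func='delayed'):
    '''Option 1: apply the carryover (adstock) transform first, then the Hill shape.'''
    adstocked = carryover(x, alpha, L, theta=theta, func=func)
    return np.array([beta_hill(z, S, K, beta) for z in adstocked])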

Model

Now we can specify the functional form of our (the paper's) media mix model:

y_t = \tau + \sum_{m=1}^{3} \beta_m \, \text{Hill}\big(x^{*}_{t,m}; K_m, S_m\big) + \lambda z_t + \epsilon_t

where x*_{t,m} is media spend transformed through the adstock function, z_t represents the control variable (price), \tau is the baseline intercept, and \epsilon_t is white noise.

Simulation

Next we move onto simulation. Simulation is always important when estimating causal effects from observational studies. With simulation we can generate a synthetic dataset, use our model to estimate causal effects and compare our estimates to the ground truth. This allows us to test our model in a way that isn’t possible with observational data.

The simulated data set will contain:

  • Two Years of Sales Data
  • Three Media Channels
  • One Control Variable (Price)
  • Media spend is generated by adding white noise to a sinusoid with a one-year period
  • Price is generated via an autoregressive series (an ARMA(2, 2) process in the code below)
# media channels: sinusoidal seasonality + white noise, scaled to [0, 1]
N = 102  # number of data points (two years of weekly observations)
t = np.linspace(0, 4*np.pi, N)

data = 3 + np.sin(t + 0.001) + 0.5 + np.random.randn(N)
media_1 = (data - min(data)) / (max(data) - min(data))

data = 3 + np.sin(t + 0.001) + 0.5 + np.random.randn(N)
media_2 = (data - min(data)) / (max(data) - min(data))

data = 3 + np.sin(t + 0.001) + 0.5 + np.random.randn(N)
media_3 = (data - min(data)) / (max(data) - min(data))

# price: ARMA(2, 2) series
from statsmodels.tsa import arima_process as arima

arparams = np.array([.7, .6])
maparams = np.array([.1, .02])
ar = np.r_[1, arparams]  # statsmodels expects the full lag polynomial, zero-lag first
ma = np.r_[1, maparams]
price_variable = arima.arma_generate_sample(ar, ma, N)

# true parameters for the three media channels
alpha_media_1 = .6
theta_media_1 = 5
k_media_1 = .2
s_media_1 = 1
beta_media_1 = .8

alpha_media_2 = .8
theta_media_2 = 3
k_media_2 = .2
s_media_2 = 2
beta_media_2 = .6

alpha_media_3 = .8
theta_media_3 = 4
k_media_3 = .2
s_media_3 = 2
beta_media_3 = .3

L = 13        # maximum lag
ru = 4        # baseline intercept (tau)
lamb = -.5    # price coefficient
ep = .05**2   # noise scale

# apply adstock first, then the Hill shape transform (option 1)
m1 = [beta_hill(x, s_media_1, k_media_1, beta_media_1) for x in carryover(media_1, alpha_media_1, L, theta=theta_media_1, func='delayed')]
m2 = [beta_hill(x, s_media_2, k_media_2, beta_media_2) for x in carryover(media_2, alpha_media_2, L, theta=theta_media_2, func='delayed')]
m3 = [beta_hill(x, s_media_3, k_media_3, beta_media_3) for x in carryover(media_3, alpha_media_3, L, theta=theta_media_3, func='delayed')]

y = np.repeat(ru, N) + m1 + m2 + m3 + (lamb*price_variable) + np.random.normal(0, ep, N)

Fitting the Model

Now that the dataset has been simulated, it is time to fit the model. The paper uses Stan; I use Python/PyMC3 instead.

import arviz as az
import pymc3 as pm
import theano.tensor as tt

with pm.Model() as m:
    # priors
    alpha = pm.Beta('alpha', 3, 3, shape=3)        # retention rate of ad effect
    theta = pm.Uniform('theta', 0, 12, shape=3)    # delay of the peak effect
    k     = pm.Beta('k', 2, 2, shape=3)            # half-saturation point
    s     = pm.Gamma('s', 3, 1, shape=3)           # slope
    beta  = pm.HalfNormal('beta', 1, shape=3)      # channel coefficient
    ru    = pm.HalfNormal('intercept', 5)
    lamb  = pm.Normal('lamb', 0, 1)
    noise = pm.InverseGamma('noise', 0.05, 0.0005)

    # adstock first, then the Hill shape transform, per channel
    transformed_m1 = [beta_hill(x, s[0], k[0], beta[0]) for x in carryover(media_1, alpha[0], L, theta=theta[0], func='delayed')]
    transformed_m2 = [beta_hill(x, s[1], k[1], beta[1]) for x in carryover(media_2, alpha[1], L, theta=theta[1], func='delayed')]
    transformed_m3 = [beta_hill(x, s[2], k[2], beta[2]) for x in carryover(media_3, alpha[2], L, theta=theta[2], func='delayed')]

    y_hat = pm.Normal('y_hat',
                      mu=ru + tt.stack(transformed_m1) + tt.stack(transformed_m2)
                         + tt.stack(transformed_m3) + lamb * price_variable,
                      sigma=noise,
                      observed=y)

    approx = pm.fit(method='svgd')  # Stein variational gradient descent
    trace = approx.sample(1000)     # draw samples from the fitted approximation

Results

  • MAPE: 0.12
  • MAE: 0.026
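For reference, a minimal sketch of how these in-sample metrics could be computed, taking the mean of the posterior predictive draws as the prediction:

with m:
    ppc = pm.sample_posterior_predictive(trace)
y_pred = ppc['y_hat'].mean(axis=0)

mape = np.mean(np.abs((y - y_pred) / y))
mae = np.mean(np.abs(y - y_pred))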

As expected, the in-sample fit (MAPE/MAE) is pretty good: 1) because it's in-sample and 2) because we generated the data with the same functional form we are modeling. So it largely boils down to how well the posterior approximation works (note that SVGD is a variational method rather than MCMC).

True Parameters versus Approximated Parameters

Below, we can see the posterior distribution of model parameters versus the true model parameter (blue line). This allows us to understand how well our model performed at recovering the true model parameters.
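One way to produce such a comparison (a sketch; ref_val marks the true values used in the simulation):

with m:
    idata = az.from_pymc3(trace)

# posterior of the retention rates with the true simulated values marked
az.plot_posterior(idata, var_names=['alpha'], ref_val=[0.6, 0.8, 0.8])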

ROAS / mROAS Calculation

Now we calculate ROAS (return on ad spend) and mROAS (marginal ROAS). For the m-th channel over a change period C, the ROAS is

\text{ROAS}_m = \frac{\sum_{t} \hat{y}_t(x) - \hat{y}_t(x_{-m})}{\sum_{t \in C} x_{t,m}}

In English: predicted sales with all media channels turned on, minus predicted sales with all channels except the m-th, divided by the m-th channel's spend. The only catch is that we have to factor in the post-period effect (i.e. carryover that continues after the synthetic experiment ends), so the numerator is summed over the change period plus the post period rather than the change period alone.

# simulate the ROAS distribution by evaluating random posterior draws
media_1_roas = []
media_2_roas = []
media_3_roas = []
n_draws = 1000
burnin = 500  # discard early, lower-quality draws

for i in range(n_draws):
    # pick a random posterior draw after the burn-in
    s = np.random.randint(burnin, n_draws)
    intercept = trace['intercept'][s]
    lamb_s = trace['lamb'][s]

    s_sample1, s_sample2, s_sample3 = trace['s'][s, 0], trace['s'][s, 1], trace['s'][s, 2]
    k_sample1, k_sample2, k_sample3 = trace['k'][s, 0], trace['k'][s, 1], trace['k'][s, 2]
    b_sample1, b_sample2, b_sample3 = trace['beta'][s, 0], trace['beta'][s, 1], trace['beta'][s, 2]
    a_sample1, a_sample2, a_sample3 = trace['alpha'][s, 0], trace['alpha'][s, 1], trace['alpha'][s, 2]
    t_sample1, t_sample2, t_sample3 = trace['theta'][s, 0], trace['theta'][s, 1], trace['theta'][s, 2]

    fitted_m1 = np.array([beta_hill(x, s_sample1, k_sample1, b_sample1) for x in carryover(media_1, a_sample1, L, theta=t_sample1, func='delayed')])
    fitted_m2 = np.array([beta_hill(x, s_sample2, k_sample2, b_sample2) for x in carryover(media_2, a_sample2, L, theta=t_sample2, func='delayed')])
    fitted_m3 = np.array([beta_hill(x, s_sample3, k_sample3, b_sample3) for x in carryover(media_3, a_sample3, L, theta=t_sample3, func='delayed')])

    # predicted sales with all channels on, and with each channel removed in turn
    y_hat    = intercept + fitted_m1 + fitted_m2 + fitted_m3 + lamb_s * price_variable
    y_hat_m1 = intercept + fitted_m2 + fitted_m3 + lamb_s * price_variable
    y_hat_m2 = intercept + fitted_m1 + fitted_m3 + lamb_s * price_variable
    y_hat_m3 = intercept + fitted_m1 + fitted_m2 + lamb_s * price_variable

    # numerator includes the post period (carryover after the change period);
    # denominator covers change-period spend only
    media_1_roas.append(sum(y_hat[L:] - y_hat_m1[L:]) / media_1[L:len(media_1)-L].sum())
    media_2_roas.append(sum(y_hat[L:] - y_hat_m2[L:]) / media_2[L:len(media_2)-L].sum())
    media_3_roas.append(sum(y_hat[L:] - y_hat_m3[L:]) / media_3[L:len(media_3)-L].sum())

Optimizing the Marketing Budget

Lastly, we move on to optimizing the marketing budget. To do this, we take our modeled ROAS numbers and translate budget allocation into a constrained optimization problem.

To implement this, we can plug our fitted response curves into a SciPy optimization routine to find the spend allocation at which sales are maximized, as sketched below.
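This is a minimal sketch of such an optimization, ignoring carryover and plugging the simulation's true shape parameters into the Hill response curves for illustration (in practice you would use the fitted posterior means):

from scipy.optimize import minimize

S_hat = np.array([1., 2., 2.])       # slopes
K_hat = np.array([.2, .2, .2])       # half-saturation points
beta_hat = np.array([.8, .6, .3])    # channel coefficients
budget = 1.5                         # total spend to allocate (illustrative)

def neg_total_response(x):
    # negative total response across channels, because scipy minimizes
    return -sum(beta_hill(x[i], S_hat[i], K_hat[i], beta_hat[i]) for i in range(3))

constraints = ({'type': 'eq', 'fun': lambda x: x.sum() - budget},)
bounds = [(0, budget)] * 3

result = minimize(neg_total_response, x0=np.repeat(budget/3, 3),
                  bounds=bounds, constraints=constraints, method='SLSQP')
print(result.x)  # spend allocation per channel that maximizes predicted sales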

Summary

The paper explores many facets of media mix modeling with a focus on carryover and shape effects. It also provides the specification for a Bayesian model, explores the model's effectiveness at recovering the true parameters on a synthetic dataset, and examines the effects of priors and sample size (not covered here).

As a next step, I’d find it interesting to see how one can make decisions with this model to not only maximize sales, but to also maximize the “knowledge” of the model. We can think of this as the “exploration” phase in reinforcement learning.
