The world’s leading publication for data science, AI, and ML professionals.

Correlated Variables in Monte Carlo Simulations

Can sales for vanilla ice cream overtake chocolate?

Getting Started

Image by Nicky • 👉  PLEASE STAY SAFE 👈  from Pixabay
Image by Nicky • 👉 PLEASE STAY SAFE 👈 from Pixabay

Table of contents:

  • Introduction
  • Problem Statement
  • Data preparation
  • Wrong method 1 – Independent simulation (parametric)
  • Wrong method 2 – Independent simulation (non-parametric)
  • Method 1 – Multivariate distribution
  • Method 2— Copulas with marginal distributions
  • Method 3— Simulating historical combinations of sales growth
  • Method 4— Decorrelating store sales growth using PCA

Introduction

Monte Carlo simulation is a great forecasting tool for sales, asset returns, project ROI, and more.

In a previous article, I provide a practical introduction of how monte Carlo simulations can be used in a business setting to predict a range of possible business outcomes and their associated probabilities.

In this article, we will tackle the challenge of correlated variables in Monte Carlo simulations. We will look into 4 appropriate approaches for handling the correlation.


Problem statement

In our sample dataset, 20 years ago, annual chocolate ice cream sales were $5 million, vanilla sales were $4 million. Demand for chocolate ice cream was 25% greater than that of vanilla.

Demand for both chocolate and vanilla has grown over the previous 20 years, but the gap between chocolate and vanilla sales has narrowed. In the last year, chocolate ice cream sales were $12.6 million, vanilla sales were $11.9 million. Demand for chocolate is now only 6% greater than vanilla.

NOTE – Chocolate and vanilla ice cream sales are positively correlated to each other, i.e. chocolate ice cream sales and vanilla ice cream sales tend to increase together and decrease together.

Question: Can next year be the year where vanilla ice cream sales exceed chocolate ice cream sales?

Illustration of correlation between chocolate and vanilla ice cream sales (image by author)
Illustration of correlation between chocolate and vanilla ice cream sales (image by author)

Solution: We can run a Monte Carlo simulation to answer our question.

It would be WRONG to simulate chocolate and vanilla ice cream sales independently.


Data preparation

We first configure sample starting sales figures for chocolate and vanilla ice cream sales. The chosen figures are $5 million and $4 million respectively. We assume most people have a slight preference for chocolate ice cream.

START_CHOC = 5
START_VAN = 4

We then want to draw 20 random annual sales growth rates, assuming that growth rates for vanilla and chocolate ice cream sales follow a multivariate normal distribution. This is an extension of the normal distribution where we assume that the growth rates in sales for both ice cream flavors follow normal distributions, but there is a relationship between the two normal distributions.

We assume the mean growth rate for both flavors is 5%, with a standard deviation of 10%, and the correlation between the growth rates of both flavors is 0.9.

# Config for multivariate normal distribution
CORR = 0.9
MU = 1.05
SIGMA = 0.1

Below is the code to generate the growth rates and annual sales that are produced.

# Generate sales and growth rates for 20 periods
# using a multivariate normal distribution
cov_matrix = [[(SIGMA**2), CORR*(SIGMA**2)], 
              [CORR*(SIGMA**2), (SIGMA**2)]]
distribution = ss.multivariate_normal(
    mean=[MU, MU], cov=cov_matrix)
sample_data = distribution.rvs(20, random_state=5)
chocolate_growth = sample_data[:, 0]
vanilla_growth = sample_data[:, 1]
chocolate_sales = np.cumprod([START_CHOC] + list(chocolate_growth))
vanilla_sales = np.cumprod([START_VAN] + list(vanilla_growth))
# Prepare dataframes
growth_df = pd.DataFrame(data={
    "Growth Rate - Chocolate": chocolate_growth,
    "Growth Rate - Vanilla": vanilla_growth})
sales_df = pd.DataFrame(data={
    "Sales Chocolate (USD mln)": chocolate_sales,
    "Sales Vanilla (USD mln)": vanilla_sales})
df = pd.concat([growth_df, sales_df], axis=1)
df
Ice cream sales and growth rates (image by author)
Ice cream sales and growth rates (image by author)

Wrong method 1 – Independent simulation (parametric)

First, we will run a Monte Carlo simulation where we look into the statistical distributions of vanilla and chocolate sales, but treat both distributions as independent variables. This is a wrong approach since vanilla and chocolate ice cream sales are correlated to each other.

We know the parameters of the distribution since we have created the dataset assuming the growth rates in sales for both chocolate and vanilla follows a normal distribution with mean 1.05 (5% growth on average) and standard deviation 0.1 (10%). See previous section.

However, let’s assume we didn’t know the parameters in order to replicate the challenges we would face in a real-life situation. We would have to estimate the parameters using historical data.

# Estimate means
mean_choc = np.mean(df["Growth Rate - Chocolate"])
print(f"Mean (Chocolate): {mean_choc}")
mean_van = np.mean(df["Growth Rate - Vanilla"])
print(f"Mean (Vanilla): {mean_van}")
# Estimate standard deviations
std_choc = np.std(df["Growth Rate - Chocolate"],
                 ddof=1)
print(f"StDev (Chocolate): {std_choc}")
std_van = np.std(df["Growth Rate - Vanilla"],
                ddof=1)
print(f"StDev (Vanilla): {std_van}")
# Define normal distributions
distribution_choc = ss.norm(mean_choc, std_choc)
distribution_van = ss.norm(mean_van, std_van)
Means and standard deviations derived from 20 year sample (image by author)
Means and standard deviations derived from 20 year sample (image by author)

Next, we simulate 1000 sample growth rates for each of chocolate and vanilla ice cream sales.

growth_vanilla = distribution_choc.rvs(1000, 
    random_state=1)
growth_chocolate = distribution_van.rvs(1000, 
    random_state=2)

Now let’s prepare a function we can use to check if vanilla sales would exceed chocolate sales given those growth rates.

def exceed_check(growth_vanilla, growth_chocolate):
    '''This function takes sample growth rates for
    vanilla and chocolate ice cream sales and checks
    if vanilla ice cream sales would exceed chocolate
    given the combination of growth rates.

    Args:
    growth_vanilla (list): growth rates for vanilla sales
    growth_chocolate (list): growth rates for chocolate sales

    Returns:
    flags_list (list): A list of True/False flags indicating
        whether vanilla sales exceeds chocolate sales for the
        given combination of growth rates
    mean_pass_rate: Mean value of flags_list indicating the 
        probability vanilla sales exceeds chocolates given the 
        combination of growth rates
    '''
    # Last year's chocolate and vanilla ice cream sales
    # set as constants to improve performance
    FINAL_CHOCOLATE = 12.59
    FINAL_VANILLA = 11.94

    flags_list = [v*FINAL_VANILLA > c*FINAL_CHOCOLATE 
        for v,c in zip(growth_vanilla, growth_chocolate)]

    mean_pass_rate = np.mean(flags_list)

    return flags_list, mean_pass_rate

Finally, let’s check the probability of vanilla sales exceeding chocolate…

exceed_check(growth_vanilla, growth_chocolate)[1]

Probability: 34.1%


Wrong method 2 – Independent simulation (non-parametric)

In the second wrong approach, we simulate sales growth rates for chocolate and vanilla independently, but we do not assume any statistical distribution. We draw sample growth rates from history.

growth_vanilla = df["Growth Rate - Vanilla"].dropna(
    ).sample(1000, replace=True, random_state=1)
growth_chocolate = df["Growth Rate - Chocolate"].dropna(
    ).sample(1000, replace=True, random_state=2)

Now we check for the probability of vanilla sales exceeding chocolate…

exceed_check(growth_vanilla, growth_chocolate)[1]

Probability: 35.9%

So far, both wrong approaches where we treat chocolate and vanilla sales independently give us a probability between 34% and 36%… there is a good chance next year will be the year for vanilla ice cream fans after all…

Image by Clker-Free-Vector-Images from Pixabay
Image by Clker-Free-Vector-Images from Pixabay

Method 1- Multivariate distribution

The first appropriate approach we will use to check if vanilla sales may exceed chocolate sales is to estimate the multivariate distribution that represents data and to draw potential growth rates from that sample. A multivariate distribution models the combined distribution of multiple variable. Correlations are captured.

Remember we actually generated the data from a multivariate normal distribution with means of 1.05 for both chocolate and vanilla, standard deviations of 0.1, and a correlation of 0.9. Here though we will assume we dont know the parameters and we’ll use the historical data to estimate the distribution parameters as we did in the independent parametric simulation ("Wrong method 1" section).

We have already estimated the means and standard deviations in the aforementioned sections, but we still need to estimate the correlation between chocolate and vanilla ice cream sales.

corr = df.corr().iloc[0, 1]
corr
The estimated correlation is not far from the actual 0.9 value we used to generate the data (image by author)
The estimated correlation is not far from the actual 0.9 value we used to generate the data (image by author)

We can now draw 1000 sample growth rates from the multivariate normal distribution…

# We need to recalculate the covariance matrix
# using the estimated paramaters
cov_matrix = [[std_choc**2, corr*std_choc*std_van], 
              [corr*std_choc*std_van, std_van**2]]
growth_rates = ss.multivariate_normal(
    mean=[mean_choc, mean_van],
    cov=cov_matrix).rvs(1000, random_state=1)
growth_chocolate = growth_rates[:, 0]
growth_vanilla = growth_rates[:, 1]

Finally, we calculate the probability of vanilla sales exceeding chocolate…

exceed_check(growth_vanilla, growth_chocolate)[1]

Probability: 11.3%! Notice how the probability is now much lower after we have taken the correlation between chocolate and vanilla ice cream sales into consideration.

Looks like chocolate may remain the dominant flavor after all…

Image by Please Don't sell My Artwork AS IS from Pixabay
Image by Please Don’t sell My Artwork AS IS from Pixabay

Method 2 – Copulas with marginal distributions

In the previous section, we looked at a multivariate normal distribution. Both chocolate and vanilla ice cream sales followed a normal distribution each and they are correlated. This made it easy to represent them in a multivariate normal distribution.

Sometimes we may have 2 or more variables that come from different statistical distributions and there is no known multivariate distribution that can explain their combined distribution. If we do know the individual (marginal) distributions of the variables + we know the correlations between them, then copulas can help.

A detailed explanation of copulas is out of scope, but simply put…

Suppose the lowest possible value a variable can have is 0 and the highest is 1. Copulas generate a combination of values in this range such that the relationship implied by the correlation between the variables is maintained. Since our variables are chocolate and vanilla ice cream sales and they are positively correlated, copulas will draw values for growth in chocolate ice cream sales closer to 1 (highest value) when vanilla also has a high value and vice verse.

There are existing libraries that can be installed to utilize copulas, but below we’ll use a hack to construct a copula (a Gaussian one) from scratch and it might help to understand the process better.

First, we need to recalculate the correlation between our 2 variables, chocolate and vanilla sales growth, because copulas are based on rank correlation. In the previous section we calculated the Pearson Correlation Coefficient, but now we will use Kendall’s Tau which is a measure of rank correlation. Rank correlation is based on calculating the correlation between the ranks of the values rather than the values themselves.

corr = ss.kendalltau(
    df.dropna().iloc[:, 0], df.dropna().iloc[:, 1])[0]
corr
Rank correlation of 0.74 is lower than the previous 0.93 correlation we have calculated (image by author)
Rank correlation of 0.74 is lower than the previous 0.93 correlation we have calculated (image by author)

Now we can construct our copula…

# We set variances to 1 because the covariance matrix we 
# are constructing will be used with a multivariate normal 
# distribution of means 0 and std 1 to derive a copula
cov_matrix = [[1, corr], 
              [corr, 1]]
# We will draw 1000 combinations from the distribution
random_vals = ss.multivariate_normal(cov=cov).rvs(
    1000, random_state=1)
# Finally a cumulative density function for a distribution
# ranges between 0 to 1 so it will be used to convert the
# drawn samples to the values we need for our copula
copula = ss.norm.cdf(random_vals)
print(copula.shape)
copula
Image by author
Image by author

We can see the copula has 1000 correlated entries across 2 variables and they range from 0 to 1.

Below is a scatter plot of the correlated copula values for our 2 variables.

sns.scatterplot(x=copula[:, 0], y=copula[:, 1])
Image by author
Image by author

The copula values between 0 and 1 capture the correlation between the variables, so now we just have to convert the 0–1 values to the actual values we want to use for the sales growth rates of chocolate and vanilla ice cream.

We use percent point functions for the distributions of each variable to the conversion. We have already estimated the distributions in the independent parametric simulation section ("Wrong method 1"). The difference is that we now don’t just randomly draw values from each distribution, instead, we use the copula values and percent point functions to draw correlated random values from each distribution.

# distribution_choc and distribution_van area already
# calculated in previous sections
growth_chocolate = distribution_choc.ppf(copula[:, 0])
growth_vanilla = distribution_van.ppf(copula[:, 1])

Finally, let’s pass those growth rates to our function to check the likelihood of vanilla sales exceeding chocolate…

exceed_check(growth_vanilla, growth_chocolate)[1]

Probability: 11.1%

The probability of 11% is similar to the probability from the previous section and confirms that we have grossly overestimated the probability in the 2 wrong approaches we applied at first.

There are several steps in this approach and probably a more thorough explanation of different concepts is needed. The one thing to keep in mind though is that the values between 0 and 1 that we generated using our copula can be used with any combination of statistical distributions.

Our example is easy where both variables’ distributions are normal, so the approach in the previous section (multivariate normal distribution) is sufficient and actually should yield the same results as this section. But imagine a different scenario where variable A is normally distribution, variable B is binomial etc. In such cases, copulas would be needed.

A hypothetical scenario would be forecasting movie ticket sales and popcorn sales where we model the number of cinema visitors as a Poisson distribution and then we model the popcorn sales using a Binomial distribution where every cinema visitor who bought a ticket has a certain probability of buying popcorn.


Method 3 – Simulating historical combinations of sales growth

This is a simple approach similar to the independent non-parametric simulation (section "Wrong method 2"), where we don’t estimate any statistical distributions and we directly draw samples from history.

The correction we apply is that we don’t independently take historical growth rates for chocolate and vanilla sales from different points of history. Instead we randomly take a point in history and we take both growth rates from that same point in history. We then draw another point in history and do the same until we get to number of rounds of simulation we want to run.

We replay history and take growth rates for both flavors from Year 5 in the first round of simulations, for example, then Year 11 for the 2nd round etc.

In the independent approach in the section "Wrong method 2", we could take the growth rate for vanilla from Year 5, for example, and the growth rate for chocolate from Year 11 in the same round of simulation. We were replaying the history of chocolate and vanilla ice cream sales independently of one another.

resampled_history = df[["Growth Rate - Chocolate", 
    "Growth Rate - Vanilla"]].dropna(
    ).sample(1000, replace=True, random_state=1)
growth_chocolate = resampled_history["Growth Rate - Chocolate"]
growth_vanilla = resampled_history["Growth Rate - Vanilla"]
exceed_check(growth_vanilla, growth_chocolate)[1]

Probability: 21% This probability is higher than the expected ~11% derived from the last 2 appropriate methods but lower than the ~35% implied by the wrong simulation approaches.

This method relies on having sufficient history and is normally more suitable for financial data, for example, where we have many historical data points.

In our example, we only have 20 years of data representing 20 combinations of historical growth rates for chocolate and vanilla ice cream sales. This is a small number and we would be giving too much weight for a few points that happened in the past.


Method 4 – Decorrelating store sales growth using PCA

In the previous method, the challenge in our approach is that we were resampling from only 20 data points because we had 20 combinations of historical vanilla and chocolate ice cream sales growth.

In the independent non-parametric simulation ("Wrong method 2" section), the approach was wrong because we treated vanilla and chocolate sales growth rates as independent. This allowed us though to have 400 (20 x 20) historical data points to resample from since we could combine any of the 20 historical sales growth rates for chocolate with any of the 20 historical growth rates of vanilla. We were are able to combine the growth rate of chocolate in Year 1 with vanilla in Year 2… in method 3, we couldn’t.

In this final method, we follow a very similar approach where we do not estimate any statistical distributions and we replay historical growth rates.

The cornenrstone of this method is that we first decorrelate the growth rates of ice cream sales of both flavors. This approach will allow us to combine components of historical growth rates of chocolates from a certain year with those of vanilla from a different year since we eliminated the correlation that we have to control for. We end up with more data points to resample from.

Step 1 is to use a decorrelation method. We use PCA in this example which I assume you are familiar with. The Choletsky decomposition is also a popular alternative used in finance.

Note that PCA is commonly used in Data Science for dimensionality reduction but what we are really interested in is the fact that PCA components are decorrelated. In our case, we are not looking to reduce dimensions so we will pass in two variables (sales growth rates for chocolate and vanilla) and we will obtain two decorrelated components. Since these principal components are decorrelated, we can mix and match the components from different points in history and get 400 combinations of chocolate and vanilla sales growth rates that we can resample from.

pca = PCA(n_components=2)
# Normalize chocolate and vanilla growth rates to apply PCA
normalized_chocolate = (
    df["Growth Rate - Chocolate"].dropna() - mean_choc
) / std_choc
normalized_vanilla = (
    df["Growth Rate - Vanilla"].dropna() - mean_van
) / std_van
# Apply PCA transformation on normalized values
components = pca.fit_transform(
    np.array([normalized_chocolate, 
    normalized_vanilla]).T)
components
20 x 2 array representing 20 years of data and 2 PCA components (image by author)
20 x 2 array representing 20 years of data and 2 PCA components (image by author)

We next run 1000 simulations where we sample from the components array.

np.random.seed(1)
sampled_components = [[x[0], x[1]] for x in zip(
    # Sampling 1000 entries from first PCA component
    np.random.choice(components[:, 0], 1000),
    # 2nd PCA component
    np.random.choice(components[:, 1], 1000))]

Next step is to convert the sampled component values into growth rates by inversing the transformations (normalization and PCA) we have done.

inverse_pca = pca.inverse_transform(sampled_components)
# Denormalizing
growth_chocolate = [(x * std_choc) + mean_choc 
    for x in inverse_pca[:, 0]]
growth_vanilla = [(x * std_van) + mean_van 
    for x in inverse_pca[:, 1]]

Finally…

exceed_check(growth_vanilla, growth_chocolate)[1]

Probability: 17.1% which is closer to the 11% we are expecting based on methods 1 and 2.

Conclusion

Below are the probabilities we have generated for vanilla ice cream sales overtaking chocolate. We used 2 wrong approaches and 4 appropriate ones.

Image by author
Image by author

The 2 wrong methods, which run the Monte Carlo simulations assuming vanilla and chocolate ice cream sales are independent, grossly overestimate the probability vanilla sales could overtake chocolate next year.

Since the sales for both flavors are assumed to be independent, there is an unrealistically high probability assumed where vanilla sales will see a much higher than normal growth rate while chocolate sees a much lower than usual growth rate. Reality indicates that the sales for both flavors are positively correlated to each other, so in most cases, growth is high for both flavors or low for both flavors with a slight variation in performance.

Methods 1 and 2, which imply a 11% probability vanilla overtakes chocolate, are the most accurate. That is because the actual data was derived from a multivariate normal distribution. However, the choice of best method to use depends on the dataset at hand.

Methods 1 and 2 require that the underlying statistical distribution can be estimated fairly accurately. Methods 3 and 4 rely on having a sufficiently large dataset.

The most important takeaway is to not ignore correlations in Monte Carlo simulations.


Related Articles