
What is a Monte Carlo Simulation?
Monte Carlo Simulation (referred to from here on as MCS), also known as multiple probability simulation, is a method for estimating the probability of the outcomes of an uncertain event.
This method became famous after the mathematician Stanislaw Ulam worked on it during the project to build the atomic bomb. Despite the grim purpose that made MCS famous, it is actually very useful for getting approximate answers to many problems simply by simulating them over and over again. And with ever-increasing processing power, MCS has become another good resource to have in your toolbox.
Monte Carlo Simulation can be understood as the process of repeating the same experiment n times, using randomly generated numbers that follow the same distribution as our data to simulate a variable of the problem.
Monte Carlo Simulation and the LLN
A concept that supports MCS is the Law of Large Numbers. This famous law states that, as we increase the number of repetitions of an experiment, the mean of the results tends to get closer to the true probability of the event.

We all know that the chance of getting Heads or Tails when flipping a fair coin is 50% for each side. But if you flip a coin only twice, you can get Heads-Heads (H-H), Tails-Tails (T-T), H-T or T-H. Half of the time you will see only Heads (100% | 0%) or only Tails (0% | 100%), which is far from the true probability. Now flip the same coin hundreds of times and you will almost certainly see something close to a 50/50 split.
That is why MCS runs the same experiment many times.
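To see the Law of Large Numbers in action yourself, here is a minimal sketch in R (an illustration of my own, not part of the fish example below) that tracks the proportion of Heads as the number of flips grows:

# Flip a fair coin 10,000 times
set.seed(42)
flips <- sample(c("H", "T"), size = 10000, replace = TRUE)
# Running proportion of Heads after each flip
running_heads <- cumsum(flips == "H") / seq_along(flips)
# The proportion wanders early on and then settles close to 0.5
plot(running_heads, type = "l", xlab = "Number of flips", ylab = "Proportion of Heads")
abline(h = 0.5, col = "red", lty = 2)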
The Fish Problem
Imagine a lake. We want to know how many fish are in it. A classic MCS example solves this by saying we could initially catch 100 fish and mark them. Now we know there are 100 marked fish in the lake, but we don't know what proportion of the total they represent.
Therefore, we catch 100 fish a few more times and check what percentage of them are marked. Once we know the approximate percentage of marked fish, all we need to do to get the total is scale up to the whole (e.g. if the MCS gives us 20% marked fish, then the 100 marked fish in the lake represent 20% of the total, so the total would be 500).
Let’s see that in code.
# Creating the lake with 100 marked fish and 400 unmarked fish
fishes <- data.frame(num = 1:500,
                     marked = rep(c(1,1,0,0,0,0,0,0,0,0), times = 50))

# Confirming there are 100 marked fish
marked_fish_total <- sum(fishes[fishes$marked == 1, 2])
print( paste("Marked fish in the lake:", marked_fish_total) )
### Monte Carlo Simulation ###
monte_carlo <- function (reps, dataset){
  "Run the same experiment for reps repetitions and return the average proportion
  * reps: [int] How many repetitions of the same experiment
  * dataset: [data.frame] the data frame with the fish data
  "
  # Creating a vector to collect the proportion of marked fish in each catch
  marked_prop <- c()

  # Simulate catching 100 fish, repeated reps times
  for (experiment in seq(1, reps)) {
    # Shuffle the dataset
    shuffled <- sample(nrow(dataset))
    dataset <- dataset[shuffled, ]
    # Create a random index before catching the fish
    index <- sample(1:500, size = 100)
    fishing <- dataset[index, ]
    # Calculate the proportion of marked fish in this catch
    p <- mean(fishing$marked)
    # Store the result in the vector
    marked_prop <- c(marked_prop, p)
  } # close for loop

  # Calculate and return the mean proportion after reps repetitions
  return (mean(marked_prop))
} # close the function
# Running the Monte Carlo Simulation
marked_fish_proportion <- monte_carlo(1000, fishes)
"If we know the percentage of marked fish and we know that the total
of marked fish in that lake is 100, then we can calculate the 100%
using this formula:
Marked Fish Total / Marked Fish Proportion "
# Total Fish in the lake
marked_fish_total / marked_fish_proportion
After running the entire code above 500 times, here is the histogram of the results.
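If you want to reproduce that experiment, a minimal sketch (my own assumption of how it could be done, reusing the function above) would be:

# Repeat the whole simulation 500 times (this can take a moment)
runs <- replicate(500, monte_carlo(1000, fishes))
# Turn each simulated proportion into an estimated total and look at the spread
estimates <- marked_fish_total / runs
hist(estimates, col = 'royalblue', main = 'Estimated total fish over 500 runs')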

This example problem is shared in the Portuguese version of the Wikipedia article on the Monte Carlo method, and I might say it looks more like a Bootstrap problem than a Monte Carlo one, although it is not.
How is MCS different from Bootstrapping?
Both techniques are closely related, since both rely on repetition to reach an estimate, leaning on the Law of Large Numbers and the Central Limit Theorem.
However, the big difference between them is that the Bootstrap technique resamples the data you already have and computes the estimated statistic from those resamples, like a fancy average calculation.
The MCS technique, on the other hand, creates new random numbers that follow the same distribution as your real data. It simulates a new dataset and helps you answer the what-if question: "if new data kept coming in under the same conditions, what would happen to the statistic I am trying to estimate?"
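To make the contrast concrete, here is a minimal sketch using made-up attendance data (similar in spirit to the restaurant example below): the Bootstrap resamples the observed values, while MCS draws brand-new values from a distribution fitted to them.

# Made-up data: 90 nightly attendance counts
set.seed(1)
attendance <- as.integer(rnorm(90, mean = 200, sd = 40))
# Bootstrap: resample the data we already have, with replacement
boot_means <- replicate(1000, mean(sample(attendance, replace = TRUE)))
# Monte Carlo: simulate new data from a distribution fitted to the observations
mc_means <- replicate(1000, mean(rnorm(90, mean = mean(attendance), sd = sd(attendance))))

Both vectors of means look similar here because the data really are normal; the practical difference is where the randomness comes from: the sample itself versus a fitted distribution.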
Understanding the MCS Logic using R
To apply the Monte Carlo method, it is important to keep in mind the following main steps:
1. Define your inputs. What are you trying to predict or estimate, and what are your explanatory variables?
Imagine we were hired by a restaurant that wants to know the average number of customers coming to dinner each night over a month. The owner gives us 90 days of attendance: the average number of people per night is 193, nights with 250 or more people happened a couple of times, and on the calmest nights around 150 people came to dinner.
# Creating a Normal Distribution
n <- rnorm(90, mean=200, sd=40)
n <- as.integer(n)
The 90 values generated for n (day 1 through day 90):
171 163 170 153 247 251 213 163 233 277 197 175 165 200 155 238 260 183 253 199 299 236 207 147 219 238 234
256 177 160 143 188 241 230 234 143 152 201 241 180 153 144 163 196 103 177 209 254 240 203 191 105 178 219
263 241 183 176 161 167 291 179 273 186 239 192 199 192 121 270 182 196 162 187 177 168 201 143 185 228 217
149 238 177 201 220 182 190 193 186

2. Define the distributions of the explanatory variables (e.g. normal, continuous uniform, exponential).
In this case, I generated the data from a normal distribution, and treating averages as normally distributed is usually reasonable: the Central Limit Theorem tells us that the averages of samples converge to an approximately normal curve.
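As a quick side illustration of that theorem (separate from the restaurant data), averages of samples drawn even from a skewed distribution pile up into a bell shape:

# Draw 1,000 samples of size 50 from a skewed (exponential) distribution
set.seed(7)
sample_means <- replicate(1000, mean(rexp(50, rate = 1)))
# The histogram of the sample means is approximately normal
hist(sample_means, col = 'royalblue', main = 'Sample means from a skewed distribution')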
3. Our next step is to create a distribution for our simulation. Therefore, we calculate the Cumulative Distribution Function, or CDF, which tells us the probability that a value falls at or below a given number.
To make the statement above more clear, let’s plot the CDF.
# Sample statistics for the 90 days (about 193 and 40 for this data)
customer_avg <- mean(n)
customer_std <- sd(n)

# Plot the CDF
plot( x = n,
      y = pnorm(n, mean = customer_avg, sd = customer_std),
      xlab = 'Customers',
      ylab = 'Probability',
      col = 'red',
      main = 'CDF for Customers')

Notice that each number of customers on the x-axis has an associated cumulative probability on the y-axis. So, if you look for the number of customers with a cumulative probability of 50%, you will find the mean: 193.
4. With that distribution in hand, you can now generate N random numbers between 0 and 1. Those numbers represent probabilities that you can map back through the distribution from step 3, thus generating simulated attendances for the restaurant.
# Simulated attendance
mcs_customers <- function(simulations){
  "This function takes an integer and runs that many repetitions of the simulation"
  # Create a vector to store the results
  mcs_results <- c()

  # Loop
  for (i in 1:simulations){
    # Generate a random probability between 0 and 1
    r <- runif(1)
    # Use our inverse CDF to get the simulated number of customers
    simulated <- qnorm(r, mean = customer_avg, sd = customer_std)
    # Round down to the nearest whole customer
    simulated <- floor(simulated)
    # Store the result
    mcs_results <- c(mcs_results, simulated)
  } # end loop

  # Plot histogram
  hist(mcs_results, col = 'royalblue')
  # Return vector
  return(mcs_results)
} # end function
Applying the function above, this is the result.
mcs <- mcs_customers(500)

Comparing both results, look how similar they are. We can say that, holding the other variables constant, the average number of customers per evening will hover around 200 people. With that number in hand, the owner can plan staffing and the number of tables, estimate revenue and run other strategic analyses.

Applying MC Simulation in Python
The same exercise can be performed with Python. Personally, I think R is better suited for quick MCS, as it has the needed statistical tools already built in, such as functions to generate random numbers and to calculate the CDF and the inverse of a distribution.
In Python, scipy is the best library for those tasks, along with numpy.
#Imports
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy.stats as scs
# 90 Days of Clients
customers = [171, 163, 170, 153, 247, 251, 213, 163, 233, 277, 197, 175, 165, 200, 155, 238, 260, 183, 253, 199, 299, 236, 207, 147 ,219, 238, 234, 256, 177, 160, 143, 188, 241, 230, 234, 143, 152,201, 241, 180, 153, 144, 163, 196, 103, 177, 209, 254, 240, 203, 191, 105, 178, 219,263, 241, 183, 176, 161, 167, 291, 179, 273, 186, 239, 192, 199 ,192, 121, 270, 182, 196, 162,187, 177,168, 201, 143, 185, 228, 217, 149, 238, 177, 201, 220, 182, 190, 193, 186]
# Histogram of customers
sns.histplot(customers)
plt.title('Histogram of Customers');
# Calculating Mean and Std of customers
cust_avg = np.mean(customers)
cust_std = np.std(customers)
# Calculating CDF from the distribution
cdf = scs.norm.cdf(customers, loc=cust_avg, scale=cust_std)
#Plot CDF
sns.lineplot(x=customers, y=cdf)
plt.annotate(f'mean: {round(cust_avg,1)}', xy=(cust_avg+10, 0.5),
bbox=dict(boxstyle="larrow, pad=0.4", fc="lightgray", ec="r", lw=0));

# Function to perform simulations
def mcs(simulations):
    "Provide integer with number of simulations and the function returns an array and histogram with the results"
    # List to store results
    results = []
    # Loop
    for n in range(simulations):
        # Generate a random number as a probability
        r = np.random.uniform()
        # Use our CDF to capture the simulated quantity of customers
        sim_cust = scs.norm.ppf(r, loc=cust_avg, scale=cust_std)
        # Math floor of the number
        sim_cust = int(sim_cust)
        # Store result
        results.append(sim_cust)
    # Plot Histogram
    sns.histplot(results)
    plt.title('Histogram of Results MCS')
    # Return array
    return results
# Apply mcs function
r = mcs(500)

We can see that the simulated numbers follow reality closely. As in the coin example, the more we simulate, the closer the results get to the real probabilities.
Let's take a look at simulated customers by weekday, considering that day 1 was a Monday.

It looks like we need more staffing on Tuesdays and on the weekend (Fri-Sun).

That is very powerful.
Before You Go
Before you go, I'd like to leave the logic written out, so you can practice and extend it to more complex problems; a compact sketch of the full recipe follows the list.
I have been studying Monte Carlo Simulations and I know that much more can be done.
- Determine your problem. Know what you want to simulate.
- Find the distribution of your original data (normal, uniform).
- Calculate the Cumulative Distribution Function (CDF).
- Generate a random number between 0 and 1 to be your random probability.
- This number will be translated into a simulated value using the inverse of your CDF (the quantile function).
- Repeat as many times as needed to calculate what you want.
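Here is that recipe condensed into one small R sketch; the data and parameters are illustrative assumptions, not the restaurant's real numbers.

# Determine the problem and find the distribution of the original data (assumed normal here)
set.seed(123)
observed <- as.integer(rnorm(90, mean = 200, sd = 40))
mu <- mean(observed)
sigma <- sd(observed)
# Generate a random probability and translate it through the inverse CDF
simulate_once <- function() {
  r <- runif(1)
  floor(qnorm(r, mean = mu, sd = sigma))
}
# Repeat as many times as needed and inspect the simulated distribution
simulated <- replicate(1000, simulate_once())
hist(simulated, col = 'royalblue')

Note that qnorm(runif(1), mu, sigma) is mathematically equivalent to calling rnorm(1, mu, sigma) directly; the explicit inverse-CDF step is kept only to mirror the logic above.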
If you like this content, you can subscribe to Medium using this referral code.
You can also follow my blog for more content.
References
Definition of Monte Carlo Simulation: https://tinyurl.com/3nkxpe4p
Monte Carlo method – Wikipedia
What is Monte Carlo Simulation?
The 4 Simple Steps for Creating a Monte Carlo Simulation with Engage or Workspace