Simulations: Solve complex problems using randomness: Part 1

Estimate the probability of favorable outcomes by designing experiments and repeating them many times in a virtual world.

Pratha Pawar
Towards Data Science


We have all dealt with probabilities in our lives, be it in a game that requires rolling a die or dealing cards, estimating our chances of winning a sports bet, or calculating the effectiveness of a vaccine. Probability shows up wherever you want to estimate the “chance” of a favorable (or unfavorable) outcome out of all the possible outcomes.

Note that this article is not introductory and some basic prior knowledge on probability, including Bayes’ theorem, is needed.


A Quick Introduction

Simply defined, probability is the chance that a given event will occur. Or, in more technical terms, per Merriam-Webster, it is the ratio of the number of outcomes in an exhaustive set of equally likely outcomes that produce a given event to the total number of possible outcomes.

i.e., P(event) = (number of favorable outcomes) / (total number of possible outcomes)

  • For a quick and very familiar example, let’s take a fair coin. The two possible outcomes are Heads (H) and Tails (T), each with an equal chance of occurrence.

Thus, P(H) = 1/2 and P(T) = 1/2: in each case, the total number of outcomes is 2 → {H, T}, and there is 1 favorable outcome for Heads {H} and 1 for Tails {T}, so the probability of each is 1/2. Note that the probabilities of all possible outcomes must sum to 1. That is, P(Heads or Tails) = P(Heads) + P(Tails) - P(Heads and Tails) = 1/2 + 1/2 - 0 = 1. Here, the coin landing on both H and T at the same time is impossible; Heads and Tails are mutually exclusive events (they cannot occur at the same time).

A quick recap of the Bayes’ theorem

Given a hypothesis H and evidence E, Bayes’ theorem relates the probability of the hypothesis before seeing the evidence, P(H), to the probability of the hypothesis after seeing the evidence, P(H | E). It’s a very important concept to understand, and if you are not familiar with conditional probability and Bayes’ theorem, please feel free to refer to these links [1][2][3].

Bayes’ theorem: P(H | E) = P(E | H) × P(H) / P(E), where P(H) is the prior probability of the hypothesis, P(E | H) is the probability of the evidence given the hypothesis, P(E) is the overall probability of the evidence, and P(H | E) is the posterior probability of the hypothesis after seeing the evidence.

And if you want to do a more detailed recap of probability, please refer to this wiki link.

Simulating Probabilities: Why?

Now, most probability problems can be solved on paper. But sometimes the problems get complex and time consuming, and even after you solve one successfully, there may be no reference against which to check whether your calculated answer is correct, especially if you use probability in real-life, practical scenarios, because those problems will most likely be unique. So, simulating a problem or experiment to calculate the probability of an event of interest is useful to:

  • Gain deeper insights into the problem by actually performing the experiment(s) virtually.
  • Validate your calculated answer.
  • Learn how to simulate simple to complex real-life experiments and scenarios using code, e.g. modeling the economic stability of a region, or modeling COVID spread in a country.

Simulating Probabilities: Design steps

  • Design the experiment. This comes from the problem statement or the scenario you are trying to figure out the probability for.
  • Repeat the experiment N times. The larger the N, the better your estimated probabilities will be (law of large numbers).
  • Calculate the probability as the ratio of favorable outcomes to the total number of outcomes observed.

Note: We will be using Python’s built-in random module to generate “random” samples. In simple words, if you ask the random module to pick between Heads and Tails, each having an equal probability of 0.5, it will pick one at random. But if you repeat this a large number of times, it will, on average, pick Heads ~50% of the time and Tails the other ~50% of the time. (We could also use numpy’s random module, which is faster for arrays.)

A side note on the random module: it generates pseudo-random numbers, that is, the numbers may appear random, but they aren’t truly random in a strict scientific sense [source]. Having said that, for most of our daily use cases, this module should suffice. But it is not advisable to use it for cryptography or similar risk- and privacy-related applications.

Please refer to links [4] and [5] for more details.

Let’s see this process in action with an example.

  • Let’s say you are trying to answer the question: What is the probability of getting 2 heads (H) when you flip two fair coins or flip a fair coin twice?
  • Since each coin toss is independent of the other toss (and of its own previous tosses), P(2H) = P(H) * P(H) = 0.5 * 0.5 = 0.25 = 1/4. You can also put it intuitively this way → the four possible outcomes of two tosses are {HH, HT, TH, TT}. Out of these 4, only 1 matches our expectation of two heads {HH}. So, the answer is 1/4.
  • Let’s simulate this experiment and see what we find.
Simple Coin Flip example (Observing X Heads in N coin flips)
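A minimal sketch of this experiment, using the coin_flip function and n_flips parameter described below (everything else, such as the variable names and N, is illustrative rather than the exact embedded code), might look like this:

```python
import random
import pandas as pd

def coin_flip(n_flips=2):
    """Flip n_flips fair coins; H counts as 1 and T as 0,
    so the return value is the number of Heads observed."""
    outcomes = random.choices(population=[1, 0], weights=[0.5, 0.5], k=n_flips)
    return sum(outcomes)

# Repeat the experiment N times and summarize the results
N = 100_000
results = pd.DataFrame({"n_heads": [coin_flip(n_flips=2) for _ in range(N)]})

# P(2 Heads) = favorable experiments / total experiments
p_two_heads = (results["n_heads"] == 2).mean()
print(p_two_heads)  # ~0.25
```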

The function coin_flip is our single, modular experiment, which mimics flipping n_flips fair coins (or flipping one fair coin n_flips times). We have used random.choices to simulate the flips. The population parameter is the list of outcomes, weights is the list of probabilities for each of those outcomes (if not passed, an equal probability for each outcome is assumed), and k is the number of samples to be drawn with replacement. Note that we choose “with replacement” because getting Heads or Tails on the 2nd flip does not depend on the outcome of the first flip; the flips are independent events. You could also have used random.choice for a single flip, put it in a for loop running n_flips times, and then calculated the sum of those outcomes, but that would have been slower and the results wouldn’t have been much different.

The designed outcome of our ‘experiment’ function is the sum of the n_flips outcomes, where, for a single flip, H and T are assigned values of 1 and 0 respectively. So, for a two-coin-flip experiment, we are looking for outcomes where sum = 2 (i.e. 2 Heads). (Note that if we were interested in scenarios with 1H and 1T, our filter would have been sum = 1. And if we were interested in 2T, we would have used a filter of sum = 0.)

We then repeat the experiment N times (the more, the better) and summarize the results in a pandas DataFrame.

The probability is calculated as the ratio of favorable outcomes (those with a sum of 2) to the total number of experiments. As you can see, this comes out to be ~0.25, which is close to our known calculated probability of 1/4.

Are you ready? We will use the same approach to calculate probabilities and/or expected values for some simple to complex problems (in this post as well as in Part 2). What better way to start than the (in)famous Monty Hall problem, which had baffled some of the brightest statistical minds in the late 20th century.

Monty Hall

The original problem

Source: https://en.wikipedia.org/wiki/Monty_Hall_problem

You are in a game show. The host shows you 3 doors and tells you that there is a car behind one door and goats behind the other two. The host knows exactly what is behind which door.

  • He asks you to pick one door. Say you pick door 1.
  • He then opens door 3, with a goat behind it. Note that the host is guaranteed to open a door with a goat behind it, because he knows what is behind every door.
  • He then asks you whether you want to go ahead with your previous choice of door 1 or switch to door 2?

What should you do to maximize your chances of winning?

Let’s solve this problem on paper first. Initially, when you pick a door, there is a 1/3rd chance of you picking a door with a car behind it. So, door 1 has 1/3rd chance of being the correct choice, and doors 2 and 3 combined have a 2/3rd chance of having a car behind any one of them. When the host opens door 3 to show a goat behind it, the P(car behind door 3) becomes 0, and P(car behind door 2) becomes 2/3rd. The P(car behind door 1) remains the same (1/3rd) because you gain no new information about door 1. So, to answer the question, you will win with a 2/3rd probability if you switch to door 2.

Yet another way to visualize the Monty Hall problem. Note: whatever scenarios you build up, make sure that each door gets the car an equal number of times. For example, if you play out 30 scenarios, 10 of them should have the car behind door 1, 10 behind door 2, and 10 behind door 3, because the initial probability of the car being behind any given door is always 1/3rd.

You can also use Bayes’ theorem to solve this, as follows.

Image and equations provided by author
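To spell the calculation out: say you pick door 1, let C1, C2, C3 be the events that the car is behind doors 1, 2, and 3 respectively, and let D3 be the event that the host opens door 3. Then:

P(D3 | C1) = 1/2 (the host can open door 2 or door 3), P(D3 | C2) = 1 (he must open door 3), P(D3 | C3) = 0

P(D3) = (1/2)(1/3) + (1)(1/3) + (0)(1/3) = 1/2

P(C2 | D3) = P(D3 | C2) × P(C2) / P(D3) = (1 × 1/3) / (1/2) = 2/3

P(C1 | D3) = P(D3 | C1) × P(C1) / P(D3) = (1/2 × 1/3) / (1/2) = 1/3

So switching to door 2 wins with probability 2/3, matching the reasoning above.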

Let’s simulate this scenario and see if the outcome matches our calculated value.

Monty Hall problem simulation (where host knows)
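A minimal sketch of such a simulation (the function name play_round and its structure here are illustrative, not necessarily the exact embedded code; n_rounds = 10,000 as in the results discussed below) might look like this:

```python
import random

def play_round(switch, n_doors=3):
    """One round where the host knowingly reveals a goat behind an unpicked door."""
    doors = list(range(n_doors))
    car = random.choice(doors)      # each door hides the car with equal probability
    pick = random.choice(doors)     # your initial pick
    # The host opens a door that is neither your pick nor the car
    opened = random.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to one of the remaining closed doors
        pick = random.choice([d for d in doors if d not in (pick, opened)])
    return pick == car

n_rounds = 10_000
p_switch = sum(play_round(switch=True) for _ in range(n_rounds)) / n_rounds
p_stay = sum(play_round(switch=False) for _ in range(n_rounds)) / n_rounds
print(f"win rate if you switch: {p_switch:.3f}, if you stay: {p_stay:.3f}")
# With 3 doors, these approach 2/3 and 1/3 respectively.
```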
Image provided by author

You can see that as n_rounds increases, the simulated probability approaches the theoretical probabilities. That is, out of the 10,000 games played, you won almost 2/3rd of those games by switching.

You can also increase the number of doors and play the same game (where the host reveals a goat behind one of the other remaining doors). The probability of winning by switching remains higher, but the difference between switching and not switching decreases and approaches 0 as the number of doors increases. See below. This is because the advantage gained by the other door(s) gets lower and lower as n_doors increases. (This will change if the host opens more than 1 door… feel free to play with this scenario and see what you get.)
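Reusing the play_round sketch and n_rounds from above, one illustrative way to explore this is a simple loop over n_doors:

```python
for n_doors in (3, 5, 10, 50):
    p_switch = sum(play_round(switch=True, n_doors=n_doors) for _ in range(n_rounds)) / n_rounds
    p_stay = sum(play_round(switch=False, n_doors=n_doors) for _ in range(n_rounds)) / n_rounds
    print(f"{n_doors} doors -> switch: {p_switch:.3f}, stay: {p_stay:.3f}")
```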

Image provided by author

Variation where the host doesn’t know

It’s the exact same game as above, but this time the host doesn’t know what is behind which door, and he randomly chooses a door (excluding the one you picked) to open. So he may accidentally open the door with the car behind it, in which case the game either ends, is restarted, or you automatically win.

  • The host asks you to pick one door. Say you pick door 1.
  • He then opens door 3, and finds a goat behind it.
  • He then asks you whether you want to go ahead with your previous choice of door 1 or switch to door 2?

What should you do to maximize your chances of winning?

Now, this may look exactly the same as the previous game, but the key difference here is that the host doesn’t know. This lack of knowledge on the host’s part changes everything.

Image and equations provided by author
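Spelling this out with the same notation as before (you pick door 1, the host opens door 3, this time at random), the key change is that we must condition on the host happening to reveal a goat. Let G3 be the event “the host opens door 3 and it shows a goat”:

P(G3 | C1) = 1/2 (he opens door 3 half the time, and it always hides a goat), P(G3 | C2) = 1/2, P(G3 | C3) = 0

P(G3) = (1/2)(1/3) + (1/2)(1/3) + (0)(1/3) = 1/3

P(C1 | G3) = (1/2 × 1/3) / (1/3) = 1/2

P(C2 | G3) = (1/2 × 1/3) / (1/3) = 1/2

So, given that a goat was revealed by accident, switching and staying are equally good: each wins with probability 1/2.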
Yet another way to visualize this variation. Note: as before, whatever scenarios you build up, make sure that each door gets the car an equal number of times, because the initial probability of the car being behind any given door is always 1/3rd.

Note that, in the calculations above and in the simulation below, we consider only those rounds where “a goat has been revealed”.

Let’s simulate this scenario and see if the outcome matches our calculated value.

Monty Hall problem simulation (where host doesn’t know)
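A minimal sketch of this variant (illustrative names again; rounds where the host accidentally reveals the car are discarded, as noted above) might be:

```python
import random

def play_round_ignorant_host(switch, n_doors=3):
    """One round where the host opens a random unpicked door.
    Returns True/False for win/loss, or None if the car was accidentally revealed."""
    doors = list(range(n_doors))
    car = random.choice(doors)
    pick = random.choice(doors)
    opened = random.choice([d for d in doors if d != pick])  # host picks blindly
    if opened == car:
        return None  # car revealed: this round doesn't count
    if switch:
        pick = random.choice([d for d in doors if d not in (pick, opened)])
    return pick == car

n_rounds = 10_000
for switch in (True, False):
    outcomes = [play_round_ignorant_host(switch) for _ in range(n_rounds)]
    valid = [o for o in outcomes if o is not None]   # keep only goat-revealed rounds
    print(f"switch={switch}: win rate ~ {sum(valid) / len(valid):.3f}")
# Both should come out near 1/2.
```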
Image provided by author

As you can see, the simulation results match the calculations.

Hooray! By now you must have gotten the gist of how to design a simulation for a probability problem.

We will go over some more examples in Part 2 of this post. I will also be sharing a link to the entire Jupyter notebook after Part 2.

Thanks for reading. Please feel free to leave comments, questions or suggestions. Also, any new ideas for the next post are welcome.

PS: This post was motivated by a discussion that I recently had with my data science colleague on a probability distribution problem that she was trying to solve.
