
Mastering Monte Carlo: How To Simulate Your Way to Better Machine Learning Models

How a Scientist Playing Solitaire Forever Changed the Game of Statistics

Applying Probabilistic Approaches to Enhance Predictive Algorithms Through Simulation Techniques

"At the Roulette Table in Monte Carlo" by Edvard Munch (1892)
"At the Roulette Table in Monte Carlo" by Edvard Munch (1892)

In the tumultuous year of 1945, as the world was gripped by what would be the final throes of World War II, a game of solitaire quietly sparked an advancement in the realm of computation. This was no ordinary game, mind you, but one that would lead to the birth of the Monte Carlo method(1). The player? None other than scientist Stanislaw Ulam, who was also deeply involved in the Manhattan Project(2). Ulam, while convalescing from an illness, found himself engrossed in solitaire. The complex probabilities of the game intrigued him, and he realized that simulating the game repeatedly could provide a good approximation of these probabilities(3). It was a lightbulb moment, akin to Newton’s apple, but with playing cards instead of fruit. Ulam then discussed these ideas with his colleague John von Neumann, and together they formalized the Monte Carlo method, named after the famed Monte Carlo Casino in Monaco (portrayed in Edvard Munch’s famous painting shown above), where the stakes are high and chance rules – much like the method itself.

Fast forward to the present day, and the Monte Carlo method has become an ace up the sleeve in the world of machine learning, including applications in reinforcement learning, Bayesian filtering, and the optimization of intricate models(4). Its robustness and versatility have ensured its continued relevance, more than seven decades after its inception. From Ulam’s solitaire games to the sophisticated AI applications of today, the Monte Carlo method remains a testament to the power of simulation and approximation in tackling complex systems.

Playing Your Cards Right With Probabilistic Simulations

In the intricate world of Data Science and machine learning, Monte Carlo simulations are akin to a well-calculated wager. This statistical technique allows us to place strategic bets in the face of uncertainty, making probabilistic sense of complex, deterministic problems. In this article, we’ll demystify Monte Carlo simulations and explore their powerful applications in statistics and machine learning.

We’ll start by taking a deep dive into the theory behind Monte Carlo simulations, illuminating the principles that make this technique a powerful tool for problem-solving. We’ll work through some hands-on applications in Python, demonstrating how Monte Carlo simulations can be implemented in practice.

Next, we’ll explore how Monte Carlo simulations can be used to optimize Machine Learning models. We’ll focus on the often challenging task of hyperparameter tuning, providing a practical toolkit for navigating this complex landscape.

So, place your bets, and let’s get started!

Understanding Monte Carlo Simulations

Monte Carlo simulations are an invaluable technique for mathematicians and data scientists. These simulations provide a methodology for navigating through an extensive and complex array of possibilities, formulating educated assumptions, and progressively refining choices until the most suitable solution emerges.

The way it works is this: we generate a vast number of random scenarios following a predefined process, then scrutinize these scenarios to estimate the probability of various outcomes. Here’s an analogy to make this clearer: consider each scenario as a turn in the popular Hasbro board game "Clue". For those unfamiliar, "Clue" is a detective-style game where players move around a mansion, gathering evidence to deduce the details of a crime – the who, what, and where. Each turn, or question asked, eliminates potential answers and brings the players closer to revealing the true crime scenario. Similarly, each simulation in a Monte Carlo study provides insights that bring us closer to the solution of our complex problem.

In the realm of machine learning, these "scenarios" can represent different model configurations, varied sets of hyperparameters, alternative ways of splitting a dataset into training and test sets, and many other applications. By assessing the outcomes of these scenarios, we can glean valuable insights into the behavior of our machine learning algorithm, enabling informed decisions about its optimization.

A Game of Darts

To understand Monte Carlo simulations, imagine you’re playing a game of darts. But instead of aiming for a specific target, you’re blindfolded and throwing darts randomly at a large square dartboard. Inside this square, there’s a circular target. Your goal is to estimate the value of pi, the ratio of the circle’s circumference to its diameter.

Sounds impossible, right? But here’s the trick: the ratio of the area of the circle to the area of the square is pi/4. So, if you throw a large number of darts, the ratio of darts landing inside the circle to the total number of darts should be approximately pi/4. Multiply this ratio by 4, and you have an estimate of pi!
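
Before we get to the full, plot-heavy version of this experiment below, here is a minimal sketch of the core estimate in plain Python. The variable names are mine; the same logic reappears, with visualization, in the Monte Carlo code later in this section.

# Minimal sketch: estimate pi by "throwing" random darts at a 2x2 square
# that contains a circle of radius 1 centered at the origin
import random

num_darts = 100000
darts_in_circle = 0

for _ in range(num_darts):
    # A random point in the square [-1, 1] x [-1, 1]
    x, y = random.uniform(-1, 1), random.uniform(-1, 1)
    # The dart is inside the circle if its distance from the origin is at most 1
    if x**2 + y**2 <= 1:
        darts_in_circle += 1

# The fraction of darts inside the circle approximates pi/4, so multiply by 4
pi_estimate = 4 * darts_in_circle / num_darts
print(pi_estimate)  # typically close to 3.14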

Random Guessing vs. Monte Carlo

To illustrate the power of Monte Carlo simulations, let’s compare them with a simpler method, perhaps the simplest of all: random guessing.

When you run the code below for both cases (random and Monte Carlo), you’ll get a different set of predictions each time. This is to be expected, because the darts are thrown randomly. This is a key feature of Monte Carlo simulations: they are inherently stochastic, or random. But despite this randomness, they can provide very accurate estimates when used properly. Thus, while your figures will not look exactly like mine, they’ll tell the same story.
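
If you’d like your runs to be repeatable, a small optional addition is to seed the random number generators before running the code below. The seed value is arbitrary; any integer works.

# Optional: seed the random number generators for reproducible results
import random
import numpy as np

random.seed(42)     # seeds Python's built-in random module
np.random.seed(42)  # seeds NumPy's legacy global generator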

In the first set of visualizations (Figure 1a to Figure 1f), we’re making a series of random guesses for the value of pi, each time generating a circle based on the guessed value. Let’s give this "randomness" a push in the right direction, and assume that while we cannot remember the exact value of pi, we know it is higher than 2 and lower than 4. As you can see from the resulting figures, the size of the circle varies widely depending on the guess, demonstrating the inaccuracy of this approach (which shouldn’t come as a surprise). The green circle in each figure represents the unit circle, the "real" circle we’re trying to estimate. The blue circle is based on our random guess.

#Random Guessing of Pi

# Before running this code, make sure you have the necessary packages installed.
# You can install them using "pip install" on your terminal or "conda install" if using a conda environment

# Import necessary libraries
import random
import plotly.graph_objects as go
import numpy as np

# Number of guesses to make. Adjust this for more guesses and subsequent plots
num_guesses = 6

# Generate the coordinates for the unit circle
# We use np.linspace to generate evenly spaced numbers over the range from 0 to 2*pi.
# These represent the angles in the unit circle.
theta = np.linspace(0, 2*np.pi, 100)

# The x and y coordinates of the unit circle are then the cosine and sine of these angles, respectively.
unit_circle_x = np.cos(theta)
unit_circle_y = np.sin(theta)

# We'll make a number of guesses for the value of pi
for i in range(num_guesses):
    # Make a random guess for pi between 2 and 4
    pi_guess = random.uniform(2, 4)

    # Generate the coordinates for the circle based on the guess
    # The radius of the circle is the guessed value of pi divided by 4.
    radius = pi_guess / 4

    # The x and y coordinates of the circle are then the radius times the cosine and sine of the angles, respectively.
    circle_x = radius * np.cos(theta)
    circle_y = radius * np.sin(theta)

    # Create a scatter plot of the circle
    fig = go.Figure()

    # Add the circle to the plot
    # We use a Scatter trace with mode 'lines' to draw the circle.
    fig.add_trace(go.Scatter(
        x = circle_x,
        y = circle_y,
        mode='lines',
        line=dict(
            color='blue',
            width=3
        ),
        name='Estimated Circle'
    ))

    # Add the unit circle to the plot
    fig.add_trace(go.Scatter(
        x = unit_circle_x,
        y = unit_circle_y,
        mode='lines',
        line=dict(
            color='green',
            width=3
        ),
        name='Unit Circle'
    ))

    # Update the layout of the plot
    # We set the title to include the guessed value of pi, and adjust the size and axis ranges to properly display the circles.
    fig.update_layout(
        title=f"Fig1{chr(97 + i)}: Randomly Guessing Pi: {pi_guess}",
        width=600,
        height=600,
        xaxis=dict(
            constrain="domain",
            range=[-1, 1]
        ),
        yaxis=dict(
            scaleanchor="x",
            scaleratio=1,
            range=[-1, 1]
        )
    )

    # Display the plots
    fig.show()

Figures 1a-1f: Random Estimation of Pi

You may notice something odd: in the random guessing method, sometimes a guess closer to the real value of pi results in a circle farther from the unit circle. This apparent contradiction arises because we’re looking at the circles’ circumference, not their radius or area. The visual gap between the two circles represents the error in estimating the circumference of the circle based on the guess, not the circle as a whole.

In the second set of visualizations (Figure 2a to Figure 2f), we’re using the Monte Carlo method to estimate the value of pi. Instead of making a random guess, we’re throwing a large number of darts at a square and counting how many fall inside a circle inscribed in the square. The resulting estimate of pi is much more accurate, as you can see from the figures where the size of the circle is much closer to the actual unit circle. The green dots represent darts that landed inside the unit circle, and the red dots represent darts that landed outside.

#Monte Carlo Estimation of Pi

# Import our required libraries
import random
import math
import plotly.graph_objects as go
import plotly.io as pio
import numpy as np

# We'll simulate throwing darts at a dartboard to estimate pi. Let's throw 10,000 darts.
num_darts = 10000

# To keep track of darts that land in the circle.
darts_in_circle = 0

# We'll store the coordinates of darts that fall inside and outside the circle.
x_coords_in, y_coords_in, x_coords_out, y_coords_out = [], [], [], []

# We'll also record each intermediate pi estimate and the number of darts thrown at that point,
# so we can plot convergence later (used for Figure 2g).
pi_estimates, num_darts_thrown = [], []

# Let's generate 6 figures throughout the simulation. Therefore, we will create a new figure every 1,666 darts (10,000 divided by 6).
num_figures = 6
darts_per_figure = num_darts // num_figures

# Create a unit circle to compare our estimates against. Here, we use polar coordinates and convert to Cartesian.
theta = np.linspace(0, 2*np.pi, 100)
unit_circle_x = np.cos(theta)
unit_circle_y = np.sin(theta)

# We start throwing the darts (simulating random points inside a 2x2 square and checking if they fall inside the inscribed circle of radius 1).
for i in range(num_darts):
    # Generate random x, y coordinates between -1 and 1.
    x, y = random.uniform(-1, 1), random.uniform(-1, 1)

    # If a dart (point) is within a distance of 1 from the origin (0,0), it's inside the circle.
    if math.sqrt(x**2 + y**2) <= 1:
        darts_in_circle += 1
        x_coords_in.append(x)
        y_coords_in.append(y)
    else:
        x_coords_out.append(x)
        y_coords_out.append(y)

    # After every 1,666 darts, let's see how our estimate looks compared to the real unit circle.
    if (i + 1) % darts_per_figure == 0:
        # We estimate pi by seeing the proportion of darts that landed in the circle (out of the total number of darts).
        pi_estimate = 4 * darts_in_circle / (i + 1)

        # Record this checkpoint so we can plot convergence later (used for Figure 2g).
        pi_estimates.append(pi_estimate)
        num_darts_thrown.append(i + 1)

        # Now we create a circle from our estimate to compare visually with the unit circle.
        estimated_circle_radius = pi_estimate / 4
        estimated_circle_x = estimated_circle_radius * np.cos(theta)
        estimated_circle_y = estimated_circle_radius * np.sin(theta)

        # Plot the results using Plotly.
        fig = go.Figure()

        # Add the darts that landed inside and outside the circle to the plot.
        fig.add_trace(go.Scattergl(x=x_coords_in, y=y_coords_in, mode='markers', name='Darts Inside Circle', marker=dict(color='green', size=4, opacity=0.8)))
        fig.add_trace(go.Scattergl(x=x_coords_out, y=y_coords_out, mode='markers', name='Darts Outside Circle', marker=dict(color='red', size=4, opacity=0.8)))

        # Add the real unit circle and our estimated circle to the plot.
        fig.add_trace(go.Scatter(x=unit_circle_x, y=unit_circle_y, mode='lines', name='Unit Circle', line=dict(color='green', width=3)))
        fig.add_trace(go.Scatter(x=estimated_circle_x, y=estimated_circle_y, mode='lines', name='Estimated Circle', line=dict(color='blue', width=3)))

        # Customize the plot layout.
        fig.update_layout(title=f"Figure {chr(97 + (i + 1) // darts_per_figure - 1)}: Thrown Darts: {(i + 1)}, Estimated Pi: {pi_estimate}", width=600, height=600, xaxis=dict(constrain="domain", range=[-1, 1]), yaxis=dict(scaleanchor="x", scaleratio=1, range=[-1, 1]), legend=dict(yanchor="top", y=0.99, xanchor="left", x=0.01))

        # Display the plot.
        fig.show()

        # Save the plot as a PNG image file.
        pio.write_image(fig, f"fig2{chr(97 + (i + 1) // darts_per_figure - 1)}.png")
Figures 2a-2f: Monte Carlo Estimation of Pi

In the Monte Carlo method, the pi estimate is based on the proportion of "darts" that land inside the circle to the total number of darts thrown. The resulting estimated pi value is used to generate a circle. If the Monte Carlo estimate is inaccurate, the circle will again be the wrong size. The width of the gap between this estimated circle and the unit circle gives an indication of the accuracy of the Monte Carlo estimate.

However, because the Monte Carlo method generates more accurate estimates as the number of "darts" increases, the estimated circle should converge towards the unit circle as more "darts" are thrown. Therefore, while both methods show a gap when the estimate is inaccurate, this gap should decrease more consistently with the Monte Carlo method as the number of "darts" increases.

Predicting Pi: The Power of Probability

What makes Monte Carlo simulations so powerful is their ability to harness randomness to solve deterministic problems. By generating a large number of random scenarios and analyzing the results, we can estimate the probability of different outcomes, even for complex problems that would be difficult to solve analytically.

In the case of estimating pi, the Monte Carlo method allows us to make a very accurate estimate, even though we’re just throwing darts randomly. As discussed, the more darts we throw, the more accurate our estimate becomes. This is a demonstration of the law of large numbers, a fundamental concept in probability theory which states that the average of the results obtained from a large number of trials should be close to the expected value, and will tend to get closer as more trials are performed. Let’s see if this holds for our six examples shown in Figures 2a-2f by plotting the number of darts thrown against the difference between the Monte Carlo-estimated pi and the real pi. In general, our graph (Figure 2g) should trend downward. Here’s the code to accomplish this:

# Plot the convergence of our estimate (Figure 2g)
# This assumes the Monte Carlo block above has been run, so that pi_estimates and num_darts_thrown are populated

# Calculate the differences between the real pi and the estimated pi
diff_pi = [abs(estimate - math.pi) for estimate in pi_estimates]

# Create the figure for the number of darts vs difference in pi plot (Figure 2g)
fig2g = go.Figure(data=go.Scatter(x=num_darts_thrown, y=diff_pi, mode='lines'))

# Add title and labels to the plot
fig2g.update_layout(
    title="Fig2g: Darts Thrown vs Difference in Estimated Pi",
    xaxis_title="Number of Darts Thrown",
    yaxis_title="Difference in Pi",
)

# Display the plot
fig2g.show()

# Save the plot as a png
pio.write_image(fig2g, "fig2g.png")

Note that, even with only 6 examples, the general pattern is as expected: the more darts thrown (the more scenarios), the smaller the difference between the estimated and real values, and thus the better the prediction.

Let’s say we throw 1,000,000 total darts, and allow ourselves 500 predictions. In other words, we will record the difference between the estimated and actual values of pi at 500 evenly spaced intervals throughout the simulation of 1,000,000 thrown darts. Rather than generate 500 extra figures, let’s just skip to what we’re trying to confirm: whether it’s indeed true that as more darts are thrown, the difference in our predicted value of pi and real pi gets lower. We’ll use a scatter plot (Figure 2h):

#500 Monte Carlo Scenarios; 1,000,000 darts thrown
import random
import math
import plotly.graph_objects as go
import plotly.io as pio
import numpy as np

# Total number of darts to throw (1M)
num_darts = 1000000
darts_in_circle = 0

# Number of scenarios to record (500)
num_scenarios = 500
darts_per_scenario = num_darts // num_scenarios

# Lists to store the data for each scenario
darts_thrown_list = []
pi_diff_list = []

# We'll throw a number of darts
for i in range(num_darts):
    # Generate random x, y coordinates between -1 and 1
    x, y = random.uniform(-1, 1), random.uniform(-1, 1)

    # Check if the dart is inside the circle
    # A dart is inside the circle if the distance from the origin (0,0) is less than or equal to 1
    if math.sqrt(x**2 + y**2) <= 1:
        darts_in_circle += 1

    # If it's time to record a scenario
    if (i + 1) % darts_per_scenario == 0:
        # Estimate pi with Monte Carlo method
        # The estimate is 4 times the number of darts in the circle divided by the total number of darts
        pi_estimate = 4 * darts_in_circle / (i + 1)

        # Record the number of darts thrown and the difference between the estimated and actual values of pi
        darts_thrown_list.append((i + 1) / 1000)  # Dividing by 1000 to display in thousands
        pi_diff_list.append(abs(pi_estimate - math.pi))

# Create a scatter plot of the data
fig = go.Figure(data=go.Scattergl(x=darts_thrown_list, y=pi_diff_list, mode='markers'))

# Update the layout of the plot
fig.update_layout(
    title="Fig2h: Difference between Estimated and Actual Pi vs. Number of Darts Thrown (in thousands)",
    xaxis_title="Number of Darts Thrown (in thousands)",
    yaxis_title="Difference between Estimated and Actual Pi",
)

# Display the plot
fig.show()
# Save the plot as a png
pio.write_image(fig, "fig2h.png")

Monte Carlo Simulations and Hyperparameter Tuning: A Winning Combination

You might be thinking to yourself at this point, "Monte Carlo is an interesting statistical tool, but how does it apply to machine learning?" The short answer is: in many ways. One of the many applications of Monte Carlo simulations in machine learning is in the realm of hyperparameter tuning.

Hyperparameters are the knobs and dials that we (the humans) adjust when setting up machine learning algorithms. They control aspects of the algorithm’s behavior that, crucially, aren’t learned from the data. For example, in a decision tree, the maximum depth of the tree is a hyperparameter. In a neural network, the learning rate and the number of hidden layers are hyperparameters.

Choosing the right hyperparameters can make the difference between a model that performs poorly and one that performs excellently. But how do we know which hyperparameters to choose? This is where Monte Carlo simulations come in.

Traditionally, machine learning practitioners have used methods like grid search or random search to tune hyperparameters. These methods involve specifying a set of possible values for each hyperparameter, and then training and evaluating a model for every possible combination of hyperparameters. This can be computationally expensive and time-consuming, especially when there are many hyperparameters to tune or a large range of possible values each can take.

Monte Carlo simulations offer a more efficient alternative. Instead of exhaustively searching through all possible combinations of hyperparameters, we can randomly sample from the space of hyperparameters according to some probability distribution. This allows us to explore the hyperparameter space more efficiently and find good combinations of hyperparameters faster.
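
To make the idea of "sampling from a probability distribution" concrete, here is a rough sketch. The ranges mirror the ones used later in this article, but drawing ‘C’ continuously from a log-uniform distribution (rather than from a fixed grid) is an illustrative choice of mine, not part of the tuning code further down.

# Sketch: drawing hyperparameter candidates from probability distributions
# instead of enumerating a fixed grid
import numpy as np

rng = np.random.default_rng(0)

def sample_hyperparameters():
    # Regularization strength drawn log-uniformly between 1e-3 and 1e3
    C = 10 ** rng.uniform(-3, 3)
    # Penalty type drawn uniformly from the two options
    penalty = rng.choice(['l1', 'l2'])
    return {'C': C, 'penalty': penalty}

# Each call yields one random "scenario" we could train and evaluate
for _ in range(3):
    print(sample_hyperparameters())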

In the next section, we’ll use a real dataset to demonstrate how to use Monte Carlo simulations for hyperparameter tuning in practice. Let’s get started!

Monte Carlo Simulations for Hyperparameter Tuning

The Heartbeat of Our Experiment: The Heart Disease Dataset

In the world of machine learning, data is the lifeblood that powers our models. For our exploration of Monte Carlo simulations in hyperparameter tuning, let’s look at a dataset that’s close to the heart – quite literally. The Heart Disease dataset (CC BY 4.0) from the UCI Machine Learning Repository is a collection of medical records from patients, some of whom have heart disease.

The dataset contains 14 attributes, including age, sex, chest pain type, resting blood pressure, cholesterol levels, fasting blood sugar, and others. The target variable is the presence of heart disease, making this a binary classification task. With a mix of categorical and numerical features, it’s an interesting dataset for demonstrating hyperparameter tuning.

First, let’s take a look at our dataset to get a sense of what we’ll be working with – always a good place to start.


#Load and view first few rows of dataset

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
import numpy as np
import plotly.graph_objects as go

# Load the dataset
# The dataset is available at the UCI Machine Learning Repository
# It's a dataset about heart disease and includes various patient measurements
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"

# Define the column names for the dataframe
column_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# Load the dataset into a pandas dataframe
# We specify the column names and also tell pandas to treat '?' as NaN
df = pd.read_csv(url, names=column_names, na_values="?")

# Print the first few rows of the dataframe
# This gives us a quick overview of the data
print(df.head())

This shows us the first few rows of our dataset across all columns. If you’ve loaded the right csv and named your columns as I have, your output will look like Figure 3.

Figure 3: First 4 rows of data from our dataset

Setting the Pulse: Preprocessing the Data

Before we can use the Heart Disease dataset for hyperparameter tuning, we need to preprocess the data. This involves several steps:

  1. Handling missing values: Some records in the dataset have missing values. We’ll need to decide how to handle these, whether by deleting the records, filling in the missing values, or some other method.
  2. Encoding categorical variables: Many machine learning algorithms require input data to be numerical. We’ll need to convert categorical variables into a numerical format.
  3. Normalizing numerical features: Machine learning algorithms often perform better when numerical features are on a similar scale. We’ll apply normalization to adjust the scale of these features.

Let’s start by handling missing values. In our Heart Disease dataset, we have a few missing values in the ‘ca’ and ‘thal’ columns. We’ll fill these missing values with the median of the respective column. This is a common strategy for dealing with missing data, as it doesn’t drastically affect the distribution of the data.

Next, we’ll encode the categorical variables. In our dataset, the ‘cp’, ‘restecg’, ‘slope’, ‘ca’, and ‘thal’ columns are categorical. We’ll use label encoding to convert these categorical variables into numerical ones. Label encoding assigns each unique category in a column to a different integer.

Finally, we’ll normalize the numerical features. Normalization adjusts the scale of numerical features so that they all fall within a similar range. This can help improve the performance of many machine learning algorithms. We’ll use standard scaling for normalization, which transforms the data to have a mean of 0 and a standard deviation of 1.

Here’s the Python code that performs all of these preprocessing steps:

# Preprocess

# Import necessary libraries
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

# Identify missing values in the dataset
# This will print the number of missing values in each column
print(df.isnull().sum())

# Fill missing values with the median of the column
# The SimpleImputer class from sklearn provides basic strategies for imputing missing values
# We're using the 'median' strategy, which replaces missing values with the median of each column
imputer = SimpleImputer(strategy='median')

# Apply the imputer to the dataframe
# The result is a new dataframe where missing values have been filled in
df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Print the first few rows of the filled dataframe
# This gives us a quick check to make sure the imputation worked correctly
print(df_filled.head())

# Identify categorical variables in the dataset
# These are variables that contain non-numerical data
categorical_vars = df_filled.select_dtypes(include='object').columns

# Encode categorical variables
# The LabelEncoder class from sklearn converts each unique string into a unique integer
encoder = LabelEncoder()
for var in categorical_vars:
    df_filled[var] = encoder.fit_transform(df_filled[var])

# Normalize numerical features
# The StandardScaler class from sklearn standardizes features by removing the mean and scaling to unit variance
scaler = StandardScaler()

# Apply the scaler to the dataframe
# The result is a new dataframe where numerical features have been normalized
df_normalized = pd.DataFrame(scaler.fit_transform(df_filled), columns=df_filled.columns)

# Print the first few rows of the normalized dataframe
# This gives us a quick check to make sure the normalization worked correctly
print(df_normalized.head())

The first print statement shows us the number of missing values in each column of the original dataset. In our case, the ‘ca’ and ‘thal’ columns had a few missing values.

The second print statement shows us the first few rows of the dataset after filling in the missing values. As discussed, we used the median of each column to fill in the missing values.

The third print statement shows us the first few rows of the dataset after encoding the categorical variables. After this step, all the variables in our dataset are numerical.

The final print statement shows us the first few rows of the dataset after normalizing the numerical features, in which the data will have a mean of 0 and a standard deviation of 1. After this step, all the numerical features in our dataset are on a similar scale. Check that your output resembles Figure 4:

Figure 4: Preprocessing Print Statements Output

After running this code, we have a preprocessed dataset that’s ready for modeling.

Implementing a Basic Machine Learning Model

Now that we’ve preprocessed our data, we’re ready to implement a basic machine learning model. This will serve as our baseline model, which we’ll later try to improve through hyperparameter tuning.

We’ll use a simple logistic regression model for this task. Note that while it’s called "regression," this is actually one of the most popular algorithms for binary classification problems, like the one we’re dealing with in the Heart Disease dataset. It’s a linear model that predicts the probability of the positive class.

After training our model, we’ll evaluate its performance using two common metrics: accuracy and ROC-AUC. Accuracy is the proportion of correct predictions out of all predictions, while ROC-AUC (Area Under the Receiver Operating Characteristic Curve) summarizes the trade-off between the true positive rate and the false positive rate across classification thresholds.
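
For reference, here is a tiny, self-contained illustration of how these two metrics are computed with scikit-learn. The labels and predictions below are made up purely for demonstration; they are not from the Heart Disease dataset. Note that roc_auc_score can also accept predicted probabilities, in which case it measures ranking quality across all thresholds.

# Tiny illustration of accuracy and ROC-AUC with made-up values
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]               # made-up ground-truth labels
y_pred = [0, 1, 1, 1, 0, 0]               # made-up hard class predictions
y_prob = [0.2, 0.6, 0.9, 0.7, 0.4, 0.1]   # made-up predicted probabilities for class 1

print(accuracy_score(y_true, y_pred))  # proportion of correct predictions
print(roc_auc_score(y_true, y_prob))   # area under the ROC curve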

But what does this have to do with Monte Carlo simulations? Well, machine learning models like logistic regression have several hyperparameters that can be tuned to improve performance. However, finding the best set of hyperparameters can be like searching for a needle in a haystack. This is where Monte Carlo simulations come in. By randomly sampling different sets of hyperparameters and evaluating their performance, we can estimate the probability distribution of good hyperparameters and make an educated guess about the best ones to use, similarly to how we picked better values of pi in our dart-throwing exercise.

Here’s the Python code that implements and evaluates a basic logistic regression model for our newly pre-processed data:

# Logistic Regression Model - Baseline

# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Replace the 'target' column in the normalized DataFrame with the original 'target' column
# This is done because the 'target' column was also normalized, which is not what we want
df_normalized['target'] = df['target']

# Binarize the 'target' column
# This is done because the original 'target' column contains values from 0 to 4
# We want to simplify the problem to a binary classification problem: heart disease or no heart disease
df_normalized['target'] = df_normalized['target'].apply(lambda x: 1 if x > 0 else 0)

# Split the data into training and test sets
# The 'target' column is our label, so we drop it from our features (X)
# We use a test size of 20%, meaning 80% of the data will be used for training and 20% for testing
X = df_normalized.drop('target', axis=1)
y = df_normalized['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Implement a basic logistic regression model
# Logistic Regression is a simple yet powerful linear model for binary classification problems
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
# The model has been trained, so we can now use it to make predictions on unseen data
y_pred = model.predict(X_test)

# Evaluate the model
# We use accuracy (the proportion of correct predictions) and ROC-AUC (a measure of how well the model distinguishes between classes) as our metrics
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred)

# Print the performance metrics
# These give us an indication of how well our model is performing
print("Baseline Model " + f'Accuracy: {accuracy}')
print("Baseline Model " + f'ROC-AUC: {roc_auc}')

With an accuracy of 0.885 and an ROC-AUC score of 0.884, our basic logistic regression model has set a solid baseline for us to improve upon. These metrics indicate that our model is performing quite well at distinguishing between patients with and without heart disease. Let’s see if we can make it better.

Hyperparameter Tuning with Grid Search

In machine learning, a model’s performance can often be improved by tuning its hyperparameters. Hyperparameters are parameters that are not learned from the data, but are set prior to the start of the learning process. For example, in logistic regression, the regularization strength ‘C’ and the type of penalty ‘l1’ or ‘l2’ are hyperparameters.

Let’s perform hyperparameter tuning on our logistic regression model using grid search. We’ll tune the ‘C’ and ‘penalty’ hyperparameters, and we’ll use ROC-AUC as our scoring metric. Let’s see if we can beat our baseline model’s performance.

Now, let’s start with the Python code for this section.

# Grid Search

# Import necessary libraries
from sklearn.model_selection import GridSearchCV

# Define the hyperparameters and their values
# 'C' is the inverse of regularization strength (smaller values specify stronger regularization)
# 'penalty' specifies the norm used in the penalization (l1 or l2)
hyperparameters = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 
                   'penalty': ['l1', 'l2']}

# Implement grid search
# GridSearchCV is a method used to tune our model's hyperparameters
# We pass our model, the hyperparameters to tune, and the number of folds for cross-validation
# We're using ROC-AUC as our scoring metric
# We specify the 'liblinear' solver because it supports both L1 and L2 penalties
grid_search = GridSearchCV(LogisticRegression(solver='liblinear'), hyperparameters, cv=5, scoring='roc_auc')
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
# GridSearchCV has found the best hyperparameters for our model, so we print them out
best_params = grid_search.best_params_
print(f'Best hyperparameters: {best_params}')

# Evaluate the best model
# GridSearchCV also gives us the best model, so we can use it to make predictions and evaluate its performance
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test)
accuracy_best = accuracy_score(y_test, y_pred_best)
roc_auc_best = roc_auc_score(y_test, y_pred_best)

# Print the performance metrics of the best model
# These give us an indication of how well our model is performing after hyperparameter tuning
print("Grid Search Method " + f'Accuracy of the best model: {accuracy_best}')
print("Grid Search Method " + f'ROC-AUC of the best model: {roc_auc_best}')

With the best hyperparameters found to be {‘C’: 0.1, ‘penalty’: ‘l2’}, our grid search has an accuracy of 0.852 and an ROC-AUC score of 0.853 for the best model. Interestingly, this performance is slightly lower than our baseline model. This could be due to the fact that our baseline model’s hyperparameters were already well-suited to this particular dataset, or it could be a result of the randomness inherent in the train-test split. Regardless, it’s a valuable reminder that more complex models and techniques are not always better.

However, you might have also noticed that our grid search only explored a relatively small number of possible hyperparameter combinations: 7 values of ‘C’ times 2 penalty options gives just 14 combinations, each fit 5 times for cross-validation. In practice, the number of hyperparameters and their potential values can be much larger, making grid search computationally expensive or even infeasible.

This is where the Monte Carlo method comes in. Let’s see if this more guided approach improves on either the original baseline or grid search-based model’s performance:

#Monte Carlo

# Import necessary libraries
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the range of hyperparameters
# 'C' is the inverse of regularization strength (smaller values specify stronger regularization)
# 'penalty' specifies the norm used in the penalization (l1 or l2)
C_range = np.logspace(-3, 3, 7)
penalty_options = ['l1', 'l2']

# Initialize variables to store the best score and hyperparameters
best_score = 0
best_hyperparams = None

# Perform the Monte Carlo simulation
# We're going to perform 1000 iterations. You can play with this number to see how the performance changes.
# Remember the Law of Large Numbers!
for _ in range(1000):  

    # Randomly select hyperparameters from the defined range
    C = np.random.choice(C_range)
    penalty = np.random.choice(penalty_options)

    # Create and evaluate the model with these hyperparameters
    # We're using 'liblinear' solver as it supports both L1 and L2 regularization
    model = LogisticRegression(C=C, penalty=penalty, solver='liblinear')
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Calculate the accuracy and ROC-AUC
    accuracy = accuracy_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred)

    # If this model's ROC-AUC is the best so far, store its score and hyperparameters
    if roc_auc > best_score:
        best_score = roc_auc
        best_hyperparams = {'C': C, 'penalty': penalty}

# Print the best score and hyperparameters
print("Monte Carlo Method " + f'Best ROC-AUC: {best_score}')
print("Monte Carlo Method " + f'Best hyperparameters: {best_hyperparams}')

# Train the model with the best hyperparameters
best_model = LogisticRegression(**best_hyperparams, solver='liblinear')
best_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = best_model.predict(X_test)

# Calculate and print the accuracy of the best model
accuracy = accuracy_score(y_test, y_pred)
print("Monte Carlo Method " + f'Accuracy of the best model: {accuracy}')

In the Monte Carlo method, we found that the best ROC-AUC score was 0.9014, with the best hyperparameters being {‘C’: 0.1, ‘penalty’: ‘l1’}. The accuracy of the best model was 0.9016.

A Tale of Two Techniques: Grid Search vs. Monte Carlo

Grid search selects the "best" hyperparameters by evaluating the performance of all possible combinations in the hyperparameter space on the training set, and selecting the combination that yields the best average performance according to a predefined metric (e.g., accuracy, ROC-AUC, etc.). This involves splitting the training data into several subsets (or "folds", which we’ve set to 5 in our code), training the model on some of the folds and evaluating it on the remaining fold, and then averaging the performance across all the folds. This is known as cross-validation and helps to reduce overfitting by ensuring that the model performs well on average across different subsets of the training data.

In contrast, the Monte Carlo method does not perform any averaging across different subsets of the training data. It selects hyperparameters randomly, evaluates the model on the entire training set, and then selects the hyperparameters that yield the best performance on the test set. This approach does not use any cross-validation and hence does not average the performance across different subsets of the training data.

In the above experiment, even though the grid search method evaluated the combination selected by the Monte Carlo method, it did not select it as the "best" hyperparameters because it likely did not yield the best average performance across the different subsets of the training data during the cross-validation process. However, the combination selected by the Monte Carlo method happened to yield better performance on the specific test set used in this experiment. This suggests that the selected hyperparameters perform well on the specific characteristics of the test set, even though they did not perform the best on average across different subsets of the training data.

This highlights the trade-off between the bias introduced by averaging across different subsets of the training data in the grid search method, and the variance introduced by evaluating the model on the entire training set and selecting the hyperparameters based on a single test set in the Monte Carlo method.
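
One way to explore that trade-off yourself is to keep the random sampling but score each randomly drawn configuration with cross-validation on the training set, so the test set stays untouched until the very end. Here is a minimal sketch of that hybrid; it assumes X_train, y_train, C_range, and penalty_options from the Monte Carlo code above, and the iteration count is an arbitrary choice.

# Sketch: Monte Carlo sampling of hyperparameters, scored with 5-fold
# cross-validation on the training set rather than on the test set
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import numpy as np

best_cv_score = 0
best_cv_hyperparams = None

# Fewer iterations than before, since each one now trains 5 models
for _ in range(100):
    C = np.random.choice(C_range)
    penalty = np.random.choice(penalty_options)

    # 'liblinear' supports both L1 and L2 regularization
    model = LogisticRegression(C=C, penalty=penalty, solver='liblinear')

    # Average ROC-AUC across 5 folds of the training data
    cv_score = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc').mean()

    if cv_score > best_cv_score:
        best_cv_score = cv_score
        best_cv_hyperparams = {'C': C, 'penalty': penalty}

print(f'Best cross-validated ROC-AUC: {best_cv_score}')
print(f'Best hyperparameters: {best_cv_hyperparams}')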

I encourage you to tinker with the Python code and observe how different factors impact the performance. You can also compare the computation time between the two methods with different hyperparameter spaces to understand their efficiency. This exercise aims to demonstrate the dynamics of these methods, which have their merits and limitations, and to highlight that the "best" method may depend on your specific scenario, computational resources, and model requirements. So, feel free to experiment and happy learning!

Conclusion

The Monte Carlo method, born from a game of solitaire, has undoubtedly reshaped the landscape of computational mathematics and data science. Its power lies in its simplicity and versatility, allowing us to tackle complex, high-dimensional problems with relative ease. From estimating the value of pi with a game of darts to tuning hyperparameters in machine learning models, Monte Carlo simulations have proven to be an invaluable tool in our data science arsenal.

In this article, we’ve journeyed from the origins of the Monte Carlo method, through its theoretical underpinnings, and into its practical applications in machine learning. We’ve seen how it can be used to optimize machine learning models, with a hands-on exploration of hyperparameter tuning using a real-world dataset. We’ve also compared it with other methods, demonstrating its efficiency and effectiveness.

But the story of Monte Carlo is far from over. As we continue to push the boundaries of machine learning and data science, the Monte Carlo method will undoubtedly continue to play a crucial role. Whether we’re developing sophisticated AI applications, making sense of complex data, or simply playing a game of solitaire, the Monte Carlo method is a testament to the power of simulation and approximation in solving complex problems.

As we move forward, let’s take a moment to appreciate the beauty of this method – a method that has its roots in a simple card game, yet has the power to drive some of the most advanced computations in the world. The Monte Carlo method truly is a high-stakes game of chance and complexity, and so far, it seems, the house always wins. So, keep shuffling the deck, keep playing your cards, and remember – in the game of data science, Monte Carlo could just be your ace in the hole.

Parting Thoughts

Congratulations on making it to the end! We’ve journeyed through the world of probabilities, wrestled with complex models, and emerged with a newfound appreciation for the power of Monte Carlo simulations. We’ve seen them in action, simplifying intricate problems into manageable components, and even optimizing hyperparameters for machine learning tasks.

If you enjoy diving into the intricacies of ML problem-solving as much as I do, follow me on Medium and LinkedIn. Together, let’s navigate the AI labyrinth, one clever solution at a time.

Until our next statistical adventure, keep exploring, keep learning, and keep simulating! And in your data science and ML journey, may the odds be ever in your favor.

Note: All images, unless otherwise noted, are by the author.
