
Reinforcement Learning: Deep Q-Networks

Teaching a shuttle to land on the moon using Deep Q-Networks in Python. A mathematical deep dive into Reinforcement Learning.

Image generated by DALL-E

In reinforcement learning (RL), Q-learning is a foundational algorithm that helps an agent navigate its environment by learning a policy to maximize cumulative rewards. It does this by updating an action-value function, which estimates the expected utility of taking a specific action in a given state, based on received rewards and future estimates (if this doesn’t sound familiar, don’t worry; we will go over it together shortly).

However, traditional Q-learning has its challenges. It struggles with scalability as the state space grows and is less effective in environments with continuous state and action spaces. This is where Deep Q-Networks (DQNs) come in. DQNs use neural networks to approximate the Q-values, enabling agents to handle larger and more complex environments effectively.

In this article, we’ll dive into Deep Q-Networks. We’ll explore how DQNs overcome the limitations of traditional Q-learning and discuss the key components that make up a DQN. We’ll also walk through implementing a DQN from scratch and applying it to a more complex environment. By the end of this article, you’ll have a solid understanding of how DQNs work and how to use them to solve challenging RL problems.


Index

1: Traditional Q-Learning
  1.1: States and Actions
  1.2: Q-Values
  1.3: The Q-Table
  1.4: Learning Process

2: From Q-Learning to Deep Q-Networks
  2.1: Limitations of Traditional Q-Learning
  2.2: Neural Networks

3: The Anatomy of a Deep Q-Network
  3.1: Components of a DQN
  3.2: The DQN Algorithm

4: Implementing a Deep Q-Network from Scratch
  4.1: Setting up the Environment
  4.2: Building the Deep Neural Network
  4.3: Implementing Experience Replay
  4.4: Implementing the Target Network
  4.5: Training the Deep Q-Network
  4.6: Tuning the Model
  4.7: Running the Model

5: Conclusion

References


1: Traditional Q-Learning

Image generated by DALL-E

Q-learning guides an agent to learn the best actions to maximize cumulative rewards in an environment. Before diving into Deep Q-Networks, it’s worth briefly reviewing the mechanisms behind their predecessor, Q-learning.

1.1: States and Actions

Imagine you’re a robot navigating a maze. Every position you occupy in the maze is called a "state." Each possible move you can make, like moving left, right, up, or down, is an "action." The goal is to figure out which action to take in each state to eventually find the best path through the maze.

1.2: Q-Values

The heart of Q-Learning is the Q-value, denoted as Q(s, a). This value represents the expected future rewards for taking a specific action a in a particular state s, and then following the best possible path (policy) thereafter.

Think of Q-values as entries in a guidebook that rate the long-term benefits of each possible move. For example, if you’re in a specific spot in the maze and consider moving left, the Q-value tells you how beneficial that move is expected to be in terms of future rewards. A higher Q-value indicates a better move.

1.3: The Q-Table

Q-Learning uses a Q-table to keep track of these Q-values. The Q-table is essentially a large spreadsheet where each row corresponds to a state, each column corresponds to an action, and each cell contains the Q-value for that state-action pair.

Imagine the Q-table as a giant spreadsheet where each cell represents the potential future rewards of making a specific move from a specific position in the maze. As you learn more about the environment, you update this spreadsheet with better estimates of these rewards.

1.4: Learning Process

The learning process in Q-Learning is iterative. The agent begins in an initial state s and then decides on an action a. This choice can be based on:

  • Exploration: Trying out new actions to discover their effects.
  • Exploitation: Using existing knowledge to select the action with the highest known Q-value.

Perform the chosen action, observe the reward r, and move to the next state s′. Update the Q-value for the state-action pair (s, a) using the Q-Learning formula:

Q-Value update formula – Image by Author
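Written out, the update rule is:

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]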

Here:

  • α is the learning rate, which determines how much new information overrides old information.
  • γ is the discount factor, which values immediate rewards more highly than distant future rewards.
  • max_a′ Q(s′, a′) represents the highest Q-value for the next state s′ across all possible actions a′.

Imagine you’re constantly updating your guidebook. After each move, you get feedback on how good or bad that move was (the reward). You then adjust the rating (Q-value) in your guidebook to reflect this new information, making your future decisions better informed.

Continue this process repeatedly until the Q-values converge, meaning the agent has learned the optimal policy for navigating the maze. Over time, by repeatedly exploring the maze and updating your guidebook based on your experiences, you develop a comprehensive strategy that tells you the best move to make in any given position to maximize your rewards.
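To make this concrete, here is a minimal tabular Q-learning sketch in Python. It uses Gym’s FrozenLake-v1 environment as a stand-in for the maze (any small discrete environment would do), and the hyperparameters are illustrative placeholders rather than tuned values:

import gym
import numpy as np

# Minimal tabular Q-learning on FrozenLake-v1 (a small discrete grid world)
env = gym.make('FrozenLake-v1')
Q = np.zeros((env.observation_space.n, env.action_space.n))  # the Q-table
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # illustrative values

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Q-learning update: nudge Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state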

For a deeper dive into Q-Learning, take a look at this article:

Reinforcement Learning 101: Q-Learning

2: From Q-Learning to Deep Q-Networks

2.1: Limitations of Traditional Q-Learning

While Q-learning is a powerful algorithm for reinforcement learning, it has several limitations that hinder its effectiveness in more complex environments:

Scalability Issues: Traditional Q-Learning maintains a Q-table where each state-action pair is mapped to a Q-value. As the state space grows, especially in high-dimensional or continuous environments, the Q-table becomes impractically large, leading to memory inefficiency and slow learning processes.

Discrete State and Action Spaces: Q-Learning works well with environments where the states and actions are discrete and finite. However, many real-world problems involve continuous state and action spaces, which traditional Q-Learning cannot handle efficiently without discretizing them; that discretization can lead to a loss of information and suboptimal policies.

2.2: Neural Networks

Let’s now introduce neural networks, which play a pivotal role in Deep Q-Networks. Inspired by the structure and function of the human brain, neural networks are powerful function approximators capable of learning complex patterns from data. They consist of layers of interconnected nodes (neurons) that process input data and transform it through weights and biases to produce an output.

In the context of reinforcement learning, neural networks can be used to approximate the Q-function, which maps state-action pairs to Q-values. This allows the agent to generalize better across states and actions, especially in large or continuous spaces where maintaining a Q-table is not feasible.

Therefore, Deep Q-networks (DQNs) combine the principles of Q-Learning with the function approximation capabilities of neural networks. By doing so, they address the key limitations of traditional Q-learning.

Instead of storing Q-values in a table, DQNs use a neural network to approximate the Q-function. This network takes the state as input and outputs Q-values for all possible actions. By training the network with experiences from the environment, the agent learns to predict the expected rewards for each action, effectively generalizing across a large number of states and actions.

Imagine you’re learning to play chess. Instead of memorizing every possible board configuration and the best move for each (which is impossible), you learn general strategies and principles (like controlling the center of the board and protecting the king). Similarly, a DQN learns general patterns and strategies through the neural network, allowing it to make informed decisions without having to memorize every possible state.

The use of neural networks allows DQNs to handle environments with large or continuous state spaces. The network can learn representations of the state space that capture essential features, enabling the agent to make informed decisions without the need to discretize the space. Consider trying to navigate a large city. Instead of memorizing the layout of every street and building (which would be like a huge Q-table), you learn to recognize landmarks and major roads that help you find your way around. The neural network in a DQN works similarly, learning to recognize important features of the state space that help the agent navigate complex environments.

By training on a wide variety of experiences, the model learns to generalize from past experiences. This means the agent can apply what it has learned to new, unseen states and actions, making it more adaptable and efficient in different situations.

3: The Anatomy of a Deep Q-Network

3.1: Components of a DQN

To understand how Deep Q-Networks (DQNs) function, it’s essential to break down their key components:

3.1.1: Neural Networks

Feed-forward Neural Network – Image by Author

At the core of a DQN is a neural network, which serves as a function approximator for the Q-values. The architecture typically includes:

Input Layer: Imagine this as the "eyes" of the agent. It receives the state representation from the environment, similar to how your eyes take in the visual information around you. This is the first layer on the left with two nodes in the image above.

Hidden Layers: Think of these layers as the "brain" of the agent. They process the information received by the eyes through multiple stages of thinking, identifying complex features and patterns, much like how your brain processes and understands the world. This is the middle layer with three nodes in the image above.

Output Layer: This is like the "decision-making" part of the agent. It produces Q-values for all possible actions given the input state, similar to how your brain decides the best action based on what you see and think. Each output corresponds to the expected reward for taking a specific action. This is the last layer on the right with one node in the image above.

The image above represents a simple feed-forward neural network, the most basic form of a neural network. While this structure is fundamental, it is not yet a "deep" neural network. To transform it into a deep neural network, we would add more hidden layers, increasing the network’s depth. Additionally, we can experiment with different architectures and configurations to develop more advanced models. It’s also important to note that the number of nodes in each layer is not fixed; it varies depending on the specific training dataset and task at hand. This flexibility allows us to tailor the network to better suit our particular needs.

If you are interested in learning more about Neural Networks, here’s an article I highly recommend you read:

The Math Behind Neural Networks

3.1.2: Experience Replay

Let’s move to our next item on the list: experience replay. This is a technique used to stabilize and improve the learning process in DQNs. It involves:

Memory Buffer: Picture this as the agent’s "diary." It stores the agent’s experiences (state, action, reward, next state, done) over time, like how you might write down what happens to you each day.

Random Sampling: During training, the agent flips through random pages of its diary to learn from past experiences. This breaks the sequence of events, helping the agent learn more robustly by preventing it from overfitting to the order of experiences.

3.1.3: Target Network

Finally, the target network is a separate neural network that is used to compute the target Q-values for training. It is identical in architecture to the main network but has frozen weights that are periodically updated to match the main network’s weights. Think of this as a "stable guidebook" for the agent. While the main network is constantly learning and updating, the target network provides stable Q-values for training. It’s like having a reliable, periodically updated manual to refer to, which helps keep learning stable and consistent.

3.2: The DQN Algorithm

With these components in place, the DQN algorithm can be outlined in several key steps:

3.2.1: Forward Pass

First, we start with the forward pass, which is crucial for predicting Q-values. These Q-values store the expected future rewards for taking certain actions in given states. The process begins with the state input.

State Input

The agent observes the current state s from the environment. This state is represented as a vector of features that describe the current situation of the agent. Think of the state as a snapshot of the world around the agent, similar to how your eyes take in the visual scene when you look around. This snapshot includes all the necessary details the agent needs to make a decision.

Q-Value Prediction

Next, this observed state s is fed into the neural network. The neural network processes this input through multiple layers and outputs a set of Q-values Q(s, a; θ). Each Q-value corresponds to a possible action a, with the parameters θ representing the weights and biases of the network.

Q value prediction formula – Image by Author
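In other words, a single forward pass returns one Q-value per available action; writing f_θ for the network (a notational shorthand introduced here), the output is:

f_\theta(s) = \big( Q(s, a_1; \theta), \dots, Q(s, a_n; \theta) \big)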

Imagine the neural network as a complex decision-making machine in the agent’s brain. When it receives the snapshot (state), it processes this information through several stages (layers) to figure out the potential outcomes (Q-values) for different actions. It’s like your brain thinking through different possible actions you could take based on what you see.

Action Selection

The agent then selects the action a∗ with the highest Q-value as its next move, following the greedy action selection policy:

Action Selection Formula – Image by Author
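In symbols, the greedy choice is:

a^* = \arg\max_{a} Q(s, a; \theta)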

This is akin to deciding on the best move after thinking through all the options. The agent picks the action it believes will lead to the highest reward, much like you choose the path that seems most promising based on what you see and understand.

3.2.2: Experience Replay

Next, we move on to experience replay, which helps stabilize and improve the learning process.

Store Experience

After the agent takes an action a and receives a reward r and a new state s′, it stores this experience as a tuple (s, a, r, s′, done) in a replay buffer. The variable done indicates whether the episode has ended. Think of the replay buffer as a diary where the agent writes down its experiences, much like jotting down notable events in your day.

Sample Mini-Batch

During training, a mini-batch of experiences

Mini-Batch in Deep Q-Networks – Image by Author

is randomly sampled from the replay buffer. This batch is used to update the network by computing target Q-values and minimizing the loss. When the agent trains, it flips through random pages of its diary to learn from past experiences. This random sampling helps break the sequence of events and provides diverse learning examples, much like reviewing different days in your diary to gain a broader perspective.

3.2.3: Backpropagation

The final step involves backpropagation, which updates the network to improve its predictions.

Compute Target Q-Values

For each experience in the mini-batch, the agent computes the target Q-value y. If the next state s′ is terminal (i.e., done is true), the target Q-value is simply the reward r. Otherwise, it is the reward plus the discounted maximum Q-value of the next state s′ predicted by the target network Q_target:
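y =
\begin{cases}
r & \text{if } s' \text{ is terminal (done)} \\
r + \gamma \max_{a'} Q_{\text{target}}(s', a') & \text{otherwise}
\end{cases}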

Here, γ is the discount factor (0 ≤ γ < 1). This step is like planning ahead based on past experiences. If the experience ends a journey (episode), the target is the reward received. If it continues, the target includes the expected future rewards, similar to how you plan your actions considering both immediate and future benefits.

Loss Calculation

Next, the loss is calculated as the mean squared error between the predicted Q-values Q(s_i, a_i; θ) from the main network and the target Q-values y_i:

Loss Formula – Image by Author
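Written out over a mini-batch of N experiences, the mean squared error is:

\text{Loss}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - Q(s_i, a_i; \theta) \right)^2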

Calculating the loss is like evaluating how far off your predictions were from what happened. It’s like checking how accurate your guess was compared to the actual outcome and noting the difference.

Backpropagation and Optimization

Finally, backpropagation is performed to minimize this loss. The computed loss is backpropagated through the network to update the weights using an optimization algorithm such as stochastic gradient descent (SGD) or Adam. This process adjusts the network parameters θ to minimize the loss:

Backpropagation Formula – Image by Author
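In its simplest (plain gradient descent) form, the update is:

\theta \leftarrow \theta - \alpha \, \nabla_\theta \text{Loss}(\theta)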

Here, α is the learning rate, and ∇_θ Loss represents the gradient of the loss with respect to the network parameters. Backpropagation is like learning from your mistakes. When you realize how far off your predictions were (the loss), you adjust your strategy (the network weights) to improve your future decisions. It’s like fine-tuning your approach based on feedback to get better results next time.

Using this architecture, the agent iteratively improves its policy. It learns to take actions that maximize cumulative rewards over time. The combination of neural networks, experience replay, and target networks allows DQNs to learn in complex, high-dimensional environments effectively. This process continues until the agent becomes proficient in navigating its environment.

4: Implementing a Deep Q-Network from Scratch

In this section, we’ll walk through the implementation of a Deep Q-Network (DQN) from scratch. By the end of this section, you’ll have a clear understanding of how to build and train a DQN in Python.

We’ll use the Lunar Lander environment from OpenAI Gym. In this environment, the goal is to control a lunar lander and successfully land it on a designated landing pad. The lander must navigate through the environment, using thrusters to control its movement and orientation. The environment is free for commercial use. You can find more details about the license and usage rights on the OpenAI Gym GitHub page.

You can find all the code we will cover today here:

Reinforcement-Learning/3. Deep Q-Network/main.py at main · cristianleoo/Reinforcement-Learning

4.1: Setting up the Environment

We’ll use OpenAI Gym’s LunarLander environment, which provides a challenging and interesting problem for our agent to solve.

import os
import pickle
import gym
import torch
import torch.nn as nn
import torch.optim as optim
from collections import deque
import random
import optuna

Here, we import the necessary libraries. gym is used for the environment, torch is for building and training our neural network, and collections, random, and optuna help with experience replay and hyperparameter optimization.

env = gym.make('LunarLander-v2', render_mode="rgb_array")
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n

We initialize the LunarLander environment and retrieve the dimensions of the state and action spaces. state_dim represents the number of features in the state, and action_dim represents the number of possible actions.
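As a quick sanity check, you can print these values; for LunarLander-v2 they should come out to 8 state features and 4 discrete actions (do nothing, fire left engine, fire main engine, fire right engine), though it is worth verifying against your Gym version:

print(f"State dimensions: {state_dim}")    # expected: 8
print(f"Number of actions: {action_dim}")  # expected: 4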

4.2: Building the Deep Neural Network

For our deep neural network, we will create a class called DQN. This class defines a neural network with three fully connected layers. The input layer receives the state representation, the hidden layers process this information through linear transformations and ReLU activation functions, and the output layer produces the Q-values for each possible action.

First, take a look at the code, and then let’s break it down:

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, action_dim)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

4.2.1: Class Initialization

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, action_dim)

We define a class named DQN that inherits from nn.Module, a base class for all neural network modules in PyTorch. This allows us to leverage PyTorch’s built-in functions and features for neural networks.

The __init__ method is a special method that initializes the object’s attributes. In our case, it sets up the layers of the neural network.

Fully Connected Layers:

We define three fully connected (linear) layers:

  • self.fc1 = nn.Linear(state_dim, 128): The first layer takes the input state dimension (number of features in the state) and maps it to 128 neurons.
  • self.fc2 = nn.Linear(128, 128): The second layer takes the 128 neurons from the first layer and maps them to another 128 neurons.
  • self.fc3 = nn.Linear(128, action_dim): The third layer takes the 128 neurons from the second layer and maps them to the action dimension (number of possible actions).

Each nn.Linear layer performs a linear transformation on the input data:

Linear Transformation – Image by Author
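That is:

y = W x + b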

where x is the input, W is the weight matrix, and b is the bias vector.

4.2.2: Forward Method

The forward method defines how data flows through the network. This method is automatically invoked when you pass data through the network.

def forward(self, x):
    x = torch.relu(self.fc1(x))
    x = torch.relu(self.fc2(x))
    return self.fc3(x)

In the first layer, the input data x is passed through the first fully connected layer (self.fc1). The output is then transformed using the ReLU (Rectified Linear Unit) activation function:

x = torch.relu(self.fc1(x))

The ReLU activation function is defined as:

ReLU activation formula – Image by Author
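That is:

\text{ReLU}(x) = \max(0, x)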

It introduces non-linearity into the model, allowing the network to learn more complex functions.

In the second layer, the output from the first layer is passed through the second fully connected layer (self.fc2) and transformed using the ReLU activation function again:

x = torch.relu(self.fc2(x))

Finally, in the output layer, the output from the second layer is passed through the third fully connected layer (self.fc3) without an activation function:

return self.fc3(x)

This layer produces the final Q-values for each action. Each value represents the expected future reward for taking that action in the given state.
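As a quick sketch of how this class is used (assuming the LunarLander dimensions of 8 state features and 4 actions), we can pass a dummy state through the network and inspect the resulting Q-values:

net = DQN(state_dim=8, action_dim=4)   # LunarLander: 8 state features, 4 actions
dummy_state = torch.randn(1, 8)        # a batch containing a single random state
q_values = net(dummy_state)            # shape: (1, 4), one Q-value per action
best_action = q_values.argmax(dim=1).item()
print(q_values, best_action)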

4.3: Implementing Experience Replay

The ReplayBuffer class provides a mechanism to store and sample experiences, which is essential for stabilizing and improving the learning process in DQNs. In doing so, it enables the agent to learn from a diverse set of past experiences, enhancing its ability to generalize and perform well in complex environments.

class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        state, action, reward, next_state, done = zip(*random.sample(self.buffer, batch_size))
        return state, action, reward, next_state, done

    def __len__(self):
        return len(self.buffer)

4.3.1: Class Initialization

class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

The __init__ method initializes a deque (double-ended queue) with a fixed capacity. A deque allows efficient appends and pops from both ends, which makes it well suited for implementing a fixed-size experience buffer.

self.buffer = deque(maxlen=capacity) actually creates the deque that can hold up to capacity experiences. When the buffer is full, adding a new experience will automatically remove the oldest one.

4.3.2: Push Method

def push(self, state, action, reward, next_state, done):
    self.buffer.append((state, action, reward, next_state, done))

The push method adds a new experience to the buffer. Each experience is a tuple consisting of state, action, reward, next_state, and done:

  • state: The current state.
  • action: The action taken by the agent.
  • reward: The reward received after taking the action.
  • next_state: The state the agent transitions to after taking the action.
  • done: A boolean indicating whether the episode has ended.

4.3.3: Sample Method

def sample(self, batch_size):
    state, action, reward, next_state, done = zip(*random.sample(self.buffer, batch_size))
    return state, action, reward, next_state, done

The sample method retrieves a random batch of experiences from the buffer.

random.sample(self.buffer, batch_size) randomly selects batch_size experiences from the buffer.

zip(*random.sample(self.buffer, batch_size)) unpacks the list of experiences into separate tuples for state, action, reward, next_state, and done.

The method returns these tuples as the sampled experiences.

4.3.4: Length Method

def __len__(self):
    return len(self.buffer)

The __len__ method returns the current number of experiences stored in the buffer.
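Putting the buffer together, here is a short usage sketch; the transition values are arbitrary placeholders rather than real environment data:

buffer = ReplayBuffer(capacity=10000)

# Store a few dummy transitions (normally these come from env.step)
for _ in range(100):
    buffer.push(state=[0.0] * 8, action=0, reward=1.0, next_state=[0.0] * 8, done=False)

# Once enough experiences are collected, sample a mini-batch for training
if len(buffer) >= 64:
    states, actions, rewards, next_states, dones = buffer.sample(64)
    print(len(states))  # 64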

4.4: Implementing the Target Network

By using a target network, we provide a stable set of Q-values for training, which helps stabilize the learning process and improve the agent’s performance in complex environments. The target network is updated less frequently than the main network, ensuring that the Q-value estimates used for updating the main network’s weights remain stable.

We will implement the target network inside a class called DQNTrainer, which manages the training process of the DQN, including the main and target networks, the optimizer, and the replay buffer.

class DQNTrainer:
    def __init__(self, env, main_network, target_network, optimizer, replay_buffer, model_path='model/model.pth', gamma=0.99, batch_size=64, target_update_frequency=1000):
        self.env = env
        self.main_network = main_network
        self.target_network = target_network
        self.optimizer = optimizer
        self.replay_buffer = replay_buffer
        self.model_path = model_path
        self.gamma = gamma
        self.batch_size = batch_size
        self.target_update_frequency = target_update_frequency
        self.step_count = 0

        # Load the model if it exists
        if os.path.exists(os.path.dirname(self.model_path)):
            if os.path.isfile(self.model_path):
                self.main_network.load_state_dict(torch.load(self.model_path))
                self.target_network.load_state_dict(torch.load(self.model_path))
                print("Loaded model from disk")
        else:
            os.makedirs(os.path.dirname(self.model_path))

    def train(self, num_episodes, save=True):
        total_rewards = []
        for episode in range(num_episodes):
            state, _ = self.env.reset()  # Extract the state from the returned tuple
            done = False
            total_reward = 0

            while not done:
                self.env.render()  # Add this line to render the environment
                # Ensure the state is in the correct shape by adding an extra dimension
                action = self.main_network(torch.FloatTensor(state).unsqueeze(0)).argmax(dim=1).item()
                next_state, reward, done, _, _ = self.env.step(action)  # Extract the next_state from the returned tuple
                self.replay_buffer.push(state, action, reward, next_state, done)
                state = next_state
                total_reward += reward

                if len(self.replay_buffer) >= self.batch_size:
                    self.update_network()

            total_rewards.append(total_reward)
            print(f"Episode {episode}, Total Reward: {total_reward}")

        # Save the model after training
        if save:
            torch.save(self.main_network.state_dict(), self.model_path)
            print("Saved model to disk")

        self.env.close()
        return sum(total_rewards) / len(total_rewards)  # Return average reward

    def update_network(self):
        state_batch, action_batch, reward_batch, next_state_batch, done_batch = self.replay_buffer.sample(self.batch_size)

        # Convert to tensors
        state_batch = torch.FloatTensor(state_batch)
        action_batch = torch.LongTensor(action_batch)
        reward_batch = torch.FloatTensor(reward_batch)
        next_state_batch = torch.FloatTensor(next_state_batch)
        done_batch = torch.FloatTensor(done_batch)

        # Calculate the current Q-values
        q_values = self.main_network(state_batch).gather(1, action_batch.unsqueeze(1)).squeeze(1)

        # Calculate the target Q-values
        next_q_values = self.target_network(next_state_batch).max(1)[0]
        expected_q_values = reward_batch + self.gamma * next_q_values * (1 - done_batch)

        # Compute the loss
        loss = nn.MSELoss()(q_values, expected_q_values.detach())

        # Optimize the model
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Periodically update the target network
        if self.step_count % self.target_update_frequency == 0:
            self.target_network.load_state_dict(self.main_network.state_dict())

        self.step_count += 1

4.4.1: Class Definition

class DQNTrainer:
    def __init__(self, env, main_network, target_network, optimizer, replay_buffer, model_path='model/model.pth', gamma=0.99, batch_size=64, target_update_frequency=1000):
        self.env = env
        self.main_network = main_network
        self.target_network = target_network
        self.optimizer = optimizer
        self.replay_buffer = replay_buffer
        self.model_path = model_path
        self.gamma = gamma
        self.batch_size = batch_size
        self.target_update_frequency = target_update_frequency
        self.step_count = 0

The __init__ method initializes various components needed for training:

  • env: The environment in which the agent operates.
  • main_network: The main neural network that is being trained.
  • target_network: The target neural network used for stabilizing Q-value estimates.
  • optimizer: The optimizer used for updating the weights of the main network.
  • replay_buffer: The buffer for storing and sampling experiences.
  • model_path: Path to save/load the trained model.
  • gamma: The discount factor for future rewards.
  • batch_size: The number of experiences sampled from the replay buffer for each training step.
  • target_update_frequency: The frequency at which the target network’s weights are updated to match the main network’s weights.
  • step_count: A counter to keep track of the number of steps taken during training.

4.4.2: Model Loading

# Load the model if it exists
        if os.path.exists(os.path.dirname(self.model_path)):
            if os.path.isfile(self.model_path):
                self.main_network.load_state_dict(torch.load(self.model_path))
                self.target_network.load_state_dict(torch.load(self.model_path))
                print("Loaded model from disk")
        else:
            os.makedirs(os.path.dirname(self.model_path))

We check if the directory for the model path exists using os.path.exists(os.path.dirname(self.model_path)). If a saved model exists, it is loaded to continue training from where it left off:

if os.path.isfile(self.model_path):
    self.main_network.load_state_dict(torch.load(self.model_path))
    self.target_network.load_state_dict(torch.load(self.model_path))
    print("Loaded model from disk")

torch.load loads the saved model weights into the main and target networks using load_state_dict. If the model directory does not exist, it is created using os.makedirs.

4.5: Training the Deep Q-Network

Next, we’ll implement the training loop to train our DQN. This method takes place inside the DQNTrainer. It runs the training loop for the DQN, where the agent interacts with the environment, collects experiences, updates the network, and tracks performance.

Here’s the code for the training loop:

def train(self, num_episodes, save=True):
    total_rewards = []
    for episode in range(num_episodes):
        state, _ = self.env.reset()
        done = False
        total_reward = 0

        while not done:
            self.env.render()  # Add this line to render the environment
            action = self.main_network(torch.FloatTensor(state).unsqueeze(0)).argmax(dim=1).item()
            next_state, reward, done, _, _ = self.env.step(action)
            self.replay_buffer.push(state, action, reward, next_state, done)
            state = next_state
            total_reward += reward

            if len(self.replay_buffer) >= self.batch_size:
                self.update_network()

        total_rewards.append(total_reward)
        print(f"Episode {episode}, Total Reward: {total_reward}")

    if save:
        torch.save(self.main_network.state_dict(), self.model_path)
        print("Saved model to disk")

    self.env.close()
    return sum(total_rewards) / len(total_rewards)

The train method runs the training loop for a specified number of episodes. This loop is crucial for the agent to gain experience and improve its decision-making skills.

4.5.1: Training Loop

Let’s first initialize total_rewards as an empty list:

total_rewards = []

Let’s now build our training loop:

for episode in range(num_episodes):

This loop runs for the specified number of episodes. Each episode represents a complete interaction sequence with the environment.

4.5.2: Reset Environment

At the start of each episode, the environment is reset to its initial state.

state, _ = self.env.reset()
done = False
total_reward = 0

  • self.env.reset() initializes the environment and returns the initial state.
  • done = False indicates that the episode is not finished.
  • total_reward = 0 initializes the total reward for the current episode.

4.5.3: Action Selection

The agent selects an action using the main network based on the current state.

action = self.main_network(torch.FloatTensor(state).unsqueeze(0)).argmax(dim=1).item()

torch.FloatTensor(state).unsqueeze(0) converts the state to a PyTorch tensor and adds an extra dimension to match the network’s expected input shape.

self.main_network(...).argmax(dim=1).item() selects the action with the highest Q-value predicted by the main network.

4.5.4: Step and Store Experience

The agent takes the selected action, observes the reward and next state, and stores the experience in the replay buffer.

next_state, reward, done, _, _ = self.env.step(action)
self.replay_buffer.push(state, action, reward, next_state, done)
state = next_state
total_reward += reward

  • self.env.step(action) performs the action and returns the next state, reward, and whether the episode is done.
  • self.replay_buffer.push(...) stores the experience in the replay buffer.
  • state = next_state updates the current state to the next state.
  • total_reward += reward accumulates the reward for the current episode.

4.5.5: Update Network

If the replay buffer has enough experience, the network is updated.

if len(self.replay_buffer) >= self.batch_size:
    self.update_network()

if len(self.replay_buffer) >= self.batch_size checks if the replay buffer has at least batch_size experiences.

self.update_network() updates the network using a batch of experiences from the replay buffer.

4.5.6: End of Episode

The total reward is recorded and printed at the end of each episode.

total_rewards.append(total_reward)
print(f"Episode {episode}, Total Reward: {total_reward}")

total_rewards.append(total_reward) adds the total reward for the current episode to the list of total rewards.

print(f"Episode {episode}, Total Reward: {total_reward}") prints the episode number and total reward.

4.5.7: Save Model

After training, the model is saved to disk.

if save:
    torch.save(self.main_network.state_dict(), self.model_path)
    print("Saved model to disk")

if save: checks if the save flag is True.

torch.save(self.main_network.state_dict(), self.model_path) saves the state dictionary of the main network to the specified file path.

4.5.8: Return Average Reward

Finally, the method closes the environment and returns the average reward over all episodes.

self.env.close()
return sum(total_rewards) / len(total_rewards)

self.env.close() closes the environment.

return sum(total_rewards) / len(total_rewards) calculates and returns the average reward.

4.6: Tuning the Model

Finally, we’ll look at how to evaluate and tune the trained model. Let’s build an Optimizer class, which will be responsible for optimizing hyperparameters to improve the performance of the DQN.

class Optimizer:
    def __init__(self, env, main_network, target_network, replay_buffer, model_path, params_path='params.pkl'):
        self.env = env
        self.main_network = main_network
        self.target_network = target_network
        self.replay_buffer = replay_buffer
        self.model_path = model_path
        self.params_path = params_path

    def objective(self, trial, n_episodes=10):
        lr = trial.suggest_loguniform('lr', 1e-5, 1e-1)
        gamma = trial.suggest_uniform('gamma', 0.9, 0.999)
        batch_size = trial.suggest_categorical('batch_size', [32, 64, 128])
        target_update_frequency = trial.suggest_categorical('target_update_frequency', [500, 1000, 2000])

        optimizer = optim.Adam(self.main_network.parameters(), lr=lr)
        trainer = DQNTrainer(self.env, self.main_network, self.target_network, optimizer, self.replay_buffer, self.model_path, gamma=gamma, batch_size=batch_size, target_update_frequency=target_update_frequency)
        reward = trainer.train(n_episodes, save=False)
        return reward

    def optimize(self, n_trials=100, save_params=True):
        if not TRAIN and os.path.isfile(self.params_path):
            with open(self.params_path, 'rb') as f:
                best_params = pickle.load(f)
            print("Loaded parameters from disk")
        elif not FINETUNE:
            best_params = {
                'lr': LEARNING_RATE, 
                'gamma': GAMMA, 
                'batch_size': BATCH_SIZE, 
                'target_update_frequency': TARGET_UPDATE_FREQUENCY
                }
            print(f"Using default parameters: {best_params}")
        else:
            print("Optimizing hyperparameters")
            study = optuna.create_study(direction='maximize')
            study.optimize(self.objective, n_trials=n_trials)
            best_params = study.best_params

            if save_params:
                with open(self.params_path, 'wb') as f:
                    pickle.dump(best_params, f)
                print("Saved parameters to disk")

        return best_params

4.6.1: Class Definition

class Optimizer:
    def __init__(self, env, main_network, target_network, replay_buffer, model_path, params_path='params.pkl'):
        self.env = env
        self.main_network = main_network
        self.target_network = target_network
        self.replay_buffer = replay_buffer
        self.model_path = model_path
        self.params_path = params_path

The __init__ method initializes various components needed for optimization:

  • env: The environment in which the agent operates.
  • main_network: The main neural network.
  • target_network: The target neural network.
  • replay_buffer: The buffer for storing and sampling experiences.
  • model_path: Path to save/load the trained model.
  • params_path: Path to save/load the best hyperparameters.

4.6.2: Objective Method

def objective(self, trial, n_episodes=10):
        lr = trial.suggest_loguniform('lr', 1e-5, 1e-1)
        gamma = trial.suggest_uniform('gamma', 0.9, 0.999)
        batch_size = trial.suggest_categorical('batch_size', [32, 64, 128])
        target_update_frequency = trial.suggest_categorical('target_update_frequency', [500, 1000, 2000])

        optimizer = optim.Adam(self.main_network.parameters(), lr=lr)
        trainer = DQNTrainer(self.env, self.main_network, self.target_network, optimizer, self.replay_buffer, self.model_path, gamma=gamma, batch_size=batch_size, target_update_frequency=target_update_frequency)
        reward = trainer.train(n_episodes, save=False)
        return reward

The objective method suggests values for hyperparameters and trains the model with these values.

  • lr = trial.suggest_loguniform('lr', 1e-5, 1e-1): Suggests a learning rate within the range [1e-5, 1e-1].
  • gamma = trial.suggest_uniform('gamma', 0.9, 0.999): Suggests a discount factor within the range [0.9, 0.999].
  • batch_size = trial.suggest_categorical('batch_size', [32, 64, 128]): Suggests a batch size from the specified list.
  • target_update_frequency = trial.suggest_categorical('target_update_frequency', [500, 1000, 2000]): Suggests a target update frequency from the specified list.

optimizer = optim.Adam(self.main_network.parameters(), lr=lr)

Here, we set up an Adam optimizer with the suggested learning rate. Adam (short for Adaptive Moment Estimation) is an optimization algorithm commonly used in training neural networks.

For each parameter in the neural network, Adam calculates the gradient of the loss function with respect to that parameter. It keeps track of the exponential moving averages of the gradients (first moment, denoted as m) and the squared gradients (second moment, denoted as v).

To account for the initialization bias of the moving averages, Adam applies bias correction to both the first and second-moment estimates. The parameters are then updated using the corrected first and second moments. The update rule is designed to incorporate the learning rate and the moments, adjusting the parameters in a way that considers both the magnitude and the direction of the gradients.
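For reference, the standard Adam update can be written as follows, where g_t is the gradient at step t, β₁ and β₂ are the decay rates of the moment estimates, and ε is a small constant for numerical stability:

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
\theta_t = \theta_{t-1} - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}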

Here’s a more comprehensive article about Adam:

The Math behind Adam Optimizer

trainer = DQNTrainer(self.env, self.main_network, self.target_network, optimizer, self.replay_buffer, self.model_path, gamma=gamma, batch_size=batch_size, target_update_frequency=target_update_frequency)

This initializes the DQNTrainer instance with the suggested hyperparameters.

reward = trainer.train(n_episodes, save=False)

Finally, this line trains the model for a specified number of episodes and returns the average reward.

4.6.3: Optimize Method

In this section, we will use Optuna, a Python library that will help us systematically explore the hyperparameter space, efficiently finding the combination that maximizes the model’s performance.

def optimize(self, n_trials=100, save_params=True):
        if not TRAIN and os.path.isfile(self.params_path):
            with open(self.params_path, 'rb') as f:
                best_params = pickle.load(f)
            print("Loaded parameters from disk")
        elif not FINETUNE:
            best_params = {
                'lr': LEARNING_RATE, 
                'gamma': GAMMA, 
                'batch_size': BATCH_SIZE, 
                'target_update_frequency': TARGET_UPDATE_FREQUENCY
                }
            print(f"Using default parameters: {best_params}")
        else:
            print("Optimizing hyperparameters")
            study = optuna.create_study(direction='maximize')
            study.optimize(self.objective, n_trials=n_trials)
            best_params = study.best_params

            if save_params:
                with open(self.params_path, 'wb') as f:
                    pickle.dump(best_params, f)
                print("Saved parameters to disk")

        return best_params

The optimize method runs the optimization process for a specified number of trials.

if not TRAIN and os.path.isfile(self.params_path):
            with open(self.params_path, 'rb') as f:
                best_params = pickle.load(f)
            print("Loaded parameters from disk")

If training is not required (not TRAIN) and the parameters file exists, the parameters are loaded from disk.

elif not FINETUNE:
            best_params = {
                'lr': LEARNING_RATE, 
                'gamma': GAMMA, 
                'batch_size': BATCH_SIZE, 
                'target_update_frequency': TARGET_UPDATE_FREQUENCY
                }
            print(f"Using default parameters: {best_params}")

If fine-tuning is not required (not FINETUNE), default parameters are used.

else:
            print("Optimizing hyperparameters")
            study = optuna.create_study(direction='maximize')
            study.optimize(self.objective, n_trials=n_trials)
            best_params = study.best_params

            if save_params:
                with open(self.params_path, 'wb') as f:
                    pickle.dump(best_params, f)
                print("Saved parameters to disk")

If hyperparameter optimization is needed, Optuna is used to find the best parameters.

study = optuna.create_study(direction='maximize') creates an Optuna study to maximize the objective function.

study.optimize(self.objective, n_trials=n_trials) runs the optimization for the specified number of trials.

If save_params is True, the best parameters are saved to disk.

Here’s a nice article that explores different fine-tuning techniques, including a deep dive into Optuna:

The Math Behind Fine-Tuning Deep Neural Networks

4.7: Running the Model

Finally, let’s recap everything, and run the code!

4.7.1: Setting Training and Fine-tuning

TRAIN = True
FINETUNE = False

# Set the following hyperparameters if FINETUNE is False
GAMMA = 0.99
BATCH_SIZE = 64
TARGET_UPDATE_FREQUENCY = 1000
LEARNING_RATE = 1e-3

TRAIN = True indicates whether to train the model. If set to False, training will be skipped.

FINETUNE = False indicates whether to fine-tune the model. If set to True, existing parameters will be used and fine-tuned.

If FINETUNE is False, we set the following hyperparameters:

  • GAMMA = 0.99: The discount factor for future rewards. This determines how much future rewards are valued compared to immediate rewards.
  • BATCH_SIZE = 64: The number of experiences sampled from the replay buffer for each training step.
  • TARGET_UPDATE_FREQUENCY = 1000: The frequency (in steps) at which the target network’s weights are updated to match the main network’s weights.
  • LEARNING_RATE = 1e-3: The learning rate for the optimizer, which controls how much to change the model in response to the estimated error each time the model weights are updated.

4.7.2: Initializing Networks and Replay Buffer

main_network = DQN(state_dim, action_dim)
target_network = DQN(state_dim, action_dim)
target_network.load_state_dict(main_network.state_dict())
target_network.eval()

replay_buffer = ReplayBuffer(10000)

main_network = DQN(state_dim, action_dim) initializes the main network with the specified state and action dimensions.

target_network = DQN(state_dim, action_dim) initializes the target network with the same architecture as the main network.

target_network.load_state_dict(main_network.state_dict()) copies the weights from the main network to the target network.

target_network.eval() sets the target network to evaluation mode. This ensures that certain layers (like dropout and batch normalization) behave appropriately during inference.

replay_buffer = ReplayBuffer(10000) initializes the replay buffer with a capacity to store 10,000 experiences.

4.7.3: Setting Step Count

STEP_COUNT = 0

STEP_COUNT = 0 initializes a counter to keep track of the number of steps taken during training.

4.7.4: Optimizer Initialization and Hyperparameter Optimization

optimizer = Optimizer(env, main_network, target_network, replay_buffer, f'{os.path.dirname(__file__)}/model/model.pth', f'{os.path.dirname(__file__)}/model/params.pkl')
best_params = optimizer.optimize(n_trials=2, save_params=True)

optimizer = Optimizer(...) initializes the Optimizer class with the environment, networks, replay buffer, model path, and parameters path.

best_params = optimizer.optimize(n_trials=2, save_params=True) runs the optimization process to find the best hyperparameters. This function:

  • Runs the optimization for a specified number of trials (n_trials=2).
  • Saves the best hyperparameters to disk if save_params is True.

4.7.5: Creating the PyTorch Optimizer and DQN Trainer

optimizer = optim.Adam(main_network.parameters(), lr=best_params['lr'])
trainer = DQNTrainer(env, main_network, target_network, optimizer, replay_buffer, f'{os.path.dirname(__file__)}/model/model.pth', gamma=best_params['gamma'], batch_size=best_params['batch_size'], target_update_frequency=best_params['target_update_frequency'])
trainer.train(1000)

optimizer = optim.Adam(main_network.parameters(), lr=best_params['lr']) creates an Adam optimizer with the learning rate from the best hyperparameters.

trainer = DQNTrainer(...) initializes the DQNTrainer class with the environment, networks, optimizer, replay buffer, model path, and hyperparameters from the best parameters.

trainer.train(1000) trains the model for 1000 episodes.

Now let’s take a look at the agent in its first 10 episodes of training:

Agent in its first 10 training episodes (2x speed) – Animation by Author

Here, the model is clumsy, making random and often suboptimal decisions. This is expected as the agent is still exploring the environment and learning the basics. It hasn’t yet developed a robust strategy for maximizing rewards. Over time, with more training episodes, the agent’s performance should improve significantly as it refines its policy and learns from its experiences.

Now let’s look at 10 episodes after the model has been trained for 1,000 episodes:

Agent after training over 100 episodes (2x speed) – Animation by Author

This is a noticeable improvement. While the model might not be ready for NASA just yet, we can observe several key advancements:

  • The agent makes more deliberate and strategic decisions.
  • It navigates the environment more efficiently.
  • The frequency of suboptimal actions has decreased significantly.

With continued training and fine-tuning, the agent’s performance will likely improve even further, bringing it closer to optimal behavior.

Now it’s your turn to enhance the model. Take this code and make it your own. Try tuning the hyperparameters, experiment with different model architectures, and see how far you can push it. With some creativity and persistence, you’ll have that shuttle landing smoothly in no time!

5: Conclusion

Now that you have a solid understanding of how to build, train, and evaluate a Deep Q-network, I encourage you to experiment with different environments. Try testing this DQN on various environments and observe how it adapts to different challenges.

Implement advanced techniques and explore new architectures to improve your agent’s performance. For example, you could try to set different hyperparameters, use a different optimization algorithm (like SGD or Nadam), use a different fine-tuning algorithm, and so on!

References

  1. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
  2. Lin, L. J. (1992). "Self-improving reactive agents based on reinforcement learning, planning and teaching." Machine Learning, 8(3–4), 293–321.
  3. OpenAI. "LunarLander-v2." OpenAI Gym. https://gym.openai.com/envs/LunarLander-v2/
  4. Berkeley AI Research (BAIR). "Experience Replay." https://bair.berkeley.edu/blog/2020/03/20/experiencereplay/
  5. Towards Data Science. "Reinforcement Learning 101: Q-Learning." https://towardsdatascience.com/reinforcement-learning-101-e24b50e1d292
  6. Towards Data Science. "The Math Behind Neural Networks." https://towardsdatascience.com/the-math-behind-neural-networks-3a18b7f8d8dc
  7. Towards Data Science. "The Math Behind Adam Optimizer." https://towardsdatascience.com/the-math-behind-adam-optimizer-3a18b7f8d8dc
  8. Towards Data Science. "The Math Behind Fine-Tuning Deep Neural Networks." https://towardsdatascience.com/the-math-behind-fine-tuning-deep-neural-networks-3a18b7f8d8dc

Congratulations on making it to the end! I hope you found this article informative and enjoyable. If you did, please consider leaving a clap and following me for more articles like this one. I’d love to hear your thoughts on the article and any topics you’d like to see covered in the future. Your feedback and support are greatly appreciated. Thank you for reading!

