
Dynamic Pricing with Reinforcement Learning from Scratch: Q-Learning

An introduction to Q-Learning with a practical Python example

Exploring prices to find the optimal action-state values to maximize profit. Image by author.

Table of contents

  1. Introduction
  2. A primer on Reinforcement Learning
     2.1 Key concepts
     2.2 Q-function
     2.3 Q-value
     2.4 Q-Learning
     2.5 The Bellman equation
     2.6 Exploration vs. exploitation
     2.7 Q-Table

  3. The Dynamic Pricing problem
     3.1 Problem statement
     3.2 Implementation

  4. Conclusions
  5. References

1. Introduction

In this post, we introduce the core concepts of Reinforcement Learning and dive into Q-Learning, an approach that empowers intelligent agents to learn optimal policies by making informed decisions based on rewards and experiences.

We also share a practical Python example built from the ground up. In particular, we train an agent to master the art of pricing, a crucial aspect of business, so that it can learn how to maximize profit.

Without further ado, let us begin our journey.

2. A primer on Reinforcement Learning

2.1 Key concepts

Reinforcement Learning (RL) is an area of Machine Learning where an agent learns to accomplish a task by trial and error.

In brief, the agent tries actions that receive positive or negative feedback through a reward mechanism. The agent adjusts its behavior to maximize the reward, thus learning the best course of action to achieve the final goal.

Let us introduce the key concepts of RL through a practical example. Imagine a simplified arcade game, where a cat should navigate a maze to collect treasures – a glass of milk and a ball of yarn – while avoiding construction sites:

Image by author.
  1. The agent is the one choosing the course of actions. In the example, the agent is the player who controls the joystick deciding the next move of the cat.
  2. The environment is the context in which the agent is operating. In our case, a two-dimensional maze.
  3. An action a is the elementary move that takes the agent from one state to another. In this game, the player has a finite set of possible actions to choose from: up, left, down and right.
  4. The state s represents the current situation of the player and the environment. It includes information such as the cat’s current and allowed positions, as well as the location of treasures and traps, and any other feature relevant to the game state (points, remaining lives, …).
  5. The reward r represents the feedback assigned to the result of taking an action. For example, the game may assign: • +5 points when reaching the ball of yarn, • +10 points for the glass of milk, • -1 point for an empty cell, • -10 points for a construction site.

The described RL framework is depicted in the following figure:

RL framework. Image by author.

Our goal is to learn a policy π, i.e., the set of rules that the agent follows to choose actions maximizing the cumulative reward, thus achieving its target.

We can learn the optimal policy π* directly, or indirectly by learning the values (rewards) of action-state pairs, and using them to decide the best course of action. These two strategies are named policy-based and value-based, respectively. Let us now introduce Q-Learning, a popular value-based approach.

2.2 Q-function

We introduce the Q-function, denoted as Q(s,a), representing the expected cumulative reward an agent can achieve by taking action a in state s, while following the policy π:

Q-function. Image by author.

In the equation:

  • π is the policy being followed by the agent.
  • s is the current state.
  • a is the action taken in that state.
  • r is the reward associated with the given action and state.
  • t represents the current time step.
  • γ is the discount factor. It expresses the agent’s preference for immediate rewards over delayed future rewards: values close to 0 make the agent myopic, while values close to 1 make it weigh future rewards more heavily.
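
For reference, a standard way of writing the Q-function with the symbols above is:

Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \mid s_t = s,\ a_t = a \right]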

2.3 Q-value

The Q-value refers to the numeric value assigned by the Q-function to a specific state-action pair. In our example, the Q-value provides the expected cumulative reward the player could obtain by moving the cat in a new position inside the maze through a specific action, starting from a certain state. In brief, it tells how "good" the player’s choice is.

2.4 Q-Learning

Given the concept of Q-value, the Q-Learning algorithm works as follows:

  1. Initialize the Q-values arbitrarily, e.g. Q(s, a) = 0 ∀ s ∈ S, a ∈ A.
  2. For each episode:
     1. Initialize the state s.
     2. For each step of the episode, until s is terminal:
        1. Choose an action a, observe the reward r and the new state s'.
        2. Update the Q-values using the Bellman equation.
        3. s ← s'.
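
As a minimal sketch of this loop (assuming a generic environment object with hypothetical reset() and step() methods, which is not part of the pricing example below), tabular Q-Learning can be written as:

import numpy as np

def q_learning(env, num_states, num_actions, num_episodes=1000,
               alpha=0.1, gamma=0.9, epsilon=0.2):
    '''Tabular Q-Learning sketch; env is a placeholder environment.'''
    # 1. Initialize the Q-Table with zeros
    q = np.zeros((num_states, num_actions))
    for _ in range(num_episodes):
        # 2.1 Initialize the state
        s = env.reset()
        done = False
        # 2.2 Loop over the steps of the episode until s is terminal
        while not done:
            # Epsilon-greedy choice between exploration and exploitation
            if np.random.rand() < epsilon:
                a = np.random.randint(num_actions)
            else:
                a = np.argmax(q[s])
            # Take the action, observe the reward and the new state
            s_next, r, done = env.step(a)
            # Bellman update of the Q-value for the (state, action) pair
            q[s, a] += alpha * (r + gamma * np.max(q[s_next]) - q[s, a])
            # s <- s'
            s = s_next
    return q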

2.5 The Bellman equation

The Bellman equation allows an agent to express the value of a state-action pair in terms of cumulative expected reward. It is used to update the Q-values in the Q-Learning algorithm as follows:

Bellman equation. Image by author.

In the previous expression:

  • The learning rate α (between 0 and 1) determines how much an agent updates the Q-values based on new experiences.
  • The discount factor γ (between 0 and 1) controls the agent’s preference for immediate rewards over future rewards: a low γ makes the agent myopic, favoring immediate gains, while a high γ makes it weigh future rewards more heavily.
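
Written out with the terms above, the update applied at each step is the standard tabular Q-Learning rule:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]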

2.6 Exploration vs. exploitation

How does the agent choose the next action?

The agent may "explore" new actions, or "exploit" actions known to be associated with a higher reward.

To learn an effective policy, we should strike a balance between exploration and exploitation during training. In our example, we can adopt a straightforward method by defining an exploration probability, i.e. a float between 0 and 1:

  • If a random number from the uniform distribution on (0, 1) is higher than the exploration probability, the agent will perform exploitation, preferring already known actions with a high reward.
  • If the number is smaller than the exploration probability, the agent will perform exploration, encouraging the experimentation of new actions.

This approach is known as the epsilon-greedy algorithm (see Cheng et al., 2023, Appendix C).

2.7 Q-Table

When the problem consists of a finite set of potential actions – such as up, left, down and right – it is possible to simply tabulate all the combinations of states and actions. This table, named Q-Table, is populated with Q-values during training, as the agent explores state-action pairs and collects their associated rewards. In our example:

Updating the Q-Table. Image by author.
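
For instance, assuming for illustration that the maze cells are enumerated as discrete states and the four moves are indexed 0–3, the Q-Table is nothing more than a 2-D array:

import numpy as np

# Hypothetical enumeration: one row per maze cell (state), one column per move (action)
num_states, num_actions = 25, 4   # e.g. a 5x5 maze with up, left, down, right
q_table = np.zeros((num_states, num_actions))

# After training, the best move from a given state is the column with the highest Q-value
best_action = np.argmax(q_table[0])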

3. The Dynamic Pricing problem

Given a product associated with a price and a demand, our goal is to train an intelligent agent that, using Reinforcement Learning, will adjust prices over time to maximize profit:

"Dynamic Pricing is related to price-fixing for perishable resources taking into account demand so that to maximize revenue or profit" (Fleischmann, Hall, Pyke, 2004).

3.1 Problem statement

  • We model a simplified environment with a discrete action space A, where the agent can increase, decrease or keep the price constant: A = {+1, -1, 0}.
  • The action (price manipulation) results in a new demand, and we create discrete demand levels as state S = {Low demand, Medium demand, High demand}.
  • To estimate the new demand (state s) resulting from a price change (action a), we leverage the concept of price elasticity k. Price elasticity measures how a change in price Δp translates into a change in demand Δv, and we assume it to be known in our example (see the formulas sketched after this list):
Image by author.
  • The reward r corresponds to the profit stemming from the application of a price p and the resulting demand v, considering the unit cost c associated with the product:
Reward r is a function of the action (price p) and state (demand v). Image by author.
  • We assign a negative reward when the new price moves too far above or below the initial price, based on an arbitrary threshold. In this way, we penalize strong price variations.
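
In formulas, mirroring the implementation below (p_0 denotes the initial price, c the unit cost and δ the price-variation threshold; these symbols are introduced here only for readability):

\Delta v = k\, \Delta p \quad \Rightarrow \quad v_{t+1} = v_t + k\,(p_{t+1} - p_t)

r_t = \begin{cases} (p_t - c)\, v_t & \text{if } |p_t - p_0| \le \delta\, p_0 \\ -1 & \text{otherwise} \end{cases}

In the implementation, the demand update additionally includes a small random-walk fluctuation, and the profit is scaled by (1 - error_term) to account for uncertainty.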

3.2 Implementation

The DynamicPricingQL class implements the following methods:

  • calculate_demand_level maps a continuous demand volume to a discrete state value (low, medium or high demand level).
  • calculate_demand uses an input price to estimate the volume through price elasticity.
  • fit trains the agent. We decide to interrupt an episode when the maximum number of steps has been reached, or the profit (reward) has reached a certain threshold.
  • get_q_table returns the Q-Table learned by the agent.
  • plot_rewards shows a chart of the rewards achieved during training.
  • predict uses the Q-values to predict the optimal price given a starting price and demand as input.
import numpy as np
from typing import Union
import plotly.express as px

class DynamicPricingQL:
    def __init__(self, 
                 initial_price: int = 1000, 
                 initial_demand: int = 1000000,
                 elasticity: float = -0.01,
                 cost_per_unit: int = 20, 
                 learning_rate: float = 0.1, 
                 discount_factor: float = 0.9, 
                 exploration_prob: float = 0.2, 
                 error_term: float = 0.2, 
                 random_walk_std: float = 0.5, 
                 target_reward_increase: float = 0.2) -> None:
        '''Class that implements a Dynamic Pricing agent using 
        Q-Learning to find the optimal price for a given product.

        Args:
            - initial_price: starting price of the product
            - initial_demand: starting volume of the product
            - elasticity: price elasticity of the product
            - cost_per_unit: unitary cost of the product
            - learning_rate: learning rate for the Bellman equation
            - discount_factor: discount factor for the Bellman equation
            - exploration_prob: controls the exploration-exploitation trade-off
            - error_term: error term added to the reward estimate to account for fluctuations
            - random_walk_std: controls the random walk fluctuations added to the demand estimate
            - target_reward_increase: end the training when the reward reaches this target increase 
        '''
        # Init variables
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
        self.exploration_prob = exploration_prob
        self.initial_price = initial_price
        self.cost_per_unit = cost_per_unit
        self.elasticity = elasticity
        self.error_term = error_term
        self.random_walk_std = random_walk_std
        self.initial_demand = initial_demand
        self.target_reward_increase = target_reward_increase
        self.current_price = initial_price
        self.current_demand = initial_demand

        # Estimate current demand level from the initial demand
        self.current_demand_level = self.calculate_demand_level(
            self.initial_demand)

        # Track whether the training procedure occurred or not
        self.isfit = False

        # The agent can only perform 3 actions: 
        #   - Increase the price
        #   - Decrease the price
        #   - Keep the price constant
        self.num_actions = 3

        # Consider 3 different states as discrete demand level
        self.num_demand_levels = 3

        # Initialize Q-values
        self.q_values = np.zeros((self.num_demand_levels, 
                                  self.num_actions))

        # Store rewards per episode for plotting
        self.episode_rewards = []

    def calculate_demand_level(self, 
                               demand: int, 
                               demand_fraction: float = 0.3) -> int:
        '''Estimate the demand level.
        Demand levels represent the states of the Q-Learning agent.
        In order to turn a continuous demand into a discrete set in three values,
        we use a fraction of the initial value to estimate a low, medium or high demand level.

        Args:
            - demand: current demand for the product
            - demand_fraction: fraction of demand controlling the assignment to the demand levels
        '''
        # Low demand level: 0
        if demand < (1 - demand_fraction) * self.initial_demand:
            return 0
        # High demand level: 2
        elif demand > (1 + demand_fraction) * self.initial_demand:
            return 2
        # Medium demand level: 1
        else:
            return 1

    def calculate_reward(self, 
                         new_price: int,
                         price_fraction: float = 0.2) -> float:
        '''Calculate the reward.
        The reward during an episode is the profit 
        under a certain price (action) and demand.
        We add an error term to account for fluctuations.

        Note: if the price is either too high or too low
        with respect to the initial price, we assign a negative reward.

        Args:
            - new_price: new price of the product
            - price_fraction: penalize price variations above or below this fraction
        '''

        # If the new price is more distant from the initial price
        # than a certain value given by price_fraction
        # then assign a negative reward to penalize high price changes 
        if (new_price > self.initial_price * (1 + price_fraction)
                or new_price < self.initial_price * (1 - price_fraction)):

            # Negative reward to penalize significant price changes
            return -1 

        else:

            # Estimate the demand given the new price
            demand = self.calculate_demand(new_price)

            # Estimate profit given new price and demand
            profit = ((new_price - self.cost_per_unit) *
                      demand *
                      (1 - self.error_term))

            # Return profit as reward for the agent
            return profit

    def calculate_demand(self, 
                         price: int) -> int:
        '''Calculate demand as: 
              current demand + delta(demand) + random walk fluctuation = 
              current demand + elasticity * (price - current price) + random walk fluctuation

        Args:
            - price: price of the product
        '''
        return int(np.floor(self.current_demand +
                self.elasticity * (price - self.current_price) +
                np.random.normal(0, self.random_walk_std)))

    def fit(self,
            num_episodes: int = 1000,
            max_steps_per_episode: int = 100) -> None:
        '''Fit the agent for a num_episodes number of episodes.

        Args:
            - num_episodes: number of episodes
            - max_steps_per_episode: max number of steps for each episode
        '''
        # For each episode
        for episode in range(num_episodes):

            # The state is the current demand level 
            state = self.current_demand_level

            # To interrupt the training procedure
            done = False

            # The reward is zero at the beginning of the episode
            episode_reward = 0

            # Keep track of the training steps
            step = 0

            # Training loop
            while not done:

                # Depending on the exploration probability
                if np.random.rand() < self.exploration_prob:

                    # Explore a new price ...
                    action = np.random.randint(self.num_actions)

                else:

                    # ... or exploit prices known to increase the reward
                    action = np.argmax(self.q_values[state])

                # Set the new price given the action (increase, decrease or leave the price as is)
                new_price = self.current_price + action - 1

                # Calculate the new demand and demand level
                new_demand = self.calculate_demand(new_price)
                new_demand_level = self.calculate_demand_level(new_demand)

                # Estimate the reward (profit) under the current action
                reward = self.calculate_reward(new_price)

                # Save the reward
                episode_reward += reward

                # Bellman equation for the Q values
                self.q_values[state, action] = (self.q_values[state, action] +
                    self.learning_rate *
                    (reward + self.discount_factor * np.max(self.q_values[new_demand_level]) -
                     self.q_values[state, action]))

                # Update price and demand for the next iteration
                self.current_price = new_price
                self.current_demand = new_demand
                self.current_demand_level = new_demand_level

                # Update the step counter
                step += 1

                # Exit the loop if the max number of steps was reached
                # or if the reward increased more than a certain threshold
                if step >= max_steps_per_episode or episode_reward >= self.target_reward_increase:
                    done = True

            # Save the training results for plotting
            self.episode_rewards.append(episode_reward)

        # Acknowledge the accomplishment of the training procedure
        self.isfit = True

        print("Training completed.")

    def get_q_table(self) -> np.ndarray:
        '''Return the Q table'''
        return self.q_values

    def plot_rewards(self, width=1200, height=800) -> None:
        '''Plot the cumulative rewards per episode using Plotly.

        Args:
            - width: width of the plot
            - height: height of the plot
        '''

        # Plot rewards per episode
        fig = px.line(
            self.episode_rewards, 
            title = "Rewards per episode <br><sup>Profit</sup>",
            labels = dict(index="Episodes", value="Rewards"),
            template = "plotly_dark",
            width = width, 
            height = height)

        # Style colors, font family and size 
        fig.update_xaxes(
            title_font = dict(size=32, family="Arial"))
        fig.update_yaxes(
            title_font = dict(size=32, family="Arial"))
        fig.update_layout(
            showlegend = False,
            title = dict(font=dict(size=30)),
            title_font_color = "yellow")
        fig.update_traces(
            line_color = "cyan", 
            line_width = 5)

        # Show the plot
        fig.show()

    def predict(self, 
                input_price: int, 
                input_demand: int) -> Union[int, str]:
        '''Predict the next price given an input price and demand.

        Args:
            - input_price: input price of the product
            - input_demand: input demand of the product
        '''
        # If the model was fit
        if self.isfit:

            # State equals the current demand level 
            state = self.calculate_demand_level(input_demand)

            # Identify the most profitable action from the Q values
            action = np.argmax(self.q_values[state])

            # The next price is given by the most profitable action
            prediction = input_price + action - 1

            # Return the predicted price
            return prediction

        # If the model was not fit
        else:
            return "Fit the model before asking a prediction for the next price."

Let us instantiate and fit the agent:

# For reproducibility
np.random.seed(62)

# Instantiate the agent class
pricing_agent = DynamicPricingQL(
    initial_price = 1000,
    initial_demand = 1000000,
    elasticity = -0.02,
    cost_per_unit = 20)

# Fit the agent
pricing_agent.fit(num_episodes=1000)
Training completed.

It is possible to get the Q-Table after training:

pricing_agent.get_q_table()
array([[0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [7.92000766e+09, 8.01708509e+09, 7.98798684e+09],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])
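
Each row of the table corresponds to a demand level (low, medium, high) and each column to an action. Since fit maps the action index to a price change of action - 1, the columns read as decrease, keep and increase. As an illustrative helper (not part of the class), the greedy action per demand level can be extracted as follows:

action_labels = {0: "decrease", 1: "keep", 2: "increase"}
greedy_policy = {
    level: action_labels[int(np.argmax(row))]
    for level, row in zip(["low", "medium", "high"], pricing_agent.get_q_table())
}
print(greedy_policy)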

We can also plot the rewards:

pricing_agent.plot_rewards()
Output of the code snippet. Image by author.

We observe how the rewards increase during the training procedure, as the agent learns, through Q-values, the pricing policy leading to a profit increase.

We can use the Q-values to predict the next price through the trained agent:

input_price = 500
input_demand = 10000

next_price = pricing_agent.predict(input_price, input_demand)
print(f"Next Price: {next_price}")
Next Price: 499

4. Conclusions

In this post, we explored the key concepts of Reinforcement Learning and introduced the Q-Learning method for training a smart agent. We also provided a hands-on Python example built from scratch. In particular, we implemented a dynamic pricing agent that learns the optimal pricing policy for a product in order to maximize profit.

Our example is intentionally simplified: the aim was to share a functional, end-to-end illustration built from the ground up. For a real-world application, we should consider the following:

  1. Q-learning requires a discrete action space, which means continuous actions must be discretized into a finite set of values. Therefore, we converted price manipulation into a discrete set of actions A = {+1, -1, 0}. In reality, pricing decisions may be more complex and continuous.
  2. States should capture relevant information about the environment that helps the agent make decisions. Although discrete demand levels provide a simple and intuitive state representation, our choice may prove limiting in a real-world application. Instead, the state should embed any feature relevant to the environment (business scenario). For example, in a study on dynamic pricing for e-commerce platforms, Liu et al. (2021) proposed a state representation made of four feature categories:

    • price features
    • sales features
    • customer traffic features
    • competitiveness features.

5. References

