Deep Reinforcement Learning hands-on for Optimized Ad Placement

NandaKishore Joshi
Towards Data Science
Oct 28, 2021

A deep reinforcement learning technique is used in this article to optimize ad placement on a website, maximizing the probability of user clicks and increasing digital marketing revenue. A detailed case study with code is presented to help readers implement the solution on any real-world example.

Affiliate marketing and pay-per-click are two important streams of digital marketing. An optimized implementation of these techniques can greatly increase a company's product or service sales and bring in substantial revenue for the marketer. With the progress in deep reinforcement learning, digital marketing is one of the fields that stands to benefit greatly.

Traditional methods for fine-tuning digital marketing campaigns require a good amount of historical data, which costs both time and resources. Reinforcement learning saves both, since it needs no historical data or prior information about the campaign. In this article we will see how a simple deep RL technique can optimize a fairly complex digital marketing campaign and achieve nearly perfect results.

We will work through a near-real case study showing how reinforcement learning can help us manage ad placement for maximum benefit.

Problem Statement

We manage 10 e-commerce websites, each focused on selling a different category of items such as computers, jewelry, or chocolates. Our aim is to increase product sales by referring customers who shop on one of our sites to another site they might be interested in. When a customer checks out on one of our websites, we display an advertisement for another site, hoping they will buy that product as well. The problem is that we don't know which site the customer should be referred to, and we have no information about the customer's preferences.

Let's bring in reinforcement learning to solve the problem!

Fig 1: Illustration of the basic concept behind RL

In general, reinforcement learning is a technique where we train an agent to operate in an environment. The agent takes an action 'a' in state 's' and receives a reward 'r' for that action from the environment, so (s, a, r) becomes a state-action-reward tuple. The objective of training is to maximize the total reward obtained by the agent, which means finding, for each state, the action that yields the highest reward. To estimate these rewards we run numerous episodes and recompute the reward estimates each time.

In this ad placement problem, we need to try different actions and automatically learn the most rewarding one for a given situation, state, or context. This is called a contextual bandit framework: the state provides the contextual information, and the agent finds the best action for the current context.

Say we have 10 websites to manage, which gives us 10 different states, and the customer is on one of the sites. Since we have 10 different product categories, we can display any of those 10 product ads to the customer, so there are 10 possible actions in each state. This results in 100 different state-action-reward combinations: 100 data points to store and recompute every time we observe a new reward. That might seem reasonable in this example, but what if we had 1,000 websites to manage? That would mean 1,000,000 data points, and storing and recomputing them would take significant time and resources.
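To make the scaling issue concrete, here is a minimal sketch (not from the original project) of what the purely tabular approach would look like: one running-average reward per state-action pair, all of which must be stored and updated after every observation.

import numpy as np

n_sites = 10   # states: which of our sites the customer is on
n_ads = 10     # actions: which ad we could show

# One running-average reward per (state, action) pair: 10 x 10 = 100 cells.
# With 1,000 sites and 1,000 ads this becomes 1,000,000 cells to store and update.
reward_table = np.zeros((n_sites, n_ads))
counts = np.zeros((n_sites, n_ads))

def update(state, action, reward):
    # Incrementally re-average the reward for one state-action pair.
    counts[state, action] += 1
    reward_table[state, action] += (reward - reward_table[state, action]) / counts[state, action]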

Does that mean reinforcement learning fails when the state and action spaces are large (i.e., when the total number of states and actions is large)?

That is where deep reinforcement learning comes into the picture. Instead of storing every state-action-reward tuple, we use a neural network to approximate the reward values for each state and action. Neural networks are great at learning abstractions: they learn the patterns and regularities in data and compress huge amounts of information into their weights. Hence, a neural network can learn the complex relationships between states, actions, and rewards.

The neural network acts as the agent, learning from the environment to maximize rewards. In this article we will build a neural network using PyTorch and train it to optimize ad placement for maximum reward.

Let's start coding!

Let's first create a simulated environment for the contextual bandit. The environment should include 10 states representing the 10 websites (0 to 9), a method to generate a reward on an ad click, and a method to choose an action (which of the 10 ads to show).

import numpy as np
import random

class ContextBandit:
    def __init__(self, arms=10):
        self.arms = arms
        self.init_distribution(arms)
        self.update_state()

    def init_distribution(self, arms):  #1
        self.bandit_matrix = np.random.rand(arms, arms)

    def reward(self, prob):
        reward = 0
        for i in range(self.arms):
            if random.random() < prob:
                reward += 1
        return reward

    def get_state(self):
        return self.state

    def update_state(self):
        self.state = np.random.randint(0, self.arms)

    def get_reward(self, arm):
        return self.reward(self.bandit_matrix[self.get_state()][arm])

    def choose_arm(self, arm):  #2
        reward = self.get_reward(arm)
        self.update_state()
        return reward

1 A matrix representing the reward distributions: rows are states and columns are arms (actions).

2 Choosing an arm (action) returns a reward and updates the state.

The code below shows how to use the environment:

env = ContextBandit(arms=10)
state = env.get_state()
reward = env.choose_arm(1)
print(state)
>>> 1
print(reward)
>>> 7

The environment consists of a class called ContextBandit that is initialized with the number of arms (actions). In this example the number of states equals the number of actions, but this may differ in real life. The class has a get_state() method that returns a random state drawn from a uniform distribution; in real-world examples the state could come from a much more complex, business-related distribution. Calling choose_arm() with an action (arm) as input simulates placing the ad: it returns a reward for the action and updates the current state. We always call get_state() and then choose_arm() to continuously get new data.
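As a quick illustration (a sketch, not part of the original code), the interaction loop just described, with a random placeholder policy, looks like this:

# A minimal interaction loop with the env created above (random policy, for illustration only).
for _ in range(5):
    state = env.get_state()          # which site the customer is currently on
    action = np.random.randint(10)   # placeholder policy: show a random ad
    reward = env.choose_arm(action)  # simulate showing the ad and observing clicks
    print(state, action, reward)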

Alongside ContextBandit, we also use a couple of helper functions: a one-hot encoder and a softmax. The one-hot encoder returns a vector that is all zeros except for a single 1 marking the current state. The softmax function turns the predicted rewards for the various actions in a state into a probability distribution; with n states we get n different softmax distributions over actions. We need to learn the relationship between each state and its action distribution, and select the action with the highest probability for the given state. The code for both functions is below.

def one_hot(N, pos, val=1):  # N: number of actions, pos: state index
    one_hot_vec = np.zeros(N)
    one_hot_vec[pos] = val
    return one_hot_vec

def softmax(av, tau=1.12):
    softm = np.exp(av / tau) / np.sum(np.exp(av / tau))
    return softm
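As a quick, hypothetical usage example of these helpers (the numbers are made up for illustration):

state_vec = one_hot(10, 3)           # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0] -> marks state 3
print(state_vec)

avals = np.array([1.0, 2.0, 5.0, 0.5, 3.0, 1.5, 4.0, 2.5, 0.2, 1.2])  # illustrative reward predictions
probs = softmax(avals, tau=1.12)
print(probs.round(3), probs.sum())   # a probability distribution over the 10 actions, summing to 1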

Now let's create a two-layer feed-forward neural network with ReLU activations, which will act as the agent. The first layer accepts the 10-element one-hot encoded state vector, and the final layer outputs a 10-element vector representing the predicted reward for each action.

Fig 2: Computational Graph

From Fig 2 we can see that the get_state() function returns a random state value, which is converted to a 10-element vector by the one-hot encoder. This vector is fed as input to the neural network. The output of the network is a 10-element dense vector representing the predicted reward of each action for the given state; it is converted to probabilities using the softmax function, and an action is sampled from this distribution. Once the action is chosen, choose_arm() returns the reward and updates the environment with a new state.

Initially, the untrained network might produce an output vector such as [1.4, 50, 4.3, 0.31, 0.43, 11, 121, 90, 8.9, 1.1] for state 0. After running softmax and sampling, action 6 will most likely be chosen (it has the highest predicted reward). Suppose choosing action 6 via choose_arm() produces a reward of 8. We then train the network toward the target vector [1.4, 50, 4.3, 0.31, 0.43, 11, 8, 90, 8.9, 1.1], since 8 is the actual reward observed. The next time the network sees state 0, it will predict a reward close to 8 for action 6. As we train the model continually over many states and actions, the network learns to predict increasingly accurate rewards for the various state-action pairs.
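Here is a small sketch of the target construction just described, using the illustrative numbers above (the full training loop follows below):

# Sketch of the target construction described above (numbers are illustrative).
y_pred = np.array([1.4, 50, 4.3, 0.31, 0.43, 11, 121, 90, 8.9, 1.1])  # network output for state 0
choice, observed_reward = 6, 8                                         # action taken and reward received

target = y_pred.copy()
target[choice] = observed_reward   # only the chosen action's entry is corrected
# MSE between y_pred and target penalizes the network only for the action it actually tried.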

Below is the code to create the neural network and initialize the environment:

import torch

arms = 10
N, D_in, H, D_out = 1, arms, 100, arms

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
    torch.nn.ReLU(),
)
loss_fn = torch.nn.MSELoss()
env = ContextBandit(arms)

Now let's see how we train the agent, following all the steps explained in Fig 2:

def train(env, epochs=5000, learning_rate=1e-2):
    cur_state = torch.Tensor(one_hot(arms, env.get_state()))  #1
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    rewards = []
    for i in range(epochs):
        y_pred = model(cur_state)  #2
        av_softmax = softmax(y_pred.data.numpy(), tau=2.0)  #3
        av_softmax /= av_softmax.sum()  #4
        choice = np.random.choice(arms, p=av_softmax)  #5
        cur_reward = env.choose_arm(choice)  #6
        one_hot_reward = y_pred.data.numpy().copy()  #7
        one_hot_reward[choice] = cur_reward  #8
        reward = torch.Tensor(one_hot_reward)
        rewards.append(cur_reward)
        loss = loss_fn(y_pred, reward)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        cur_state = torch.Tensor(one_hot(arms, env.get_state()))  #9
    return np.array(rewards)
  • 1 Gets current state of the environment; converts to PyTorch variable
  • 2 Runs neural net forward to get reward predictions
  • 3 Converts reward predictions to probability distribution with softmax
  • 4 Normalizes distribution to make sure it sums to 1
  • 5 Chooses new action probabilistically
  • 6 Takes action, receives reward
  • 7 Converts PyTorch tensor data to Numpy array
  • 8 Updates one_hot_reward array to use as labeled training data
  • 9 Updates current environment state

After training the network for about 5,000 epochs, we can see the average reward improving, as shown below.

Fig 3: Average reward after training

We can see that the average reward climbs to around 8 or above.
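A rough sketch of how such a plot can be produced, assuming matplotlib is available (the moving-average window of 500 is an arbitrary choice):

import matplotlib.pyplot as plt

rewards = train(env)   # run the training loop defined above
window = 500
running_mean = np.convolve(rewards, np.ones(window) / window, mode='valid')

plt.plot(running_mean)
plt.xlabel('Epochs')
plt.ylabel('Average reward (moving window of 500)')
plt.show()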

The entire project can be found at this GitHub link.

This article is based on the book Deep Reinforcement Learning in Action by Brandon Brown and Alexander Zai. A link to the book is here.

Please find links to my other articles on various data science topics below.

Feedback is most welcome. You can connect with me on LinkedIn.
