Do you want to train a simplified self-driving car with Reinforcement Learning?

Just try our new LongiControl Environment

Roman Liessner
Towards Data Science


Photo by NeONBRAND on Unsplash

Here I present research with Jan Dohmen and Christoph Friebel.

Motivation

Recent years have seen a surge of successful applications of Reinforcement Learning (RL) [1] to challenging games and smaller domain problems [2][3][4]. These successes have been achieved in part due to the strong collaborative effort of the RL community to work on common, open-sourced environment simulators such as OpenAI’s Gym [5], which allow for expedited development and valid comparisons between different state-of-the-art strategies.

However, many existing environments model games rather than real-world problems, and only recent publications have initiated the transition to application-oriented RL [6][7]. In this contribution, we aim to bridge real-world-motivated RL and easy accessibility within a highly relevant problem: the longitudinal control of an autonomous vehicle. Autonomous driving is the future, but until autonomous vehicles can find their way through the stochastic real world on their own, there are still numerous problems to solve.

Reinforcement Learning

Before we take a closer look at the LongiControl environment, we briefly describe the basics of RL below. If you are already familiar with RL, feel free to jump to the section Longitudinal Control.

Reinforcement Learning (RL) is a direct approach to learning from interaction with an environment in order to achieve a defined goal. The learner and decision maker is referred to as the agent, whereas the part it interacts with is called the environment. The interaction proceeds step by step: at each time step t the agent selects an action, and the environment responds to it by presenting a new situation in the form of a state Sₜ₊₁ together with a reward Rₜ₊₁, a numerical scalar value. The agent seeks to maximize the rewards it receives over time [1].

Fig. 1: Reinforcement Learning interaction [1]
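In code, this loop maps directly onto the Gym interface. Here is a minimal sketch; the environment id and the random stand-in policy are assumptions (check the project repository for the actual registered name):

```python
import gym

env = gym.make("LongiControl-v0")   # hypothetical id, see the repository for the registered name

state = env.reset()
done = False
episode_return = 0.0

while not done:
    action = env.action_space.sample()             # a random policy stands in for the agent
    state, reward, done, info = env.step(action)   # environment returns the next state and reward
    episode_return += reward

print("Return of the episode:", episode_return)
```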

Having introduced the idea of RL, a brief explanation of some key terms follows. For a detailed introduction, please refer to [1].

Policy. The policy characterizes the agent’s behaviour. More formally, it is a mapping from states to actions.

Goals and Rewards. In Reinforcement Learning, the agent’s goal is formalized as a special signal called the reward, which the environment passes to the agent at each time step. The agent’s objective is to maximize the total amount of scalar reward it receives: not the immediate reward, but the cumulative reward in the long run, which is also called the return.

Exploration vs. Exploitation. A major challenge in Reinforcement Learning is the balance between exploration and exploitation. To receive high rewards, the agent has to choose actions that have proven particularly rewarding in the past; yet to discover such actions in the first place, it has to try new ones. The agent therefore has to exploit the knowledge it has already acquired in order to obtain reward, and at the same time explore other actions in order to find a better strategy for the future [1].

Q-Learning. Many popular Reinforcement Learning algorithms are based on directly learning Q-values. One of the simplest is Q-Learning. Its update rule is:

Q(s, a) ← Q(s, a) + α · [r + γ · maxₐ′ Q(s′, a′) − Q(s, a)]

Q(s, a) corresponds to the expected future return when choosing action a in state s and following the policy from then on. The reward of an interaction is denoted as r. The adaptation of the Q-function is controlled by the learning rate α and the discount factor γ. The policy is implied in the Q-values:

π(s) = argmaxₐ Q(s, a)
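For a discretized toy version of the problem, this update fits in a few lines; the state and action counts below are purely illustrative:

```python
import numpy as np

n_states, n_actions = 100, 5           # illustrative discretization
alpha, gamma = 0.1, 0.99               # learning rate and discount factor
Q = np.zeros((n_states, n_actions))

def q_learning_update(s, a, r, s_next):
    """One tabular Q-Learning step: move Q(s, a) towards the bootstrapped target."""
    td_target = r + gamma * np.max(Q[s_next])      # r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (td_target - Q[s, a])

def greedy_policy(s):
    """The policy implied by the Q-values."""
    return int(np.argmax(Q[s]))
```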

Deep Deterministic Policy Gradient (DDPG). Finding the optimal action requires an efficient evaluation of the Q-function. While this is simple for discrete and small action spaces (all actions are evaluated and the one with the highest value is selected), the problem becomes intractable if the action space is continuous. However, in many applications, such as robotics and energy management, discretization is not desirable, as it degrades the quality of the solution and, for a fine discretization, requires large amounts of memory and computing power. Lillicrap et al. [8] presented an algorithm called DDPG, which is able to solve continuous problems with Deep Reinforcement Learning. In contrast to Q-Learning, an actor-critic architecture is used. A detailed description can be found in [8].
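To give an impression of the actor-critic idea, here is a compressed sketch of the two DDPG update steps. It assumes PyTorch and a batch of transitions as float tensors; the network sizes are illustrative, and target networks as well as the replay buffer are omitted for brevity, so this is not a faithful reimplementation of [8].

```python
import torch
import torch.nn as nn

# Illustrative dimensions: LongiControl has a 5-dimensional state and a 1-dimensional action.
STATE_DIM, ACTION_DIM = 5, 1

actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, ACTION_DIM), nn.Tanh())   # deterministic policy mu(s) in [-1, 1]
critic = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                       nn.Linear(64, 1))                      # action-value function Q(s, a)

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

def ddpg_update(s, a, r, s_next, done):
    """One simplified DDPG step on a batch of transitions.
    s, a, s_next: (batch, dim) tensors; r, done: (batch, 1) tensors."""
    # Critic: regress Q(s, a) onto the bootstrapped target r + gamma * Q(s', mu(s')).
    with torch.no_grad():
        q_next = critic(torch.cat([s_next, actor(s_next)], dim=-1))
        target = r + gamma * (1.0 - done) * q_next
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=-1)), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: follow the critic uphill, i.e. minimize -Q(s, mu(s)).
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```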

Longitudinal Control

In the longitudinal control domain, the aim is for a vehicle to complete a single-lane route in a given time as energy-efficiently as possible without causing accidents. This corresponds to minimizing the total energy E used in the interval from t₀ to T as a function of the power P:

E = ∫ₜ₀ᵀ P(t) dt → min
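Numerically, this objective is simply the power trace integrated over the trip. A minimal sketch, assuming the power has been sampled at a fixed time step:

```python
import numpy as np

def total_energy(power, dt=0.1):
    """Approximate E = integral of P(t) dt over the trip (rectangle rule).
    power: sampled electrical power in W, dt: sampling step in s; returns Joules."""
    return float(np.sum(power) * dt)

# Example: 10 s of constant 5 kW consumes 50 kJ.
print(total_energy(np.full(100, 5000.0), dt=0.1))  # 50000.0
```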

According to external requirements, such as other road users or speed limits, the following boundary conditions must be met at the same time:

vₗᵢₘ,ₘᵢₙ ≤ v ≤ vₗᵢₘ,ₘₐₓ
aₗᵢₘ,ₘᵢₙ ≤ a ≤ aₗᵢₘ,ₘₐₓ
ȧₗᵢₘ,ₘᵢₙ ≤ ȧ ≤ ȧₗᵢₘ,ₘₐₓ

where v is the velocity, a is the acceleration and ȧ (a_dot) is the jerk, with (·)ₗᵢₘ,ₘᵢₙ and (·)ₗᵢₘ,ₘₐₓ representing the lower and upper limits respectively.
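Translated into code, these boundary conditions are just a per-time-step feasibility check; the limit values below are made up for illustration:

```python
def within_limits(v, a, jerk,
                  v_lim=(0.0, 27.8),      # m/s, e.g. 0 .. 100 km/h (illustrative)
                  a_lim=(-3.0, 2.5),      # m/s^2 (illustrative)
                  jerk_lim=(-2.0, 2.0)):  # m/s^3 (illustrative)
    """Check velocity, acceleration and jerk against their lower/upper limits."""
    return (v_lim[0] <= v <= v_lim[1]
            and a_lim[0] <= a <= a_lim[1]
            and jerk_lim[0] <= jerk <= jerk_lim[1])
```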

Planning approaches

At this point, the question may arise how such problems are usually solved. One possibility is planning approaches. For these, it is assumed that the route is fully known, that there are no other road users, and that it is also known how the driver will use the auxiliary consumers. An exemplary solution is shown in the picture below: dynamic programming computes the most energy-efficient velocity trajectory between two speed limits for a known route.

Fig. 2: Exemplary solution of a given route using dynamic programming [9]
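For intuition, here is a toy sketch of such a plan. It is not the method used in [9]: the route is discretized into segments with one speed limit each, the velocity is discretized as well, and a backward dynamic program picks one velocity per segment minimizing a crude, made-up cost that combines acceleration effort and travel time.

```python
import numpy as np

def plan_velocity_profile(limits, dx=10.0, dv=1.0):
    """Toy backward dynamic program: choose one velocity per dx-long segment so that
    a rough cost is minimal while every local speed limit is respected."""
    v_grid = np.arange(0.0, max(limits) + dv, dv)
    n_seg, n_v = len(limits), len(v_grid)
    cost_to_go = np.zeros(n_v)                      # no terminal cost: the final velocity is free
    best_next = np.zeros((n_seg, n_v), dtype=int)

    for i in range(n_seg - 1, -1, -1):              # backward over the segments
        new_cost = np.full(n_v, np.inf)
        for j, v in enumerate(v_grid):              # velocity when entering segment i
            for k, v_next in enumerate(v_grid):     # velocity chosen for segment i
                if v_next > limits[i]:
                    continue                        # respect the speed limit of segment i
                accel = (v_next ** 2 - v ** 2) / (2.0 * dx)
                step_cost = abs(accel) * dx + dx / (v_next + 1.0)  # crude energy + time proxy
                total = step_cost + cost_to_go[k]
                if total < new_cost[j]:
                    new_cost[j], best_next[i, j] = total, k
        cost_to_go = new_cost

    # Forward pass: reconstruct the velocity profile starting from standstill (v = 0).
    profile, j = [], 0
    for i in range(n_seg):
        j = best_next[i, j]
        profile.append(float(v_grid[j]))
    return profile

# Speed limits per 10 m segment in m/s (roughly 50, 50, 30, 30, 50 km/h).
print(plan_velocity_profile([13.9, 13.9, 8.3, 8.3, 13.9]))
```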

You are probably wondering which approach to take if the route is not known with certainty, and especially if there are other road users ahead of you. While speed limits are mostly known in advance, the behaviour of other road users is highly stochastic and usually cannot be foreseen. Thus a different approach is necessary. As Reinforcement Learning is capable of handling stochastic problems, it is a promising candidate for this issue. And since it would be too dangerous and inefficient to train an autonomous vehicle directly in real traffic, simulation offers a solution: there, algorithms can be developed and tested safely. This brings us to our new RL environment, which is presented in the following.

LongiControl Environment

As we have seen before, the RL setup consists of two parts: the agent and the environment. Here we take a closer look at the environment. It is independent of the agent, which means that you can plug in and test any agent algorithm.

Fig. 3: LongiControl visualization

The environment consists of two parts: the vehicle and the driving environment. First, the focus is on the vehicle.

The movement of the vehicle is modelled in a simplified way as uniformly accelerated motion within each time step. The simulation is based on a time discretization of Δt = 0.1 s. The current velocity vₜ and position xₜ are calculated as follows:

vₜ = vₜ₋₁ + aₜ · Δt
xₜ = xₜ₋₁ + vₜ₋₁ · Δt + ½ · aₜ · Δt²
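In code, one simulation step of this motion model is a direct transcription of the two equations above:

```python
DT = 0.1  # s, simulation time step

def kinematics_step(x, v, a, dt=DT):
    """Uniformly accelerated motion over one time step: returns the new position and velocity."""
    x_next = x + v * dt + 0.5 * a * dt ** 2
    v_next = v + a * dt
    return x_next, v_next
```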

The acceleration of the vehicle results from the current vehicle state and the engine power chosen by the agent, which is therefore the action in the environment. Since only longitudinal control is considered, the track can be modelled as a single lane. Therefore, one-dimensional velocities vₜ and positions xₜ are sufficient at this point.

To determine the vehicle’s energy consumption, we created a black-box model from measurement data of a real electric vehicle. The current speed and acceleration are the input variables and the energy consumption is the output variable.
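The actual black-box model is fitted to measurement data and ships with the environment. As a stand-in to illustrate the interface (speed and acceleration in, consumption out), here is a crude longitudinal-dynamics surrogate with made-up parameters:

```python
def electrical_power(v, a, mass=1600.0, c_rr=0.01, c_d=0.29, area=2.3,
                     rho=1.2, eta=0.9, g=9.81):
    """Illustrative surrogate for the consumption model (all parameters are made up):
    maps speed (m/s) and acceleration (m/s^2) to electrical power (W).
    Multiplying by the 0.1 s time step gives the energy used in that step."""
    f_roll = c_rr * mass * g                   # rolling resistance
    f_aero = 0.5 * rho * c_d * area * v ** 2   # aerodynamic drag
    f_inertia = mass * a                       # force required for the acceleration
    p_mech = (f_roll + f_aero + f_inertia) * v
    return p_mech / eta if p_mech >= 0 else p_mech * eta  # drivetrain losses / recuperation
```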

Having introduced the vehicle model, the next topic is the landscape in which the vehicle is driving. Fig. 3 shows an example of the track implementation within the simulation. The driving environment is modelled in such a way that the route can be arbitrarily long and that arbitrarily positioned speed limits specify an arbitrary permissible velocity. This can be considered a stand-in for stochastically behaving traffic. Up to 150 m in advance, the vehicle receives information about upcoming speed limits, so that anticipatory driving is possible in principle. The result is an environment for a continuous control problem. The individual components of the state are listed below.

State s

The state consists of five parts:

  • Velocity
  • Previous acceleration
  • Current speed limit
  • Future speed limit
  • Future speed limit distance

The current speed and the speed limits are intuitive. Keeping the previous acceleration in the state may not be immediately obvious: it is needed to calculate the jerk, the quantity that describes how smoothly the vehicle accelerates. A sketch of how these five components could form an observation vector is shown below.
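This is a minimal sketch of stacking the five components; the normalization constants are assumptions, not the environment’s actual scaling:

```python
import numpy as np

def build_state(v, a_prev, limit_now, limit_next, dist_next,
                preview=150.0, v_max=30.0, a_max=3.0):
    """Illustrative assembly of the five state components (normalization is an assumption)."""
    return np.array([
        v / v_max,                          # current velocity
        a_prev / a_max,                     # previous acceleration (needed to compute the jerk)
        limit_now / v_max,                  # current speed limit
        limit_next / v_max,                 # upcoming speed limit
        min(dist_next, preview) / preview,  # distance to it, clipped to the 150 m preview
    ], dtype=np.float32)
```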

Action a

The agent selects an action in the range [-1, 1], which is mapped onto a state-dependent minimum and maximum acceleration of the vehicle. This modelling ensures that the agent can only ever select valid actions, for example as sketched below.
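A sketch of this mapping; a_min and a_max stand for the state-dependent acceleration bounds computed inside the environment:

```python
def action_to_acceleration(action, a_min, a_max):
    """Map an agent action from [-1, 1] linearly onto the state-dependent feasible
    acceleration range [a_min, a_max], so that only valid accelerations can occur."""
    action = max(-1.0, min(1.0, action))   # clip for safety
    return a_min + 0.5 * (action + 1.0) * (a_max - a_min)
```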

Reward r

A reward function defines the feedback the agent receives for each action and is the only way to control the agent’s behaviour. It is one of the most important and challenging components of an RL environment. It is particularly challenging here, because the goal cannot simply be expressed by a single quantity. If only the energy consumption were rewarded (or rather punished), the vehicle would simply stand still: the agent would learn that, from the point of view of energy consumption, it is most efficient not to drive at all. Although this is true, and we should all use our bicycles more often, we still want the agent to drive in our environment. So we need a reward that makes driving more appealing to the agent. Comparing different approaches, the difference between the current speed and the current speed limit has proven particularly suitable: by minimizing this difference, the agent automatically sets itself in motion. In order to still take energy consumption into account, a reward component for energy consumption is kept. A third reward component penalizes the jerk, because our autonomous vehicle should also drive comfortably. Finally, to penalize violations of the speed limits, a fourth reward component is added.

Since RL is designed for a scalar reward, it is necessary to weight these four components. A suitable weighting is not trivial and poses a great challenge. To make it easier for you to get started, we have preconfigured a reasonable weighting. Later in this article we show some examples of the effects of different weightings; however, you are welcome to explore a better weighting on your own. A sketch of such a scalarization follows.
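This sketch combines the four components named above into one scalar; the exact formulas inside the environment may differ, and the default weights here simply mirror the [1, 0.5, 1, 1] weighting shown later:

```python
def scalar_reward(v, v_limit, energy, jerk, weights=(1.0, 0.5, 1.0, 1.0)):
    """Weighted sum of the four reward components (signs and formulas are illustrative):
    speed tracking, energy consumption, jerk and speed-limit violation."""
    w_speed, w_energy, w_jerk, w_violation = weights
    r_speed = -abs(v - v_limit)                  # move, but stay close to the allowed speed
    r_energy = -energy                           # punish energy consumption
    r_jerk = -abs(jerk)                          # punish uncomfortable changes of acceleration
    r_violation = -1.0 if v > v_limit else 0.0   # punish exceeding the speed limit
    return (w_speed * r_speed + w_energy * r_energy
            + w_jerk * r_jerk + w_violation * r_violation)
```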

In order to evaluate the effect of different weightings at all, a functioning RL learning process is necessary. Let us consider this next.

Exemplary Learning Process

Fig. 4: Beginning of the learning process

In the first example we see the actions of the agent at the beginning of training. And yes, you see correctly: the agent does not move. So we let it train a bit:

Fig. 5: After some learning progress

After some progress it starts to drive, but ignores the speed limits. This is not desirable, so we let it train even longer:

Fig. 6: After a longer training procedure

By letting the agent train even longer, it also starts to respect the speed limits. The curve is not yet perfect; if you feel like it, just try out what it could look like after an even longer learning process.

To sum up: the agent learns to drive comfortably and even observes the speed limits. It is remarkable that it is able to make correct use of the upcoming speed limits. This was not explicitly programmed; the agent learned it on its own. Impressive.

Multi-Objective Optimization

As mentioned before, this problem has several interdependent objectives, so multi-objective investigations can also be carried out. For a better understanding, here are three examples.

Reward Example 1. If only the movement reward (the deviation from the allowed speed) is applied, the agent violates the speed limits.

Fig. 7: Reward weighting [1,0,0,0]

Reward Example 2. In the second example, the penalty for exceeding the speed limit is added. As a result, the agent actually respects the limits.

Fig. 8: Reward weighting [1,0,0,1]

Reward Example 3. In the third example we add the energy and jerk rewards. As a result, the agent drives more energy-efficiently and also chooses smoother accelerations.

Fig. 9: Reward weighting [1,0.5,1,1]

These are just small examples for illustration; the environment provides a great basis for investigating multi-objective algorithms.

Summary

At the end of this article we will summarize the most important points.

Through the proposed RL environment, which follows the OpenAI Gym standard, we show that it is easy to prototype and implement state-of-the-art RL algorithms. Besides, the LongiControl environment is suitable for various investigations. In addition to the comparison of RL algorithms and the evaluation of safety algorithms, investigations in the area of Multi-Objective Reinforcement Learning are also possible. Further possible research objectives are the comparison with planning algorithms for known routes, the investigation of the influence of model uncertainties, and the consideration of very long-term objectives such as arriving at a specific time.

LongiControl is designed to enable the community to leverage the latest strategies of Reinforcement Learning to address a real-world and high-impact problem in the field of autonomous driving.

Here are the GitHub link, the paper and my LinkedIn profile.

Enjoy using it😉

References

[1] R. Sutton and A. Barto, Reinforcement Learning: An Introduction (1998), MIT Press

[2] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra and M. A. Riedmiller, Playing Atari with Deep Reinforcement Learning (2013), CoRR

[3] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al., Mastering the game of Go without human knowledge (2017), Nature

[4] R. Liessner, C. Schroer, A. Dietermann and B. Bäker, Deep Reinforcement Learning for Advanced Energy Management of Hybrid Electric Vehicles (2018), ICAART

[5] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang and W. Zaremba, OpenAI Gym (2016), CoRR

[6] M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, et al., Learning Dexterous In-Hand Manipulation (2018), CoRR

[7] F. Richter, R. K. Orosco and M. C. Yip, Open-Sourced Reinforcement Learning Environments for Surgical Robotics (2019), CoRR

[8] T. P. Lillicrap et al., Continuous control with deep reinforcement learning (2015), CoRR

[9] S. Uebel, Eine im Hybridfahrzeug einsetzbare Energiemanagementstrategie mit effizienter Längsführung (2018), TU Dresden


Hi, I’m Roman, an Engineer with a PhD in Automotive Engineering and a passion for Artificial Intelligence.