Photo by Evgeny Tchebotarev on Unsplash

Training Autonomous Vehicles using Augmented Random Search in Carla

ARS learning with camera data from Carla Driving Simulator

Nate Cibik
Towards Data Science
23 min read · Dec 27, 2020


Carla is an open-source driving simulator with a Python API used for autonomous driving research. Built on Unreal Engine 4, it employs high-end graphics to provide a representation of the real world suitable for reinforcement learning with sensor/camera data. Recently, I put this platform to work, testing the ability of the Augmented Random Search (ARS) algorithm to train a self-driving car policy on the data gathered from a single front-facing camera per car. ARS is an exciting new algorithm for reinforcement learning (RL) that has been shown to achieve competitive results on benchmark MuJoCo continuous control locomotion tasks compared to more complex model-free methods, while offering at least 15x more computational efficiency.¹ This significant reduction in computational resource requirements makes it an attractive algorithm for small-scale autonomous vehicle research, and so it was chosen for this study. Code for a usable car environment for this task was derived from a Sentdex tutorial on using Carla for Deep Q-Learning and modified to fit the context of ARS learning. Then, code provided by the authors of the ARS study was altered to make use of this car environment, in order to test the efficacy of this efficient learning algorithm at training autonomous vehicles using camera data from Carla. This article reports the results of the first attempt at training an ARS agent in Carla using this framework. Although an effective policy was not achieved after the first round of training, many insights about how to improve these results in the future were obtained, which are discussed in detail in the conclusions. All of the code used in this study, as well as Jupyter notebooks reviewing the project research and how to run the code, can be found in the project repository. Unless otherwise noted, all images and media are property of the author.

Background

Building self-driving car policies is approached from a myriad of directions. As Lex Fridman summarizes in his 2020 Deep Learning State of the Art lecture, the leading research by Tesla and Waymo can be classified into two schools of thought: learning-based and map-based, respectively. For Waymo, sensory perception functions as a supplemental tool for safely making use of the underlying navigation system, whereas the learning-based systems of Tesla Autopilot constantly use sensory data to improve their policy predictions by building edge associations. This study is most akin to an exploration of the rudiments of the Tesla learning-based method, since GPS and lidar systems were not used; the only data used to train the policy came from a camera sensor.

When it comes to learning-based automation, Deep Q-Learning is one commonly studied method, and Sentdex provides a tutorial series on training a self-driving car policy with Deep Q-Learning and Carla. Deep Q-Learning has pitfalls, however, chief among them its computational complexity. To apply the algorithm to this task, Sentdex trained a neural network on every frame received from the camera and ran two neural networks in tandem. For someone without advanced hardware or paid access to cloud computing, this method is impractical. Carla alone presents a computational challenge even for decent hardware, which can make computationally complex learning algorithms prohibitive for journeyman researchers to use with the simulator. Further, Q-Learning in general is limited by the fact that the Q values it predicts correspond to a discrete action space. With multiple continuous-valued controls like throttle, brake, and steering, partitioning these control values into discrete actions at increasing resolution causes the size of the action space to explode. Creating a policy that can generate continuous control values seems more appropriate for tasks such as driving.

A new option on the menu of RL training algorithms was proposed in a 2018 paper by Mania, Guy, and Recht, and it may not yet have received the full attention it deserves. Augmented Random Search (ARS) trains a single-layer perceptron on input data by iteratively adding and subtracting sets of random noise to the weights, recording the total rewards produced by these modifications across separate episodes, then making a weighted adjustment to the weights based on those rewards, scaled by a predefined learning rate. The authors of the study used benchmark MuJoCo tasks to make the case for their proposed algorithm, showing that it was capable of achieving competitive to superior results compared to the top competing model-free methods on continuous control tasks at far less computational cost. Since it uses just a single-layer perceptron, there is only one layer of weights to train, and since adjustments are made randomly, there is no need to compute loss function gradients on each step. This makes ARS a very lightweight method for learning complex control tasks, and the authors of the study found that it offered at least 15x more computational efficiency than the fastest competing learning methods.²

The ARS Algorithm

To get a better understanding of how the ARS algorithm learns a task using random noise, we can break down the math piece by piece. Let’s start by looking at the formula from the whitepaper, marked up by yours truly in crayon for clarity. The following occurs at each update step:

Formula taken from Mania, Guy, & Recht (2018). Marked up by author.

To put this into words: a user-defined number of deltas (random noise arrays with the same shape as the weights) are generated for each update step, and each delta is applied in two separate episodes, one in which it is added to the weights (positive direction) and one in which it is subtracted from the weights (negative direction). The rewards produced in the positive and negative directions are stored with their corresponding delta. The deltas are then ranked by the maximum reward produced in either direction, and a user-defined number of the top-performing deltas (up to the total number of deltas) are used in the update step (# deltas used). Each of the used deltas is multiplied by the difference between its positive-direction and negative-direction rewards. These reward differences are standardized by dividing by the collective standard deviation of all positive and negative rewards recorded from the deltas used in this update step, which allows us to disregard the actual numerical scale of the reward system at hand. This effectively scales each delta by the magnitude of its impact on performance, and reverses its sign when the negative direction earned higher rewards than the positive direction. The results of these multiplications are then averaged by summing them and dividing by the number of deltas used. That average is multiplied by the learning rate and added to the weights, completing the update step.
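To make the arithmetic concrete, here is a minimal NumPy sketch of a single update step as described above. The variable names are mine, not the paper's, and the episode rollouts that generate the rewards are assumed to happen elsewhere:

```python
import numpy as np

def ars_update(weights, deltas, pos_rewards, neg_rewards, top_k, learning_rate):
    """One ARS update step, as described above.

    weights      : policy matrix (outputs x inputs)
    deltas       : list of noise arrays with the same shape as weights
    pos_rewards  : episode reward with each delta added to the weights
    neg_rewards  : episode reward with each delta subtracted from the weights
    top_k        : the '# deltas used' hyperparameter
    """
    # Rank deltas by the best reward seen in either direction; keep the top_k
    order = sorted(range(len(deltas)),
                   key=lambda i: max(pos_rewards[i], neg_rewards[i]),
                   reverse=True)[:top_k]

    # Standard deviation over all rewards belonging to the deltas actually used
    used_rewards = [pos_rewards[i] for i in order] + [neg_rewards[i] for i in order]
    sigma_r = np.std(used_rewards) + 1e-8   # small constant avoids division by zero

    # Sum each used delta scaled by its positive-minus-negative reward difference
    step = np.zeros_like(weights)
    for i in order:
        step += (pos_rewards[i] - neg_rewards[i]) * deltas[i]

    # Average over the deltas used, scale by the learning rate, and apply
    return weights + (learning_rate / (top_k * sigma_r)) * step
```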

Part of the genius of this algorithm is that it is not overly sensitive to the scale or sign of the reward system of the learning environment. Since the weights are adjusted based on differences between the rewards produced by positive and negative addition of each delta, it doesn’t matter whether the rewards themselves are positive or negative numbers. Further, the standard deviation across the rewards is included in the denominator of the equation, which normalizes the differences between these episode rewards to the scale that they are on, making the algorithm equally effective on reward systems using different scales.

The influence of each delta on the update step is scaled by the difference between the rewards produced by its positive and negative addition to the weights. Therefore, if a delta has very little impact on the rewards in either direction, or a similar impact in both, it does not find its way into the weights during the update step. If a delta produces higher rewards with the negative addition to the weights, then the difference of positive minus negative episode rewards will be negative, and thus the sign of the delta will be reversed before it is incorporated into the update step. This effectively makes each delta into two possible contributions to the weights, since it can be applied in either a positive or negative direction in the update step.

Methodology

Upon reading the ARS paper, I was immediately compelled to consider how this approach to reinforcement learning might apply to the field of autonomous navigation. Sentdex had provided a framework to gather RGB camera sensor data from Carla for training a Deep Q Network (DQN), and I saw an opportunity to use his car environment design to test the ARS algorithm from the 2018 study on this same task. I discovered ARS during a YouTube exploration into DQNs, when I came across an informative video series on reinforcement learning by Skowster the Geek (Colin Skow), which includes a short video about ARS because of its relevance to the topic. Captivated by the simplicity of the algorithm, I decided to look into it further, as I was beginning to realize that my hardware might not be capable of achieving meaningful results training a DQN in Carla, after seeing the lackluster results that Sentdex was able to achieve with far superior hardware over multiple days of training.

Skow provides a GitHub repository to accompany his video course on RL, with coding examples for all of the material covered, including a framework to run ARS learning on environments found in the Python gym module. First, I tested this code on the BipedalWalker environment to witness its efficacy for myself. Below, we can see the curve of episode rewards over training steps for this test.

We can see that for the first 600 training steps, the reward curve is basically flat, but eventually the random deltas start to build the policy in all the right places, creating a steep curve of improving rewards that levels out around the 800th episode, with occasional mistakes along the way that also disappear around this time. Below, we can see the resulting behavior of this trained policy on the BipedalWalker environment:

When I saw that the algorithm was able to achieve impressive results with computational ease over a reasonable number of episodes, I decided to see if I could splice Sentdex’s Carla vehicle environment into this framework. To improve the input data that the algorithm was trained on, I transformed the raw RGB camera data into something that could represent generalized edge information, by first passing the RGB camera frames through a pretrained Convolutional Neural Network (CNN) called VGG19 (available in the TensorFlow/Keras package) on their way into the ARS algorithm. For this study, the ‘imagenet’ weights of the VGG19 were used. This is an example of transfer learning, where we take advantage of the bottom layers of a neural network previously trained on massive amounts of image data in order to apply its generalized edge detection to a different problem. Since we don’t need to train these layers, we can simply use their prediction outputs as the input to our single-layer ARS perceptron.
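A minimal sketch of how such a frozen feature extractor can be set up with Keras is shown below. The preprocess_input call is my assumption about how frames might be prepared; the project code may handle this differently:

```python
import numpy as np
from tensorflow.keras.applications import VGG19
from tensorflow.keras.applications.vgg19 import preprocess_input

# Frozen convolutional base pretrained on ImageNet, with no classifier head
feature_extractor = VGG19(weights='imagenet', include_top=False,
                          input_shape=(224, 224, 3))
feature_extractor.trainable = False

def frame_to_features(rgb_frame):
    """Convert one 224x224x3 camera frame into a flat feature vector."""
    x = preprocess_input(rgb_frame.astype(np.float32)[np.newaxis, ...])
    features = feature_extractor.predict(x, verbose=0)   # shape (1, 7, 7, 512)
    return features.flatten()                            # shape (25088,)
```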

The inputs to ARS need to be normalized, and typically this is done by keeping running statistics of each input component, then using these statistics to apply a mean/standard-deviation filter (normalization) to future inputs. This allows the algorithm to build appropriate distributions for normalizing inputs over time as it experiences more and more states, without needing prior knowledge of those distributions. For this study, since the VGG19 prediction outputs always fell on a scale from 0 to 10, this method of normalizing inputs with running statistics was replaced with simply dividing the inputs by 10. Further research testing the standard filtration method in this context may be warranted, but it seems likely to cause an issue of treating unseen edge scenarios as extreme outliers when they occur after a considerable number of states have been observed. The scale of the inputs affects the scale of the weights as well, so adjustments to this part of the process will affect which learning rate and delta standard deviation are most appropriate to facilitate learning. This is one reason why normalization of the inputs is so important.

The Sentdex vehicle environment needed to be modified to use a continuous action space with continuous control values, and the reward system needed to be adjusted to work more effectively. When self-driving cars are penalized for collisions, they tend to learn to drive in circles to avoid them, so close attention was paid to finding ways to punish extreme or consistent directional steering and to reward moving in straight lines at speed. This took some careful consideration. Since collisions were being penalized, the rewards for speed and straight lines needed to be high enough to counterbalance the punishments for the abundance of collisions sure to be encountered when a car moves faster and turns less, so that these penalties would not discourage the agent from pursuing those goals. However, the collision penalty must still be high enough if the hope is that the agent will eventually learn not to hit things. It can logically be expected that the edge scenarios which must be learned to avoid collisions at speed will only be experienced once the workers move about more liberally; in other words, the agent must first learn to move before it can find patterns to avoid while moving. All that being said, without results comparing different reward/punishment systems, this took a bit of guesswork.
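The actual reward constants live in the ‘CarEnv’ class in the repository’s ars.py; the sketch below only illustrates the shape of the trade-off described above, with made-up numbers (steering and control clipping are covered after the next paragraph):

```python
def frame_reward(speed_kmh, collided):
    """Illustrative per-frame reward; every constant here is a placeholder."""
    if collided:
        return -100.0            # collisions end the episode with a penalty
    if speed_kmh < 1.0:
        return -1.0              # sitting stationary is penalized
    return 0.1 * speed_kmh       # moving at speed is rewarded
```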

The predictions for throttle, steering, and brake coming from the perceptron were clipped to fit their respective control ranges in Carla (0 to +1, -1 to +1, and 0 to +1), which meant that the predicted values themselves could fall outside of those ranges. For throttle and brake, this issue was ignored. Steering, however, is generally better done with nuance, so each frame was penalized by the absolute value of that frame’s steering control, averaged with the absolute value of the mean steering control across all frames of the current episode. This way the model would learn to prefer keeping the steering control close to zero, since extreme values would add up to a very negative score for an episode. Sitting stationary was also penalized, to make sure that the cars would not learn to avoid penalties by not moving.
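Roughly, the clipping and the steering penalty described above look like the following sketch (not the repository’s exact implementation):

```python
import numpy as np

def to_controls(raw_prediction):
    """Clip raw perceptron outputs into Carla's valid control ranges."""
    throttle = float(np.clip(raw_prediction[0], 0.0, 1.0))
    steer    = float(np.clip(raw_prediction[1], -1.0, 1.0))
    brake    = float(np.clip(raw_prediction[2], 0.0, 1.0))
    return throttle, steer, brake

def steering_penalty(current_steer, episode_steers):
    """Average of |this frame's steer| and |mean steer so far this episode|.

    episode_steers is a list of the steering values seen so far this episode.
    """
    mean_steer = np.mean(episode_steers) if episode_steers else 0.0
    return (abs(current_steer) + abs(mean_steer)) / 2.0
```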

The RGB camera was adjusted to take in an image resolution of 224x224 pixels, since this is the image size that the VGG19 CNN was trained on. Below, we can see an example of an image of the Carla world seen through one of these cameras.

Worker worldview.
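Setting such a camera up through Carla’s Python API looks roughly like the following; the vehicle blueprint, spawn point, and camera mounting position here are placeholders of my own:

```python
import numpy as np
import carla

client = carla.Client('localhost', 2000)
world = client.get_world()
blueprint_library = world.get_blueprint_library()

# Spawn a vehicle to carry the camera (placeholder blueprint and spawn point)
vehicle_bp = blueprint_library.filter('model3')[0]
spawn_point = world.get_map().get_spawn_points()[0]
vehicle = world.spawn_actor(vehicle_bp, spawn_point)

# Front-facing RGB camera sized to match the VGG19 input
camera_bp = blueprint_library.find('sensor.camera.rgb')
camera_bp.set_attribute('image_size_x', '224')
camera_bp.set_attribute('image_size_y', '224')
camera_bp.set_attribute('fov', '110')
camera_transform = carla.Transform(carla.Location(x=2.5, z=0.7))  # rough hood mount
camera = world.spawn_actor(camera_bp, camera_transform, attach_to=vehicle)

def on_frame(image):
    # Carla delivers a flat BGRA byte buffer; reshape and drop the alpha channel
    array = np.frombuffer(image.raw_data, dtype=np.uint8)
    array = array.reshape((image.height, image.width, 4))[:, :, :3]
    # ... hand `array` off to the policy here ...

camera.listen(on_frame)
```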

Now that we have an idea of what a worker takes in as input, we can consider this simplified visual depiction of the data flow through each worker on its way from the camera, through the policy, and to the controls:
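In code, that per-frame flow might look roughly like this, reusing the hypothetical frame_to_features and to_controls helpers sketched earlier:

```python
def policy_step(rgb_frame, weights):
    """One pass from a camera frame to Carla controls.

    weights is the (3, 25088) perceptron matrix: one row each for
    throttle, steering, and brake.
    """
    state = frame_to_features(rgb_frame) / 10.0   # normalize the VGG19 features
    raw = weights.dot(state)                      # single-layer perceptron
    return to_controls(raw)                       # clip into valid control ranges
```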

After I successfully spliced the Sentdex vehicle environment into Skow’s ARS framework, I realized that training would still be woefully time consuming. The ARS process has to perform a number of episodes (2x the number of deltas being tested) before it can make each adjustment to the policy, and in the context of training a Carla vehicle agent, each one of those episodes lasts some number of seconds (in this case 15). It would therefore be best to perform these episodes in parallel and pool the results from multiple workers in each update step. Luckily, the authors of the ARS paper provide a GitHub repository of their own, which includes a framework to reproduce their research training MuJoCo agents with ARS in parallel using a Ray server.
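The parallel rollout pattern, stripped down to its essentials, looks something like the sketch below; CarEnv and its run_episode method are stand-ins for the modified Sentdex environment, and the real ars.py in the repository is considerably more involved:

```python
import ray

ray.init()  # or ray.init(address='auto') to join a running cluster

@ray.remote
class RolloutWorker:
    """Each worker owns its own Carla vehicle environment."""
    def __init__(self):
        self.env = CarEnv()   # hypothetical: the modified Sentdex car environment

    def rollout(self, weights, delta, std, sign):
        perturbed = weights + sign * std * delta
        return self.env.run_episode(perturbed)   # hypothetical episode runner

workers = [RolloutWorker.remote() for _ in range(4)]   # 4 cars at once

def evaluate_deltas(weights, deltas, std):
    """Farm the +delta and -delta episodes out across the worker pool."""
    futures, jobs = [], []
    for i, delta in enumerate(deltas):
        for sign in (+1.0, -1.0):
            worker = workers[len(futures) % len(workers)]
            futures.append(worker.rollout.remote(weights, delta, std, sign))
            jobs.append((i, sign))
    return jobs, ray.get(futures)   # blocks until all episodes finish
```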

This introduced me to the Ray module, and like some other advanced packages, I struggled a bit getting it to work on my machine. Ray uses a program called Redis to perform cluster computing at local or network scale, and becoming familiar with it will allow you to scale your research once you achieve results that can justify the expense of employing more computational resources. Fortunately, Ray will pull all the necessary levers in Redis for you, but only when you use it properly. As a Windows 10 user, I found that launching the Ray server in a non-administrator Powershell terminal window (a terminal run as administrator will NOT work), then running the ars.py file contained in the code folder of the ARS repository (as demonstrated in the README.md file there) in a SEPARATE non-administrator Powershell window did the trick. Each user will need to set parameters for the Ray server and the ars.py execution that suit the number of CPUs, GPUs, and the RAM available on their machine. Detailed instructions on running this process with the Carla ARS agent can be found in the project repository.

Once I was able to train the same BipedalWalker environment using the code in the ARS repository, I knew it was time to splice the Sentdex vehicle environment into this code. This took some doing, as the code in the ARS repo is reasonably complex and was engineered to work specifically with gym environments. It was also not designed to take in a previously saved policy and pick up training where it left off, which I wanted as an option for a task sure to take days’ worth of training, which one may wish to periodically interrupt or recover in the event of an error. This gave me a great opportunity to dig into the nuts and bolts of coding with Ray, and I look forward to training agents on large clusters in the future with that knowledge. On my gaming laptop, I was able to get a local cluster running which could handle 4 car workers running in the Carla server at once, which was much better than running just one at a time. The resulting code, which trains these Carla vehicle environments using ARS in parallel, can be found in the ARS_Carla folder of the project repository. It is a Frankenstein-esque combination of Sentdex’s CarEnv for Carla, the code from the ARS repository, and my own modifications/augmentations needed for this task. In the next few paragraphs, I will quickly summarize some of my modifications.

As discussed above, the ARS process usually normalizes inputs by keeping running statistics of the mean and standard deviation of the observation space components, so that it can normalize these inputs effectively as more and more states are observed, without requiring prior knowledge of the input distributions. This functionality was not necessary in the context of this task, since the output values of the VGG19 CNN are already scaled with one another within a known range between 0 and 10, so it was only necessary to divide the arrays by 10. Nevertheless, I wished to preserve this functionality for comparison later, so I compartmentalized it and wrapped it in a boolean parameter called ‘state_filter’ which can be passed when the code is run.
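For comparison, here is a minimal sketch of the two options. The running-statistics filter below is a simplified Welford-style version, not a copy of the filter in the ARS reference code, and the boolean ‘state_filter’ flag is represented here as an optional filter object:

```python
import numpy as np

class RunningStatFilter:
    """Online mean/std normalizer, updated with every state it observes."""
    def __init__(self, shape):
        self.n = 0
        self.mean = np.zeros(shape)
        self.m2 = np.zeros(shape)

    def __call__(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        std = np.sqrt(self.m2 / max(self.n - 1, 1)) + 1e-8
        return (x - self.mean) / std

def normalize_state(features, state_filter=None):
    """Divide-by-10 normalization by default; running stats when a filter is passed."""
    if state_filter is None:
        return features / 10.0      # VGG19 outputs already sit roughly in [0, 10]
    return state_filter(features)
```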

I also added optional functionality to pass in a pre-existing policy through the new ‘policy_file’ parameter, which takes the location of a .csv or .npz file containing weights and, in the case of the .npz file, possibly information to initialize the observation distributions of the state filter used to normalize inputs, if the policy was trained with one (in the same format that the .npz files are saved during logging steps, of course!). This allows one to pick up training at a later time from where they left off. Further, I built in two additional parameters, ‘lr_decay’ and ‘std_decay’, which reduce the learning rate and the size of the random perturbations applied to the weights over time, allowing for more exploration early on while eventually favoring smaller learning steps once the agent has some training under its belt. Another parameter, ‘show_cam’, accepts an integer value that determines the number of worker cameras made visible to the user during training. For long training sessions, I recommend setting this to zero to save CPU overhead, and turning it on later to watch the performance of a previously trained policy from a first-person perspective. It is always possible to watch workers from an aerial perspective in the Carla server window, regardless of how many cameras are being shown.
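A sketch of how these parameters might be used is below; the key names inside the .npz file are assumptions on my part, and the decay schedule shown (one multiplication per update step, with placeholder values) is purely illustrative:

```python
import numpy as np

def load_policy(policy_file):
    """Load saved weights (and, for .npz files, any stored filter statistics)."""
    if policy_file.endswith('.csv'):
        return np.loadtxt(policy_file, delimiter=','), None
    data = np.load(policy_file)
    weights = data['weights']                                    # assumed key name
    filter_stats = {k: data[k] for k in data.files if k != 'weights'}
    return weights, filter_stats

# Illustrative decay of the learning rate and delta standard deviation over time
learning_rate, delta_std = 0.02, 0.03       # placeholder starting values
lr_decay, std_decay = 0.999, 0.999          # placeholder decay factors
for update_step in range(1000):
    # ... generate deltas scaled by delta_std, run rollouts, update weights ...
    learning_rate *= lr_decay
    delta_std *= std_decay
```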

ARS usually works with the sum of the rewards over all steps in an episode (aka rollout). Since each step in our car environment is a frame that the camera sees, this creates something of a challenge: when the workers were made to report how many frames they saw in each episode, it was found that the frames per second any worker sees is inconsistent over time, depending on how well the CPU is performing at any instant. This means that simply summing the rewards per episode may give an inaccurate representation of a worker’s performance during a rollout, since some workers might see more frames per second than others due to fluctuating computational performance, and therefore have more chances to log rewards or punishments. For this reason, the more appropriate measurement in this context was judged to be the average reward per step in an episode, calculated simply by taking the sum of rewards over all steps of the episode and dividing by the number of steps the worker saw. This way, we normalize the scores to account for variable frame rates during training.

Episodes were ended as soon as a worker registered a collision. This created an interesting issue: Carla drops the vehicles onto the map from a very small height (possibly to keep them from getting stuck in the pavement), and sometimes this creates a small shock to the vehicle which registers as a collision on the first frame of an episode, immediately terminating the episode with a punishment recorded. For this reason, I instructed the workers to disregard collisions on the first frame of an episode, which fixed the problem and gave each delta a fighting chance to demonstrate its value.
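In the collision callback this amounts to a simple frame-index check; a small sketch with hypothetical names:

```python
class CollisionTracker:
    """Ignores the spurious collision Carla can register as the car is dropped in."""
    def __init__(self):
        self.frame_index = 0
        self.collided = False

    def on_collision(self, event):
        if self.frame_index > 0:      # disregard collisions on the very first frame
            self.collided = True

    def tick(self):
        self.frame_index += 1
```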

The interested reader can find all of the details of this methodology by reviewing the code and documentation in the repository. The reward system can be found within the ‘CarEnv’ class in the ars.py file. To run the code and train your own Carla ARS agent, go to the repository and follow the instructions in the ‘README.md’ file. Note that Carla is an absolute resource hog, and will operate at whatever level it can squeeze out of your machine. This means that for many users it will be necessary to limit the amount of resources available to it in order to keep their motherboard from melting during multi-day training periods. The easiest way to do this (on PC) is to change the ‘Maximum Processor State’ in your advanced power settings to a level which keeps your CPU running at an acceptable temperature (no higher than 80 degrees C is recommended). Unfortunately, Carla does not work on a virtual machine. For this study, I used a gaming laptop with a not-too-shabby quad-core i7-7700HQ processor and an NVIDIA GeForce GTX 1060 graphics card, and ran the training with ‘Maximum Processor State’ set to 83% to prevent overheating.

Results

An evaluation of the policy was done every 10 training steps by deploying the current policy on 32 rollouts and recording the descriptive statistics of the rewards produced by the policy over these rollouts. Below, we can see a chart which summarizes these evaluations over the 5-day training session done for this study.

We can see from the above chart that after 1250 training iterations, substantial progress was not made in the average rewards over the 32 rollouts of each evaluation step. There does appear to be a reduction in standard deviation of rewards over time, but without the desired increase in average reward.

There is a slow decline in average reward in the early period of the training, then a spike upwards around 500 training iterations, after which there is another decline. This may be an indication that the learning rate is set too high for this task, and more testing with different learning rates is highly appropriate based on these results. It may also indicate that the delta standard deviation was set too high, which needs further testing as well.

The minimum rewards do seem to increase over time, with much more extreme low values in the first portion of the training period. The agent was penalized for large steering control values, so this likely illustrates the period before the weights were adjusted enough to prevent extreme values for this control.

The maximum achievable reward for any rollout is 40, and we can see that individual rollouts were periodically reaching this level, even early in the training. This is a good time to dive into the issue of the varying circumstances our workers find themselves in after being dropped into the map. In this study, each worker was spawned at a random location on the map at the start of each episode, and the resulting reward from the worker’s behavior was therefore highly related to the location where it was spawned. A worker spawned with an open stretch of road ahead could get close to a max score by flooring it and not steering, while the same behavior at a different spawn point would result in a crash, and therefore a punishment. This makes it difficult for the algorithm to fairly evaluate the efficacy of each adjustment to the weights, since pure luck plays such a large role in how well the episode goes, and therefore in the contribution of each delta to the update step. This should be dealt with in the future, and a good start would be to spawn the worker testing a given delta at the same location for both the positive and negative addition of that delta, so that the rewards compared in the update step are generated under identical conditions. This would lead to more meaningful contributions of each delta to the update step.

Conclusions

In this study, a usable framework for training a self-driving car policy using ARS in Carla has been constructed. Although an effective policy was not achieved after the first round of training, many insights about how to improve these results in the future have been obtained. The learning framework created provides the opportunity to easily apply these insights in the future.

The first issue to address is the variability of circumstances that workers find themselves placed into during training steps. As mentioned above, a great first step in dealing with this issue would be to have the positive and negative addition of each delta tested from the same spawning location on the map, to generate more fair and meaningful reward comparisons in the update step.

Another point to consider is that the BipedalWalker, which has the advantage of being placed in the exact same scenario for every episode, still took around 600 update steps before the training process started having noticeable effects on the rewards. Considering how many unique situations the workers in this study are exposed to in the Carla simulator, it may not be surprising that meaningful performance gains were not achieved after 1250 training steps, and it is possible that allowing the agent more time to train may have ultimately paid off with an effective policy. In the future, training the agent with more computational resources would help answer this question, since more workers could be operating simultaneously and dividing the work up more ways, which would decrease the overall time requirement for each update step.

In this study, the brake and throttle were kept as separate controls, but it may have been more appropriate to combine them into one continuous control value between -1 and +1, since that would prevent them from being applied at the same time no matter how the weights were adjusted, and would provide a more realistic representation of how human beings apply these controls.
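Such a combined control could be mapped back to Carla’s separate throttle and brake values with something as simple as the following sketch (not something implemented in this study):

```python
def split_longitudinal(control):
    """Map one value in [-1, 1] to mutually exclusive (throttle, brake)."""
    if control >= 0.0:
        return control, 0.0
    return 0.0, -control
```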

Tweaking the reward system may also lead to better learning by the agent, especially the size of punishment for collisions. The agent will not be able to learn correct behavior for the edge information leading up to a collision without experiencing those collisions, so it may be appropriate to reduce the collision punishment so that the agent feels more free to explore its way into these scenarios, but not so much that it does not then alter the policy to avoid similar collisions in the future.

The next thing to work on would be to test more values for the hyperparameters and compare the results of each, which again could be much better done using larger hardware than the gaming laptop used in this study, since training time could be drastically reduced by having more workers in parallel. Smaller values for learning rate should be explored, since we saw decreasing average rewards in the early period of this training session, which could be an indication that the learning rate was too large. Smaller values for the delta standard deviation could also be explored. Further, the number of deltas generated as well as the number of them that are used in the update step could be altered to observe the effects on training.

The VGG19 CNN was chosen as an intermediary step between the camera and the ARS perceptron to convert raw camera data (which, if left untreated, is likely to cause overfitting to very specific RGB combinations) into edge information, which may generalize better. This part of the process needs more experimentation, as other networks or weights could have been chosen instead. The top layer was not included on the CNN, though it could be interesting to see what difference the inclusion of this layer might make. As it was, the CNN produced a prediction array of shape (7, 7, 512), which was flattened into shape (25088,) on its way into the ARS perceptron. Although much smaller than the (150528,) shape one would get by flattening the (224, 224, 3) RGB camera sensor data, this is still a big input layer, with a large number of specific arrangements that the agent can only evaluate its policy against if it actually experiences them. Even in our completely sterilized task of driving around empty streets on the same map with consistent lighting, it would be difficult for an agent to encounter each of these edge scenarios frequently enough to properly tune the weights without training for very long periods of time. Research into how to generalize these edge scenarios is warranted. It is very possible that using some kind of pooling layer(s) on the end of the CNN would help in this regard. It must also be acknowledged that many of these input values may be zero throughout much of the training process, in which case the weights associated with them would have no effect on performance, leaving them vulnerable to being modified willy-nilly by the update steps. More consideration needs to be given to this, and again it may be relieved by some type of dimensionality reduction such as adding pooling layers.
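For instance, placing a global average pooling layer on top of the frozen VGG19 base would collapse the (7, 7, 512) feature map into a 512-dimensional vector, at the cost of discarding spatial layout; a sketch of that idea:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG19

base = VGG19(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base.trainable = False

# Pooling collapses the 7x7 spatial grid, leaving one value per feature channel
pooled_extractor = models.Model(
    inputs=base.input,
    outputs=layers.GlobalAveragePooling2D()(base.output))   # output shape (None, 512)
```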

Variable weather was not used in this study, and the agent was only trained on one map, with no traffic or pedestrians. This was acceptable for giving ARS a chance to show its ability in the domain of autonomous driving, but further research would need to be done involving these variables.

This study only involved sensor data from one forward-facing camera per car with a field-of-view (FOV) of 110 degrees to train the agent. It may be worthwhile to run the same test again with a FOV of 180 degrees to compare the results. It would also be interesting to explore the addition of more cameras or sensors of other types to the vehicle, including radar, lidar, and GPS, all of which are available in Carla. Adding a rear camera would be a logical next step. The data from these sensors could be combined and processed in a variety of ways to facilitate learning of any task related to autonomous driving.

The task at hand in this study was relatively simple for this domain: drive around empty streets without smashing into anything, using edge depictions generated by a CNN from RGB camera sensor input. In reality, there are many individual tasks within the context of autonomous driving that researchers seek to accomplish using machine learning and sensory input, including object detection/classification, determination of drivable area, trajectory planning, localization, and many more. The ARS algorithm trains agents to perform continuous control tasks using a single-layer perceptron on a given set of inputs, and that capability would likely be helped by including among those inputs the predictions of models pre-trained to perform these specific driving-related tasks. More research is warranted to explore such possible input configurations to the algorithm.

The safety and scalability of learning in a simulated environment such as Carla provides the opportunity to assess the potential of various combinations of learning algorithms and sensory input configurations to perform a given task or set of tasks. Once models are trained to effectively perform these individual tasks using the sensor data, their outputs may be combined and passed as inputs into RL algorithms such as ARS to train autonomous navigation policies. The computational simplicity of ARS makes it possible to run more simultaneous virtual training operations on a given set of inputs for any given amount of available computational resources, allowing the efficacy of an input schema to be evaluated in a timely fashion. This means that ARS has the potential to expedite the discovery of powerful ways to combine sensory input and intermediary task-specific models to facilitate policy training. The learning framework constructed in this study can offer future researchers a structural foundation on which to explore these possibilities.
