Self-Driving Cars

Trajectory Prediction Evolution (Part 1/2)

Self-driving cars depend upon it

Priyash Sachdeva
9 min read · Aug 9, 2020


“If you recognize that self-driving cars are going to prevent car accidents, AI will be responsible for reducing one of the leading causes of death in the world.” — Mark Zuckerberg

Whenever we think about the AI world, the auto industry immediately comes to mind. Self-driving cars are one of those fascinating futures that no longer seems a distant reality: you just sit inside the car and watch a movie while it takes you to your destination.

But is it really that easy for a car to drive fully autonomously and pay attention to every context in its environment? In the past few years, many papers have been published on predicting socially acceptable future trajectories for cars and pedestrians.

Question: What will be one of the biggest challenges of self-driving cars?

Answer: Understanding pedestrian behavior and their future trajectory.

Human motion is multimodal, i.e. at any given instant a person can plausibly move in several different directions. This behavior is one of the biggest challenges for self-driving cars, since they have to navigate through a human-centric world.

In this first part, I will briefly discuss three papers whose main aim is to predict the possible future trajectories of pedestrians.

Social GAN

This is one of the initial papers that started using a GAN to predict the possible trajectories of humans.

This paper tries to solve the problem by predicting socially plausible future trajectories of humans, which will help self-driving cars make the right decisions.

Aim:

The paper aims at resolving two major challenges:

  1. To have a computationally efficient interaction model among all people in a scene.
  2. To learn and produce multiple trajectories that are socially acceptable.

Method

Fig 1. Screenshot from the Social GAN research paper

This paper presents a GAN-based encoder-decoder network that consists of an LSTM network for each person and a pooling module that models interactions among them.

The whole model (shown in Fig 1.) can be represented by 3 different components:

Generator

The generator consists of an encoder and a decoder. For each person i, the encoder takes the observed trajectory X_i as input: the location at each timestep is embedded into a fixed-length vector and fed to the LSTM cell at time t.

The LSTM weights are shared among all the people in a scene, which helps the pooling module model the interactions among them.

Unlike prior work, this paper uses the following two approaches:

a) For an easier training process during backpropagation, the decoder directly produces the (x, y) coordinates of each person’s location instead of predicting the parameters of a bivariate Gaussian distribution.

b) Instead of providing the social context at every step as input to the encoder, they provide it only once, as input to the decoder. This led to a speed-up of roughly 16x.
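To make the structure concrete, here is a minimal PyTorch-style sketch of such a generator. All layer names and sizes are hypothetical (not taken from the paper’s released code): each person’s (x, y) positions are embedded, encoded by a shared-weight LSTM, combined once with the pooling vector and noise, and decoded directly into (x, y) positions.

```python
# Minimal sketch of a Social GAN-style generator (hypothetical names and sizes).
import torch
import torch.nn as nn

class TrajectoryGenerator(nn.Module):
    def __init__(self, embed_dim=16, hidden_dim=32, noise_dim=8, pool_dim=32):
        super().__init__()
        self.spatial_embed = nn.Linear(2, embed_dim)          # embed (x, y)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTMCell(embed_dim, hidden_dim)
        # social context (pooling vector) and noise are injected once, before decoding
        self.context_to_hidden = nn.Linear(hidden_dim + pool_dim + noise_dim, hidden_dim)
        self.to_xy = nn.Linear(hidden_dim, 2)                 # direct (x, y) output
        self.noise_dim = noise_dim

    def forward(self, obs_traj, pool_vec, pred_len=12):
        # obs_traj: (num_people, obs_len, 2), pool_vec: (num_people, pool_dim)
        n = obs_traj.size(0)
        emb = torch.relu(self.spatial_embed(obs_traj))
        _, (h, _) = self.encoder(emb)                         # encoder weights shared across people
        z = torch.randn(n, self.noise_dim)                    # one noise sample per person
        h = torch.tanh(self.context_to_hidden(torch.cat([h[-1], pool_vec, z], dim=1)))
        c = torch.zeros_like(h)
        last_pos = obs_traj[:, -1]
        preds = []
        for _ in range(pred_len):
            h, c = self.decoder(torch.relu(self.spatial_embed(last_pos)), (h, c))
            last_pos = last_pos + self.to_xy(h)               # predict a displacement, not a distribution
            preds.append(last_pos)
        return torch.stack(preds, dim=1)                      # (num_people, pred_len, 2)
```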

Discriminator

The discriminator consists of an encoder with LSTM layers for each person. Its job is to distinguish real trajectories from fake ones.

Ideally, it should classify the trajectories as “fake” if they are not socially acceptable or possible.

Pooling Module

Fig 2. Screenshot from the Social GAN paper

The basic idea of this approach is shown in Fig 2. The method computes the relative positions of person 1 (shown in red) with respect to all other people (shown in blue and green). These relative positions are then concatenated with the hidden states and processed independently through an MLP (multi-layer perceptron).

Eventually, the resulting features are max-pooled elementwise to compute person 1’s pooling vector P1.

This removes the limitation of only considering people inside a fixed grid (the S-Pool grid in Fig 2).
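A rough sketch of that idea, with assumed dimensions (not taken from the paper’s code): pairwise relative positions are embedded, concatenated with the other people’s hidden states, pushed through an MLP, and max-pooled elementwise into one vector per person.

```python
# Minimal sketch of the pooling module (hypothetical shapes and sizes).
import torch
import torch.nn as nn

class PoolingModule(nn.Module):
    def __init__(self, hidden_dim=32, embed_dim=16, pool_dim=32):
        super().__init__()
        self.rel_embed = nn.Linear(2, embed_dim)
        self.mlp = nn.Sequential(nn.Linear(embed_dim + hidden_dim, pool_dim), nn.ReLU())

    def forward(self, positions, hidden_states):
        # positions: (N, 2) last observed positions, hidden_states: (N, H) encoder states
        n = positions.size(0)
        rel = positions.unsqueeze(0) - positions.unsqueeze(1)      # (N, N, 2) pairwise offsets
        rel = torch.relu(self.rel_embed(rel))                      # (N, N, E)
        h = hidden_states.unsqueeze(0).expand(n, -1, -1)           # (N, N, H)
        feats = self.mlp(torch.cat([rel, h], dim=-1))              # (N, N, pool_dim)
        return feats.max(dim=1).values                             # (N, pool_dim) pooling vector per person
```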

Losses

3 different losses are used in this paper:

  1. Adversarial loss: This loss is a typical GAN loss that helps in differentiating real and fake trajectories.
  2. L2 Loss: This loss takes the distance between the predicted and ground-truth trajectory and measures how far the generated samples are from real ones.
  3. Variety Loss: This loss helps in generating multiple different trajectories, i.e. multimodal trajectories. The idea is simple: for each input, N different possible outcomes are predicted by randomly sampling ‘z’ from N(0, 1), and only the trajectory with the minimum L2 distance to the ground truth is penalized.

Screenshot from the Social GAN paper
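The variety loss fits in a few lines. The sketch below assumes the hypothetical TrajectoryGenerator from the earlier snippet, i.e. a generator that samples a fresh ‘z’ internally on each call.

```python
# Minimal sketch of the variety (best-of-N) loss: generate N candidates per person
# and back-propagate only through the one closest to the ground truth.
import torch

def variety_loss(generator, obs_traj, pool_vec, gt_traj, n_samples=20):
    # gt_traj: (num_people, pred_len, 2)
    losses = []
    for _ in range(n_samples):
        pred = generator(obs_traj, pool_vec)                       # fresh z sampled inside
        losses.append(torch.norm(pred - gt_traj, dim=-1).mean(dim=-1))  # (num_people,) mean L2 error
    losses = torch.stack(losses, dim=0)                            # (n_samples, num_people)
    return losses.min(dim=0).values.mean()                         # keep only the best sample per person
```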

Sophie: An Attentive GAN

This paper extends the work of Social GAN and tries to predict an agent’s future path with the help of both physical and social information.

Although the aim is the same as in Social GAN, this paper also adds scene information with the help of an image of each frame.

The network learns two types of attention components:

  1. Physical Attention: This attention component helps in paying attention to and processing the local and global spatial information of the surroundings. As mentioned in the paper: “For example, when reaching a curved path, we focus more on the curve rather than other constraints in the environment”
  2. Social Attention: In this component, the idea is to give more attention to the movement and decisions of other agents in the surrounding environment. For example: “when walking in a corridor, we pay more attention to people in front of us rather than the ones behind us”

Method

Fig 3. Screenshot from the Sophie research paper

This paper’s proposed approach is divided into 3 modules (as shown in Fig 3).

Feature Extractor module

This module extracts features from the input in 2 different forms: first, an image of each frame, and second, the state of each agent in each frame at time ‘t’.

To extract visual features from the image, they use VGGnet-19 as the CNN backbone, with weights initialized from ImageNet pre-training. To extract features from the past trajectory of all agents, they use an approach similar to Social GAN and use an LSTM as the encoder.
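A minimal sketch of the two feature streams, using torchvision’s VGG-19 and assumed dimensions (the exact feature shapes and projections in Sophie may differ):

```python
# Minimal sketch of the feature extractor: VGG-19 conv features for the scene image,
# plus an LSTM encoder over each agent's past trajectory (hypothetical sizes).
import torch
import torch.nn as nn
from torchvision import models

class FeatureExtractor(nn.Module):
    def __init__(self, embed_dim=16, hidden_dim=32):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)  # ImageNet-initialized
        self.visual = vgg.features                     # convolutional feature maps only
        self.spatial_embed = nn.Linear(2, embed_dim)
        self.traj_encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, frame, past_traj):
        # frame: (1, 3, H, W) scene image, past_traj: (num_agents, obs_len, 2)
        visual_feats = self.visual(frame)              # (1, 512, H/32, W/32)
        emb = torch.relu(self.spatial_embed(past_traj))
        _, (h, _) = self.traj_encoder(emb)             # h[-1]: (num_agents, hidden_dim)
        return visual_feats, h[-1]
```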

To understand the interaction between agents and capture the influence of each agent’s trajectory on the others, Social GAN used the pooling module described above. This paper mentions 2 limitations of that method:

  1. The max function may discard important features of the inputs, as they can lose their uniqueness.
  2. After the max operation, all the trajectories are concatenated, which may lead to an identical joint feature representation.

Because of these limitations, they define an ordering structure. Instead of max (used in Social GAN), they use sort as the permutation-invariant function: the agents are sorted by their Euclidean distance to the target agent.
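A tiny sketch of that ordering step, with assumed shapes: other agents’ encodings are arranged by their distance to the target agent rather than collapsed by a max.

```python
# Minimal sketch of sort-based (permutation-invariant) ordering of agent features.
import torch

def sort_by_distance(target_pos, other_pos, other_hidden):
    # target_pos: (2,), other_pos: (N, 2), other_hidden: (N, H) per-agent encodings
    dists = torch.norm(other_pos - target_pos, dim=1)   # Euclidean distances to the target agent
    order = torch.argsort(dists)                         # nearest first
    return other_hidden[order]                           # (N, H), distance-ordered features
```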

Attention module

With the help of physical and social attention, this module highlights the important parts of the input for the next module.

The idea is that, just as humans pay more attention to certain obstacles or objects in an environment, such as an upcoming turn or people walking towards them, a similar kind of attention needs to be learned.

As mentioned before, this network learns 2 different attention components.

In physical attention, the hidden states of the LSTM from the GAN module and the learned features from the visual context are provided as input. This helps in learning the physical constraints of the scene: whether the path is straight or curved, the current movement direction, the position, and more.

In social attention, the LSTM features learned in the feature extractor module, together with the hidden states of the LSTM from the GAN module, are provided as input. This helps in focusing on the agents that matter most for predicting the future trajectory.
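Both components can be thought of as variants of standard soft attention. Below is a generic sketch with assumed layer sizes, not the paper’s exact formulation: the decoder state scores each candidate feature (visual regions for physical attention, per-agent encodings for social attention), and a softmax-weighted sum keeps the relevant ones.

```python
# Minimal soft-attention sketch (hypothetical sizes), usable for either attention type.
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    def __init__(self, feat_dim, hidden_dim, attn_dim=64):
        super().__init__()
        self.project = nn.Linear(feat_dim + hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (num_items, feat_dim) candidate features, hidden: (hidden_dim,) decoder state
        h = hidden.unsqueeze(0).expand(feats.size(0), -1)
        scores = self.score(torch.tanh(self.project(torch.cat([feats, h], dim=1))))
        weights = torch.softmax(scores.squeeze(-1), dim=0)       # attention weights over items
        return (weights.unsqueeze(-1) * feats).sum(dim=0)        # weighted context vector
```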

GAN module

This module takes the highlighted input features and generates a realistic future path for each agent that satisfies the social and physical norms.

This GAN module is largely inspired by Social GAN, with almost no further changes.

The input to the generator is the selected features from the attention module as well as white noise ‘z’ sampled from a multivariate normal distribution.

Losses

This approach uses 2 losses, which are also similar to those in Social GAN:

  1. Adversarial Loss: This loss helps in learning discrimination between the generated and real samples.
  2. L2 Loss: This loss is similar to the “variety loss” used in Social GAN.

Social Ways

In this paper, the authors also try to predict pedestrians’ trajectories and their interactions. However, they additionally aim to solve one problem shared by all the previous approaches: mode collapse.

Mode collapse is the opposite of multimodality: the generator produces very similar samples, or the same set of samples, so the output collapses onto only a few modes.

To address mode collapse, this paper uses an InfoGAN-style latent code and information loss instead of the L2/variety loss.

Method

Fig 4. Screenshot from the Social Ways research paper

This method comprises 3 different components:

Generator

The generator consists of an encoder-decoder network. The past trajectory of each agent is fed into its respective LSTM-E (encoder), which encodes that agent’s history. The output of each LSTM-E is fed both into the attention pooling module and into the decoder.

To decode the future trajectories, the hidden states from LSTM-E, the noise vector ‘z’, the latent code ‘c’, and the features of the important interacting agents from the attention pooling are fed into the decoder.

The latent code ‘c’ helps in maximizing a lower bound of the mutual information between the distribution of generated output and ‘c’.
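A small sketch of how the decoder input might be assembled, with hypothetical sizes and a uniform latent code (the paper’s exact distributions and dimensions may differ):

```python
# Minimal sketch (hypothetical sizes) of assembling the decoder input in a
# Social Ways-style generator: encoder state + noise z + latent code c + attention features.
import torch
import torch.nn as nn

hidden_dim, noise_dim, code_dim, pool_dim = 64, 16, 2, 64
fuse = nn.Linear(hidden_dim + noise_dim + code_dim + pool_dim, hidden_dim)

h_enc = torch.randn(1, hidden_dim)            # LSTM-E hidden state for one agent
z = torch.randn(1, noise_dim)                 # noise vector
c = torch.rand(1, code_dim) * 2 - 1           # latent code c (e.g. uniform; the paper's choice may differ)
attn = torch.randn(1, pool_dim)               # attention-pooled interaction features
decoder_init = torch.tanh(fuse(torch.cat([h_enc, z, c, attn], dim=1)))  # initial decoder state
```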

Attention Pooling

This paper uses an attention approach similar to the one in Sophie: An Attentive GAN.

However, in addition to the Euclidean distance between agents (used in Sophie), 2 more features are used:

  1. Bearing angle: “the angle between the velocity vector of agent 1 and vectors joining agents 1 and agent 2.”
  2. The distance of closest approach: “the smallest distance, 2 agents would reach in the future if both maintain their current velocity.”

Instead of sorting, the attention weights are obtained through a scalar product and a softmax operation between the hidden states and the three features mentioned above.
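Here is a sketch of those geometric features and the dot-product attention, under a constant-velocity assumption for the distance of closest approach and with hypothetical shapes:

```python
# Minimal sketch of the three geometric interaction features and dot-product attention.
import torch

def interaction_features(p_i, v_i, p_j, v_j):
    # p_*: (2,) positions, v_*: (2,) velocities of target agent i and another agent j
    dp, dv = p_j - p_i, v_j - v_i
    dist = torch.norm(dp)                                              # Euclidean distance
    bearing = torch.atan2(dp[1], dp[0]) - torch.atan2(v_i[1], v_i[0])  # bearing angle (not wrapped to [-pi, pi])
    t_star = torch.clamp(-(dp @ dv) / (dv @ dv + 1e-8), min=0.0)       # time of closest approach
    dca = torch.norm(dp + t_star * dv)                                 # distance of closest approach
    return torch.stack([dist, bearing, dca])

def attention_weights(hidden_i, feature_embeddings):
    # hidden_i: (H,) target agent's state, feature_embeddings: (N, H) embedded features per other agent
    scores = feature_embeddings @ hidden_i                             # scalar products
    return torch.softmax(scores, dim=0)                                # one weight per other agent
```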

Discriminator

The discriminator consists of an LSTM-based encoder followed by multiple dense layers. The trajectories generated by the generator for each agent, as well as the ground-truth trajectories, are fed into the discriminator.

As output, it provides the probability that the generated trajectories are real.

Losses

There are 2 losses used in this process:

  1. Adversarial Loss: This is the normal GAN loss that helps in differentiating between real and fake samples.
  2. Information Loss: The basic idea of this loss is to maximize mutual information, which is achieved by minimizing the negative log-likelihood of the latent code ‘c’.
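A minimal sketch of that information loss, assuming a continuous latent code and a hypothetical auxiliary head Q that regresses ‘c’ from features of the generated trajectory (a squared error corresponds to a Gaussian negative log-likelihood up to constants):

```python
# Minimal InfoGAN-style information loss sketch (hypothetical head and sizes):
# recovering c from the generated output maximizes a lower bound on their mutual information.
import torch
import torch.nn as nn

code_dim, feat_dim = 2, 64
q_head = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, code_dim))

def info_loss(traj_features, c_true):
    # traj_features: (batch, feat_dim) features of generated trajectories
    # c_true: (batch, code_dim) latent codes that were fed to the generator
    c_pred = q_head(traj_features)
    return ((c_pred - c_true) ** 2).mean()   # Gaussian NLL with fixed variance, up to constants
```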

Results

All three papers have tried to learn from previous approaches and have gained some new insights.

There are 2 metrics that are used to evaluate this application:

  1. Average Displacement Error (ADE): The average L2 distance between the generated trajectory and the ground-truth trajectory over all predicted timesteps.
  2. Final Displacement Error (FDE): The L2 distance between the generated trajectory and the ground-truth trajectory at the final predicted timestep.
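Both metrics are only a few lines of code (shapes assumed here):

```python
# Minimal sketch of ADE and FDE for predicted vs. ground-truth trajectories.
import torch

def ade_fde(pred, gt):
    # pred, gt: (num_agents, pred_len, 2)
    dists = torch.norm(pred - gt, dim=-1)      # (num_agents, pred_len) per-step L2 errors
    ade = dists.mean()                          # average over agents and all timesteps
    fde = dists[:, -1].mean()                   # error at the final timestep, averaged over agents
    return ade, fde
```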

There are 5 benchmark scenes (from the ETH and UCY pedestrian datasets) commonly used to evaluate this task. All three approaches have been tested on these datasets, so their results can be compared directly.

Fig 5. Screenshot of results from the Social Ways paper

From Fig 5, it can be seen that all three approaches show promising results on some of the datasets. However, I think that with hyperparameter tuning and small adjustments the rankings could shift. I believe all three have certain advantages and could serve as a basis for further research in this area.

Conclusion

The problem of modeling human motion prediction in a scene, along with human-human interaction, is challenging yet vital for self-driving cars. Without modeling this behavior, self-driving cars cannot become fully operational.

How human-human interaction is modeled is the major difference between the above-mentioned approaches. From the theory and the suggested methods, I believe attention over the distance and the bearing angle between 2 agents is one of the most crucial ways forward.

But that is just my perspective. There are multiple ways this could be implemented and enhanced, and we will see more of them in Part 2.

With self-driving cars as the focus, I will continue with more approaches in Part 2, concentrating on the trajectory prediction of cars.

I am happy for any further discussion on this paper and in this area. You can leave a comment here or reach out to me on my LinkedIn profile.

References

  1. Alexandre Alahi, Kratarth Goel, et al., Social LSTM: Human Trajectory Prediction in Crowded Spaces, CVPR 2016
  2. Agrim Gupta, Justin Johnson, Li Fei-Fei, et al., Social GAN: Socially Acceptable Trajectories with Generative Adversarial Networks, CVPR 2018
  3. Amir Sadeghian, Vineet Kosaraju, Ali Sadeghian, et al., Sophie: An Attentive GAN for Predicting Paths Compliant to Social and Physical Constraints, CVPR 2019
  4. Karen Simonyan, Andrew Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, ICLR 2015
  5. Javad Amirian, Jean-Bernard Hayet, Julien Pettré, Social Ways: Learning Multi-Modal Distributions of Pedestrian Trajectories with GANs, CVPR Workshops 2019
