
In this article, we will learn about a novel approach to self-supervised object tracking. Self-supervised learning is an approach where models learn on their own 😎, which by itself makes the topic very interesting. Here we will see how a model can learn to track objects on its own. We will start with the basics of object tracking, then look at what self-supervised learning means for computer vision, and finally discuss the approach in detail.
The implementation of this method can be found here.
Introduction to Object Tracking 🎯
In simple language, object tracking can be understood as identifying unique objects throughout a video sequence. The object to track is usually known as the target object. Tracking can be done with either _bounding boxes or instance segmentation._ There are two types of public object tracking challenges.
- Single Object Tracking: track a single object of interest throughout the video sequence, e.g. the VOT challenge
- Multiple Object Tracking: To track multiple objects of interest throughout the video sequence. e.g. MOT challenge
Research Trends
A number of classical CV algorithms have been used to solve object tracking.
One of the most famous multi-object tracking algorithms, SORT, uses a Kalman filter at its core and was very successful.
With the emergence of the deep learning era, many innovative approaches arrived in the community, and DL was successful in outperforming classical CV approaches on public tracking challenges. Despite this big success on public challenges, DL still struggles to give generalised solutions for real-world problem statements.
Challenges with Deep Models 💭
One of the major challenges we face when training a deep CNN model is training data.
- Training Data: Deep learning approaches are data-hungry, and this almost always becomes a bottleneck. Additionally, tasks like multiple object tracking are very difficult to annotate, which makes the process impractical and expensive.

Self-Supervised Learning to Rescue 😯
We are all aware of supervised and unsupervised learning techniques. This is a fairly new type known as self-supervised learning. In this type of learning, we try to leverage the information already present in the data rather than any external labels, or, as we sometimes say, the model learns on its own. In reality, we train the CNN model for some other task that indirectly helps us achieve our target goal; the model supervises itself. These tasks are called "proxy tasks" or "pretext tasks". A few examples of proxy tasks are:
- Colourization
![CNN model learns to predict colours from a greyscale image. [source]](https://towardsdatascience.com/wp-content/uploads/2020/08/1IrkcjQEtLnMFMr1KJGHQnQ.jpeg)
- Placing image patches in the right place
![The patches are extracted from an image and shuffled. The model learns to solve this jigsaw and arrange the tiles in the correct sequence as shown in image 3. [source]](https://towardsdatascience.com/wp-content/uploads/2020/08/1V9aYEoJyyU6wi04S7Xs-ww.png)
- Placing frames in the right order
![The model learns to sort the shuffled frames in a video sequence. [source]](https://towardsdatascience.com/wp-content/uploads/2020/08/1MRARnK-ZtcN-sChDcwbPUw.png)
Many such tasks can be used as proxy tasks for computer vision problems. A major benefit of such training is that no manually annotated data is required, which makes it suitable for solving real-life use cases.
Tracking Emerges by Colorizing Videos
We have seen what a self-supervised model is, and you may have guessed from the name that we will use colourization as our proxy task.
Intro
Colourization is our proxy (or pretext) task, and object tracking is the main (or downstream) task. Large-scale unlabelled videos are used to train the model, without a single pixel annotated by humans. The temporal coherency of video is used to make the model learn to colourize grey-scale videos. This might seem confusing, but stick with it; I will make things clear.
How will the model learn to track?
We take two frames, a target frame (at time t) and a reference frame (at time t-1), and pass them through the model. The model is expected to predict the colours of the target frame from the prior knowledge of the colours of the reference frame. This way the model internally learns to point to the right region in order to copy the colours from the reference frame, as shown in the figure. This pointing mechanism can be used as a tracking mechanism during inference. We will soon see how to do this.
![The model receives as input one colour frame and a grey-scale video and predicts the colours for the next frame. The model learns to copy colours from the reference frame, which enables a mechanism for tracking to be learned without human supervision [source]](https://towardsdatascience.com/wp-content/uploads/2020/08/1-KTE9-7G1LuAzGx4Ysenog.gif)
We don’t copy the colours in the network directly; rather, our CNN is trained to learn a similarity between pixels of the target frame and pixels of the reference frame (the similarity is computed on grey-scale pixels). This similarity matrix, linearly combined with the true colours of the reference frame, gives the predicted colours. Mathematically, let cᵢ be the true colour of each pixel i in the reference frame and cⱼ be the true colour of each pixel j in the target frame. The model gives a similarity matrix Aᵢⱼ between the target frame and the reference frame. We get the predicted colour yⱼ of each target pixel by a linear combination, yⱼ = Σᵢ Aᵢⱼ cᵢ.
![[source]](https://towardsdatascience.com/wp-content/uploads/2020/08/1XV4vsKMjrjQshyPN6JGsDQ.png)

How to calculate the similarity matrix?
Both images, the reference frame and the target frame, are passed through the model, which learns a low-level embedding for each pixel; here fᵢ is the embedding of pixel i in the reference frame and, similarly, fⱼ is the embedding of pixel j in the target frame. The similarity matrix is then the softmax-normalised inner product of these embeddings, Aᵢⱼ = exp(fᵢᵀfⱼ) / Σₖ exp(fₖᵀfⱼ).
Each entry of the similarity matrix compares a reference pixel i with a target pixel j; to make the weights for each target pixel sum to 1, the softmax is applied over the reference pixels.
Let's look at an example with dimensions to make it clear; we will find the similarity matrix for one pixel of the target frame.
An illustration of this example is shown below.
Consider a reference frame and a target frame, each of size (5, 5) => (25, 1)
For each pixel, the CNN gives an embedding of size (64, 1)
fᵢ, embedding of the reference frame, size (64, 25)
fⱼ, embedding of the target frame, size (64, 25)
At j = 2, f₂ is the embedding of the 3rd pixel in the target frame, size (64, 1)
Similarity matrix between the reference frame and the target pixel j = 2:
Aᵢ₂ = softmax(fᵢᵀ x f₂), size (25, 64) x (64, 1) => (25, 1) => (5, 5)
This gives the similarity between all reference pixels and the target pixel at j = 2.
Colourization: to copy the colour (here, colours are not RGB but quantized colours with 1 channel) from the reference frame,
cᵢ, colours of the reference frame, size (5, 5) => (25, 1)
Aᵢ₂, similarity matrix, size (5, 5) => (1, 25)
Predicted colour at j = 2:
y₂ = Aᵢ₂ x cᵢ, size (1, 25) x (25, 1) => (1, 1)
From the similarity matrix in the figure below, we can see that the reference colour at i = 1 is dominant (0.46), so the colour for the target pixel j = 2 is copied from the reference pixel i = 1.
PS:
1. ᵀ denotes transpose
2. matrix indices start from 0
![(a) shows 2 frames of size (5,5), (b) an inner product of reference frame embedding and an embedding of target pixel at j =2, (c ) the similarity matrix after softmax, (d) linear combination of similarity matrix and true colours of the reference frame [source]](https://towardsdatascience.com/wp-content/uploads/2020/08/19iY7DPZdhPqlqQu-wbUIPQ.png)
Similarly, for every pixel in the target frame ((5, 5) => 25 pixels), we will have a similarity matrix of size (5, 5), i.e. a complete similarity matrix Aᵢⱼ of size (5, 5, 25) => (25, 25). We will extend the same concept to (256 x 256) images in the implementation.
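To make these shapes concrete, below is a minimal PyTorch sketch of the toy example above, using random embeddings; all tensor names are illustrative and not taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F

# 5 x 5 reference and target frames -> 25 pixels, with a 64-dim embedding per pixel
f_ref = torch.randn(64, 25)                  # f_i: embeddings of the reference frame
f_tgt = torch.randn(64, 25)                  # f_j: embeddings of the target frame
c_ref = torch.randint(0, 16, (25,))          # quantized colours of the reference frame
c_ref_onehot = F.one_hot(c_ref, 16).float()  # (25, 16)

# Similarity of every reference pixel i with every target pixel j,
# softmax-normalised over the 25 reference pixels (dim=0)
A = F.softmax(f_ref.t() @ f_tgt, dim=0)      # (25, 25)

# Predicted colour distribution for every target pixel: y_j = sum_i A_ij * c_i
y_tgt = A.t() @ c_ref_onehot                 # (25, 16)

# Column j = 2 of A is the (5, 5) similarity map for the 3rd target pixel
print(A[:, 2].reshape(5, 5))
print(y_tgt[2].argmax())                     # most likely colour cluster at j = 2
```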
Image Quantization
Colours have low spatial frequency, so we can work with low-resolution frames. We don’t need all 256³ possible colour combinations, so we create 16 clusters and quantize the colour space into them. Now we have only 16 unique colour clusters (see the 3rd column of the figure above). Clustering is done using k-means. With 16 clusters there will be some loss of colour information, but it is enough for identifying objects. We can increase the number of clusters to improve the precision of colourization, but at the cost of increased computation.
Why LAB colour space over RGB?
![[source]](https://towardsdatascience.com/wp-content/uploads/2020/08/1SEPc0kBGuUQ5kbqKKtOwFA.png)
To quantize the image into clusters, we will use the AB channels of the LAB colour space rather than the RGB colour space. The graph above shows the inter-channel correlation of RGB and LAB; from it we can conclude that
- RGB tends to have more correlation than LAB.
- LAB forces the model to learn invariances; it must learn a more robust representation rather than relying on local colour information
The clustering can be done using the KMeans package from sklearn.
This class will be used to make clusters of colours and we will store it as a pickle.
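A minimal sketch of fitting the colour clusters is shown below, assuming OpenCV for the LAB conversion; the function name, sampling and pickling steps are illustrative rather than the original implementation.

```python
import pickle
import cv2
import numpy as np
from sklearn.cluster import KMeans

def fit_colour_clusters(frames_bgr, n_clusters=16):
    """Fit k-means on the AB channels of the LAB colour space."""
    ab_pixels = []
    for frame in frames_bgr:
        lab = cv2.cvtColor(frame, cv2.COLOR_BGR2LAB)
        ab_pixels.append(lab[:, :, 1:].reshape(-1, 2))   # drop L, keep A and B
    return KMeans(n_clusters=n_clusters).fit(np.concatenate(ab_pixels))

# Fit once on a sample of training frames and store the clusters for reuse:
# kmeans = fit_colour_clusters(sample_frames)
# with open("colour_clusters.pkl", "wb") as f:
#     pickle.dump(kmeans, f)
```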
Implementation 💻
Note: I have used PyTorch for the implementation, and it follows the (N, C, H, W) format. Keep that in mind when dealing with matrix reshaping. If you have any doubts about the shapes, feel free to reach out.

![The model learns to colour the video frame from a reference frame. [source]](https://towardsdatascience.com/wp-content/uploads/2020/08/13CKC5jKEJb-IHcwVz7oQ5w.gif)
Input
The inputs to the model are four grey-scale video frames down-sampled to 256 × 256: three reference frames and one target frame.
Pre-Processing
Firstly, we reduce all the training videos to 6 fps. Then we preprocess the frames to create two different sets: one for the CNN model and one for the colourization task (a rough sketch of both sets follows the list below).
- Video fps is reduced to 6 fps
SET 1 - for CNN Model
- Downsample to 256 x 256
- Normalise to have intensities between [-1, 1]
SET 2 - for Colourization
- Convert to LAB colour space
- Downsample to 32 x 32
- Quantize into 16 clusters using k-means
- Create one-hot vector corresponding to the nearest cluster centroid
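Below is a rough sketch of both preprocessing branches for a single frame, assuming OpenCV and the k-means model fitted earlier; the helper name and exact normalisation details are my assumptions.

```python
import cv2
import numpy as np

def preprocess_frame(frame_bgr, kmeans):
    # SET 1: grey-scale input for the CNN, 256 x 256, intensities in [-1, 1]
    grey = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    grey = cv2.resize(grey, (256, 256)).astype(np.float32) / 127.5 - 1.0

    # SET 2: quantized colour target, 32 x 32, one-hot over the 16 clusters
    lab = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2LAB)
    ab = cv2.resize(lab, (32, 32))[:, :, 1:].reshape(-1, 2)
    labels = kmeans.predict(ab)                           # (1024,)
    onehot = np.eye(16, dtype=np.float32)[labels]         # (1024, 16)
    return grey, onehot.reshape(32, 32, 16)
```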
Model Architecture
The backbone used is ResNet-18 so that the results are comparable with other methods. The last layers of ResNet-18 are modified so that its output has dimensions 32 x 32 x 256. The output of ResNet-18 is then passed into a 3D-conv network, and the final output is 32 x 32 x 64 (the code block below shows a 3D network that takes its input from the ResNet-18 backbone).
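The original code snippet is not reproduced here; the following is a hedged reconstruction of such a 3D-conv head with the input/output shapes described above. The exact layer configuration (number of layers, kernel sizes, normalisation) is an assumption.

```python
import torch
import torch.nn as nn

class Conv3DHead(nn.Module):
    """Takes the stacked ResNet-18 feature maps of the 4 frames, (N, 256, 4, 32, 32),
    and produces a 64-dimensional embedding per pixel, (N, 64, 4, 32, 32)."""
    def __init__(self, in_channels=256, out_channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_channels, 128, kernel_size=3, padding=1),
            nn.BatchNorm3d(128),
            nn.ReLU(inplace=True),
            nn.Conv3d(128, out_channels, kernel_size=3, padding=1),
        )

    def forward(self, x):        # x: (N, 256, T=4, 32, 32)
        return self.net(x)       # (N, 64, T=4, 32, 32)

# feats = torch.randn(1, 256, 4, 32, 32)   # features of 4 frames stacked over time
# emb = Conv3DHead()(feats)                # (1, 64, 4, 32, 32)
```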
Training
Training can be divided into the following three steps (a minimal sketch of one training step follows the list):
1. Network Pass: We use SET 1 of the preprocessed frames, i.e. four grey-scale frames of size (256 x 256) are passed through the network to get a (32 x 32) spatial map with 64 channels. This can be interpreted as a 64-dimensional embedding for each pixel of a (32 x 32) image. So we have four such pixel-wise embeddings: three for the reference frames and one for the target frame.
2. Similarity Matrix: With these four embeddings, we find a similarity matrix between the reference frames and the target frame. For a pixel in the target frame, we have a similarity value with all the pixels in all three reference frames, normalised to 1 by the softmax.
3. Colourization: We use SET 2 of the preprocessed frames, i.e. the four frames downsampled to (32 x 32) and quantized. The three reference frames are combined with the similarity matrix to get a predicted quantized target frame, and we compute the cross-entropy loss on the predicted colours (remember we quantized the frames into 16 clusters, so we now have 16 categories and compute a multi-category cross-entropy loss over them).
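Putting the three steps together, here is a minimal sketch of one training step. It assumes the model returns a 64-dimensional embedding per pixel of a (32 x 32) grid for each of the four frames; everything outside the shapes stated in the text (function names, the exact loss wiring) is an assumption.

```python
import torch
import torch.nn.functional as F

def training_step(model, grey_frames, quantized, optimizer):
    """grey_frames: (N, 4, 1, 256, 256) - three reference frames then the target frame.
    quantized: (N, 4, 32, 32) long tensor of colour-cluster indices (SET 2)."""
    emb = model(grey_frames)                             # assumed: (N, 4, 64, 32, 32)
    N = emb.shape[0]
    f_ref = emb[:, :3].permute(0, 1, 3, 4, 2).reshape(N, 3 * 32 * 32, 64)
    f_tgt = emb[:, 3].permute(0, 2, 3, 1).reshape(N, 32 * 32, 64)

    # similarity of every target pixel with every reference pixel,
    # softmax-normalised over the reference pixels
    A = F.softmax(torch.bmm(f_tgt, f_ref.transpose(1, 2)), dim=-1)   # (N, 1024, 3072)

    c_ref = F.one_hot(quantized[:, :3].reshape(N, -1), 16).float()   # (N, 3072, 16)
    y_pred = torch.bmm(A, c_ref)                                     # (N, 1024, 16)

    # cross-entropy between predicted and true colour clusters of the target frame
    loss = F.nll_loss(torch.log(y_pred + 1e-8).transpose(1, 2),
                      quantized[:, 3].reshape(N, -1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```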
Inference


![Example of tracking predictions [source]](https://towardsdatascience.com/wp-content/uploads/2020/08/14_csmyYhrK6uA1jV78-g5g.gif)
After learning the colourization task, we have a model that can compute a similarity matrix Aᵢⱼ for a pair of target and reference frames. Now, for the actual task of tracking, we exploit the property that the model is non-parametric in the label space. We simply re-use the same linear combination to propagate, but instead of propagating colours we propagate distributions of categories. For the first frame we have ground-truth masks, so we arrange all the instance masks as one-hot vectors cᵢ (this is similar to the one-hot vectors of quantized colours used during training). We combine cᵢ with the similarity matrix Aᵢⱼ to find the new positions of the masks, but remember that the predictions cⱼ in subsequent frames will be soft, indicating the confidence of the model. To make a hard decision, we simply take the most confident category. The algorithm for inference is:
WHILE (target frame, reference frames) in the video
step 1. Pass the target and reference frames through CNN model
step 2. Find Similarity Matrix
step 3. Take ground truth object masks as one-hot encoding
step 4. Linearly combine the object masks with the similarity matrix
step 5. Update ground truth object masks by predicted masks
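A hedged Python sketch of this loop, simplified to a single reference frame, is shown below; `model` is assumed to return per-pixel embeddings flattened to (H·W, 64), which is an illustration rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def track(model, frames, first_frame_masks, n_objects):
    """frames: list of grey-scale frames; first_frame_masks: (H*W,) long tensor
    of instance ids for the first frame (0 = background)."""
    masks = F.one_hot(first_frame_masks, n_objects + 1).float()   # (H*W, n_objects+1)
    predictions = [first_frame_masks]
    for t in range(1, len(frames)):
        f_ref = model(frames[t - 1])                   # (H*W, 64), assumed shape
        f_tgt = model(frames[t])                       # (H*W, 64)
        A = F.softmax(f_tgt @ f_ref.t(), dim=-1)       # (H*W, H*W)
        masks = A @ masks                              # propagate label distributions
        predictions.append(masks.argmax(dim=-1))       # hard decision per pixel
    return predictions
```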
Failure Modes
Let’s discuss the scenarios in which the model tends to fail. These are mostly cases where colourization fails, which implies that colourization quality is highly correlated with tracking quality. Some failures are found in the following conditions:
- When the light changes drastically or frequently in the video
- The method successfully tracks objects through minor to medium occlusions, but still fails when the object undergoes major occlusion
- Sudden change in the size of the object
Conclusion
Here we saw how a model can learn on its own without any manually annotated data. We learned how to train a CNN model on a proxy task and exploit that learning to perform the actual task. We used colourization as the proxy, but we are not limited to it; various new proxy tasks keep appearing. Self-supervised methods are the need of the hour, and they can remove the major constraint of expensive data collection for real-world use cases. This model cannot beat the current SOTA supervised models yet, but it outperforms many others.
The method is very promising in terms of its approach and flexibility. Soon, self-supervised models may become the first choice for solving ML problems because of these advantages. This article is based on research by Google Research, and all credit goes to them. I have tried to explain the research according to my knowledge and understanding.
About Me
I am Tushar Kolhe working as a Deep Learning Engineer at Fynd. My interest is in building computer vision models that solve real-world problems. Reach out for any suggestions or help at email.
I strongly believe anyone can learn anything if they have enough motivation, deep models too 😛 .
Appendix
Some interesting research in the object tracking field:
- Simple Online and Real-time Tracking with a Deep Association Metric. [Paper] → an extension of SORT
- Tracking without bells and whistles.[Paper]
- A simple baseline for one-shot multi-object tracking- FairMOT. [Paper]
- Learning a Neural Solver for Multiple Object Tracking. [Paper]
You can find a lot more interesting research out there; comment with any work that you find interesting.
References
- Vondrick, C., Shrivastava, A., Fathi, A., Guadarrama, S., & Murphy, K. (2018). Tracking Emerges by Colorizing Videos. ArXiv, abs/1806.09594.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition. ArXiv abs/1512.03385
- Richard Zhang, Phillip Isola, Alexei A. Efros. Colorful Image Colorization. ArXiv abs/1603.08511
- Mehdi Noroozi, Paolo Favaro. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. ArXiv abs/1603.09246
- Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, Ming-Hsuan Yang. Unsupervised Representation Learning by Sorting Sequences. ArXiv abs/1708.01246
- https://www.fast.ai/2020/01/13/self_supervised/