Contrasting Contrastive Learning Approaches

Thoughts and Theory

A deep dive into which Computer Vision tasks make good benchmarks, how datasets affect model performance, and which encoder makes the best general-purpose backbone.

Links: GitHub, Paper

In recent years we have seen an explosion of new self-supervised learning methods in the computer vision domain—researchers have managed to train neural networks that perform extremely well on common benchmarks like ImageNet classification using mostly unlabeled data.

Graphic by Winson Han

It turns out that understanding what makes an image different from others is enough to produce an abstract representation of that image which can be used for real-world tasks like semantic classification. Early successes with this approach have triggered an avalanche of publications describing variations on this theme that all marginally improve on each other.

We now have methods such as PIRL, CPC, SimCLR, MoCo, and SwAV, which all produce remarkable results using a specific type of self-supervised learning called Contrastive Learning, in which encoders are trained to recognize slightly visually augmented versions of the same image as similar to each other and different from other images.
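
To make this concrete, below is a minimal sketch of an NT-Xent-style contrastive loss, the general form popularized by SimCLR. The losses actually used by PIRL, MoCo, and SwAV differ in their details (memory banks, momentum encoders, cluster assignments), but the core idea of pulling augmented views of the same image together and pushing other images apart is the same.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.1):
    """NT-Xent-style contrastive loss over a batch of paired embeddings.

    z1, z2: (N, D) embeddings of two augmented views of the same N images.
    Each embedding is pulled toward its positive (the other view of the
    same image) and pushed away from the remaining 2N - 2 views.
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                 # (2N, D)
    sim = z @ z.t() / temperature                  # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))              # a view is not its own positive
    n = z1.size(0)
    # The positive for view i is the other augmented view of the same image.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```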

While this explosive pace of research is great for advancing a new idea, it also produces many separate threads that can be hard to compare or consolidate. In this blog post, I want to talk about the problems with the current state of self-supervised computer vision research and a paper I recently published, along with Gabriel Ilharco, Ludwig Schmidt, Kiana Ehsani, and Roozbeh Mottaghi, that aims to resolve some of them.

Before we dig in, let us quickly review a few key terms, and how I will use them in this post:

Pre-Training Algorithm: While the term "pre-training algorithm" is rather loosely defined in deep learning, in this post I will be using it to describe the entire pre-training pipelines proposed by recent popular works such as MoCo and SwAV.

Pre-Training Data: This is the dataset used for the self-supervised pre-training of computer vision encoders. Most works use ImageNet for this.

Encoder: In computer vision, we often separate our network into two components: a general-purpose feature extractor which encodes the raw pixel data of an image into a useful abstract representation, and an end task network that uses this abstract representation to accomplish some real-world task. The former is what I will call an encoder in this blog post.

End Task Network: As mentioned above, the end task network is the part of our model that is tailored to perform a specific real-world task like image classification, and thus it must be tuned for each task separately.

End Task: An end task is some useful task that our model can perform. Oftentimes these are practical things like estimating the depth of a room from an image or classifying the breed of a dog. End tasks are a way to tie our abstract models to real-world work that people can benefit from.

End Task Data: This is the training dataset associated with a particular end task that is used to train the end task network to do something useful with the abstract image representation produced by the encoder.

So in summary, Pre-Training Algorithms like SwAV use a Pre-Training Dataset to train an Encoder, which is a general-purpose tool for extracting abstract representations from images. End Task Networks are then trained on End Task Data to use these abstract representations to perform some useful real-world End Task.
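
In code, this separation might look roughly like the sketch below; the 10-class linear head is a made-up example standing in for whatever end task network a given task requires.

```python
import torch.nn as nn
import torchvision

# Encoder: a ResNet50 with its classification layer removed, producing a
# 2048-dimensional abstract representation of each image. In practice the
# weights would come from self-supervised pre-training (e.g. MoCo or SwAV).
encoder = torchvision.models.resnet50()
encoder.fc = nn.Identity()

# End task network: a small task-specific head trained on end task data
# (here, a hypothetical 10-class classification task).
end_task_network = nn.Linear(2048, 10)

def predict(images):
    features = encoder(images)           # abstract representation
    return end_task_network(features)    # task-specific prediction
```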

Graphic from Contrasting Contrastive Self-Supervised Models

Now that we are all caught up with the terminology, let’s dive into a couple of key problems that the rapid pace of innovation in the self-supervised vision field has brought with it.

1. Apples, Oranges, and Bananas

While the various proposed training algorithms all seek to create a good, general-purpose image encoder, they share very few directly comparable data points, by which I mean results obtained by applying each algorithm to the exact same model architecture, with the exact same pre-training data, and evaluated on the exact same end task.

More often than not, the set of perfectly matched data points is reduced to just one: ImageNet classification performance using a ResNet50 trained on ImageNet data. While this is a good benchmark, it becomes dangerous if it is the only benchmark we care about. Outside of it, the different papers provide results for non-overlapping subsets of end tasks, pre-training datasets, and model architectures, so comparing numbers across papers often amounts to comparing apples to oranges.

2. What Are We Chasing Anyways?

Since ImageNet classification is the only benchmark that the majority of the computer vision community can synchronize and agree on, it seems that the real goal being chased is not producing a good general-purpose image encoder, but producing an encoder that does well on ImageNet classification and similar end tasks. In a way, any researcher proposing a new algorithm is forced to chase this benchmark, since a high score will earn the algorithm more attention, but this inadvertently leads the community to optimize the proxy objective of "ImageNet performance" instead of the true objective of a good visual encoder. Combine this with the fact that most papers also use ImageNet as pre-training data, and we have the recipe for a powerful feedback loop that produces encoders that are good at learning the underlying distribution statistics of a particular dataset (like ImageNet) instead of being good at understanding what is in an image.

Our Work

Hopefully, this was enough to convince you that major inconsistencies exist in the field of self-supervised computer vision. Now let us talk about ways to address them. Specifically, I will talk about the methodology and findings of my recent paper Contrasting Contrastive Self-Supervised Representation Learning Models.

In order to get a standardized frame of reference for comparing various self-supervised algorithms and pre-training datasets, we had to fix many experimental variables. All of our testing was done using the same encoder architecture (ResNet50), and we froze the weights of our encoders when training the end task networks.
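
Concretely, freezing the encoder means its parameters receive no gradient updates while the end task network is trained on top of it. Here is a minimal sketch of that setup; the head size, optimizer, and learning rate are illustrative placeholders rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torchvision

num_classes = 100                          # placeholder for the end task at hand
encoder = torchvision.models.resnet50()    # weights would come from self-supervised pre-training
encoder.fc = nn.Identity()                 # expose the 2048-d features
head = nn.Linear(2048, num_classes)        # the end task network

# Freeze the encoder: no gradient updates, fixed batch-norm statistics.
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False

# Only the end task network's parameters are optimized.
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

def training_step(images, labels):
    with torch.no_grad():
        features = encoder(images)         # frozen feature extraction
    loss = nn.functional.cross_entropy(head(features), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```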

Despite freezing these variables, we still ran over 700 experiments using thousands of hours of GPU time. We tested a total of 30 encoders produced by 4 different pre-training algorithms (SwAV, MoCo v1, MoCo v2, and PIRL) on 4 different pre-training datasets (ImageNet, Places, Taskonomy, and Kinetics400), as well as a combination of the 4. We trained end task networks for each encoder on 20 end task training sets and report the results those encoders produce on the corresponding end task test sets (see the diagram of end tasks below).

For further details please refer to the paper.

Now let’s dig into the results…

Is ImageNet a Good Benchmark?

As mentioned above, it seems rather circular to evaluate models trained on the ImageNet dataset on the ImageNet classification end task. To measure how good of an indicator this is, we performed a correlation analysis between the performance of encoders on ImageNet and on other end tasks. That is, we computed how strongly performance on ImageNet tracks performance on other end tasks across our encoders. We find that ImageNet is not a very good indicator at all. While it is fairly good at predicting performance on similar tasks (like CalTech and CIFAR-100 classification), it does rather poorly at predicting performance on different tasks like depth prediction.
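
As a rough sketch of what such a correlation analysis can look like, the snippet below computes a Spearman rank correlation across encoders between ImageNet accuracy and another end task's score. The numbers are made-up placeholders, and the exact statistic and scores used in the paper are described there.

```python
from scipy.stats import spearmanr

# Hypothetical per-encoder scores: each position corresponds to one encoder,
# measured on ImageNet classification and on some other end task.
imagenet_acc = [76.1, 71.3, 69.8, 66.2]     # ImageNet top-1 accuracy (%)
depth_score = [0.62, 0.68, 0.70, 0.71]      # e.g. a depth-prediction metric

rho, p_value = spearmanr(imagenet_acc, depth_score)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A rho near 1 means ImageNet accuracy tracks the other task well;
# a rho near 0 (or negative) means it tells us little about that task.
```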

We broadly divide our tasks into four categories based on task type (Semantic or Structural) and output modality (Image-level or Pixelwise). Below is an illustration of all our end tasks and their corresponding categorization:

Graphic from Contrasting Contrastive Self-Supervised Models

The graph below plots ImageNet classification accuracy vs. performance on other end tasks. It illustrates the point that ImageNet performance is a good indicator for other Image-level Semantic tasks, but a weak signal for all other end task categories. Furthermore, we even see some negatively correlated results, suggesting that tuning an encoder to be very good at ImageNet classification causes it to ignore some information crucial for other task types.

Graphic from Contrasting Contrastive Self-Supervised Models

All in all, this shows that reporting just the ImageNet classification performance of a model is very limiting.

Are All Pre-training Datasets Created Equal?

Another area we wanted to explore was how much the pre-training data influences the qualities of the final model. Since the overwhelming majority of works in this field pre-train their encoders on ImageNet, there has not been much exploration along this axis. We trained MoCo v2 and SwAV encoders on 4 datasets: ImageNet, Places, Taskonomy, and Kinetics400. We subsampled all datasets to match the size of ImageNet and also trained on a combination of the 4.

Firstly, we found that encoders trained on ImageNet tended to be the best at solving semantic end tasks, while encoders trained on Places tended to be the best at solving structural end tasks. This makes a lot of sense, since ImageNet contains a lot of diverse images, while Places contains images of rooms and buildings. Furthermore, both Places and ImageNet have curated, labeled, and organized data, while Kinetics is a series of frame captures from YouTube videos and Taskonomy is a series of Matterport 3D scans. This suggests that although we are not using the labels explicitly, working with a clean and organized dataset still has some advantages. This calls into question the viability of training vision models on random, completely uncurated data from the internet, which is one of the great promises of self-supervised computer vision. While some recent works show success when training on large datasets collected from the internet, it is unclear how clean and organized that data is.

Secondly, we tested whether pre-training our encoder with self-supervised methods on a large dataset similar to our end task data produces a better encoder. For each of our pre-training datasets (ImageNet, Kinetics, Places, and Taskonomy) we picked a corresponding end task that uses a similar dataset or a subset of the same dataset (CalTech Classification, Kinetics Action Prediction, SUN Scenes Classification, and Taskonomy Depth Prediction, respectively). We plot the end task performance of all the encoders we trained on the 4 datasets below:

Graphic from Contrasting Contrastive Self-Supervised Models

Indeed, encoders pre-trained on data similar to the end task data tend to perform best on that task. This result is somewhat obvious for supervised learning, but in our work we also validated that it holds for contrastive learning. Interestingly, we found that the combination dataset on average produces encoders that are reasonably good at all tasks, but not the best at any task. In fact, we also find that encoders trained on ImageNet and Places outperform the combination encoders on average, so it seems that mixing datasets brings us fewer benefits than drawbacks.

Does Dataset Balance Matter?

In addition to the pre-training datasets mentioned above, we also tested the effects of pre-training on an unbalanced version of ImageNet, which we produced by logarithmically sampling the number of images per category (so that we have many samples from a few categories and few samples from many categories). We find that pre-training our encoder on a heavily unbalanced subset of ImageNet yields end task performance no worse than pre-training on a perfectly balanced subset of ImageNet of the same size. We test 3 different samplings of each dataset and see no large variance between them, which indicates that all subsamplings are equally good and that there are no magical classes that would offer a huge performance boost if many samples from that class ended up in our subsample. To slightly temper the excitement around this finding, it is important to mention that we used only small datasets (250k samples) trained for 200 epochs, so further work is needed to validate this trend on larger datasets and longer training runs.
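
For illustration, here is one way such a logarithmically unbalanced class distribution could be constructed; the decay range and minimum count are assumed values, not the paper's exact sampling code.

```python
import numpy as np

def log_unbalanced_counts(num_classes=1000, total=250_000, min_per_class=5):
    """Assign per-class sample counts that decay logarithmically, so a few
    classes contribute many images and most classes contribute only a few."""
    weights = np.logspace(0, -2, num_classes)   # 1.0 down to 0.01, log-spaced
    counts = (weights / weights.sum() * total).astype(int)
    return np.maximum(counts, min_per_class)    # keep every class represented

counts = log_unbalanced_counts()
print(counts[:3], counts[-3:], counts.sum())    # heavy head, long tail, roughly 250k total
```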

Do Different Pre-Training Algorithms Show Different Strengths?

Two of the training algorithms that we profiled extensively are MoCo v2 and SwAV. While not the primary focus of our work, our analysis brought up some interesting contrasting properties of the two algorithms.

MoCo v2 tends to be better at structural tasks, while SwAV outperforms it on image-level tasks. My high-level hypothesis as to why this happens is that since SwAV uses a clustering approach at the final layer, it tends to lose some spatial image information. Some support for this theory comes from the results of a layer-wise CKA analysis we performed on the encoders. We find that, on average, encoders trained with MoCo v2 show stronger agreement between early- and late-layer representations, suggesting that more spatial information is retained in the final encoding. The graph below showcases the performance difference between MoCo and SwAV encoders for pixelwise and image-level tasks:

Graphic from Contrasting Contrastive Self-Supervised Models

This can be a useful data point if we are trying to build a custom encoder for a task that requires spatial information, as we now have evidence suggesting that MoCo v2 is the better pre-training algorithm for the job. Here we see another shortcoming of chasing ImageNet classification performance as our benchmark: since SwAV outperforms MoCo v2 on that particular end task, many might assume that it is better in general, while our research shows that the reality is not quite so clear cut.
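
For readers curious about the layer-wise comparison, the standard linear CKA similarity between two sets of activations can be computed roughly as follows. This is a minimal sketch of the measure itself; the details of how activations were collected and compared are in the paper.

```python
import torch

def linear_cka(x, y):
    """Linear CKA similarity between two activation matrices.

    x: (N, D1) activations of N images at one layer (flattened or pooled),
    y: (N, D2) activations of the same images at another layer.
    Returns a value in [0, 1]; higher means more similar representations.
    """
    x = x - x.mean(dim=0, keepdim=True)     # center each feature dimension
    y = y - y.mean(dim=0, keepdim=True)
    numerator = (y.t() @ x).norm(p="fro") ** 2
    denominator = (x.t() @ x).norm(p="fro") * (y.t() @ y).norm(p="fro")
    return (numerator / denominator).item()
```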

Are Self-Supervised Encoders Good For All Downstream Tasks?

The short answer to this is yes. For every task we tested, the self-supervised models performed very well; in fact, for all but 3 tasks they outperformed the supervised ImageNet baseline. The 3 end tasks where the supervised encoder performed better were ImageNet classification, ImageNet v2 classification, and Pets classification (which is very similar to ImageNet). Since we are not fine-tuning the encoders for the task at hand, this result is not at all surprising, as a supervised encoder trained on ImageNet is effectively fine-tuned to the task during encoder training. For everything else, self-supervised methods perform better, which gives us a strong indication that they produce better all-purpose encoders.

Furthermore, we find that some end tasks, namely structural tasks, gain an even bigger boost from using a self-supervised model than others. The graph below shows that while some self-supervised encoders outperform the supervised benchmark in every task category, almost all self-supervised encoders outperform the supervised benchmark on structural tasks, even ones whose pre-training datasets and pre-training algorithms are a poor match for the end task:

Graphic from Contrasting Contrastive Self-Supervised Models

So What Encoder Should I Use?

After considering all of the above results, it is clear that the current standard computer vision encoder (a ResNet50 trained with supervised learning on ImageNet) is often not the best all-purpose encoder. We found that for almost every end task, some encoder trained with self-supervised learning does better, and one particular encoder (trained with SwAV on ImageNet) is better on over 75% of the end tasks.

The graph below shows the levels of relative improvement that self-supervised models offer over supervised ImageNet. It also shows that ImageNet and Places tend to be the two datasets achieving the best results as mentioned above.

Graphic from Contrasting Contrastive Self-Supervised Models

Further Unanswered Questions

While we ran over 700 experiments in our work, we have still only profiled a small section of the overall landscape of self-supervised computer vision. In order to obtain the detailed results that we did, we needed to fix many variables, which leaves us with many open questions, such as:

  1. How does model architecture affect the performance of the different self-supervised algorithms?
  2. Does fine-tuning the entire encoder significantly affect performance?
  3. Do the trends we observe wash out or get more pronounced if we train the encoders for many more epochs?

These are all good jumping-off points for future work that will further help us understand the benefits and drawbacks of self-supervised computer vision.

Conclusion

While this blog post has pointed out many flaws in the current work in the self-supervised vision domain, it is also important to celebrate its many achievements. Our paper found evidence suggesting that contrastive learning methods outperform supervised learning at producing a good general-purpose encoder, further validating that this is not simply a neat trick but a genuine and important advance. However, we have also shown that measuring progress in a single dimension (ImageNet classification) can lead us to overlook parts of the bigger picture (like the fact that MoCo v2 outperforms SwAV on nearly half of the end tasks we tested).

So, in summary, I would like to provide 4 key takeaways from this work that might aid computer vision researchers and engineers in their future projects:

  1. Self-supervised image encoders are great all-purpose feature extractors and you should consider using one for your next project. I would recommend a ResNet50 trained with SwAV for 800 epochs on ImageNet (or Places if your project is structural); see the loading sketch after this list.
  2. If you have a large amount of data in your domain, consider training a self-supervised encoder yourself with it, as this might give you an even greater performance boost.
  3. If you are developing a new self-supervised model make sure to evaluate it on a wide range of diverse tasks. Consider using the ViRB codebase that we are releasing with the project.
  4. If you are developing a new dataset (or training webly supervised models), the balance of your data classes might not be as important, but having some diverse samples is.
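
To expand on the first takeaway, here is roughly how a SwAV-pretrained ResNet50 can be loaded for feature extraction. This sketch assumes the torch.hub entry point published in the official facebookresearch/swav repository, so check that repository for the exact model names and available checkpoints.

```python
import torch

# Load a ResNet50 pre-trained with SwAV on ImageNet via torch.hub.
# (Assumes the entry point exposed by the facebookresearch/swav repo;
# see its README/hubconf for the exact names.)
encoder = torch.hub.load("facebookresearch/swav:main", "resnet50")
encoder.fc = torch.nn.Identity()   # drop the final layer, if present, to get raw features
encoder.eval()

# Extract features for a batch of (ImageNet-normalized) 3x224x224 images.
with torch.no_grad():
    features = encoder(torch.randn(8, 3, 224, 224))
print(features.shape)              # expected: torch.Size([8, 2048])
```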
