🚀 Sascha’s Paper Club

Monocular depth estimation is the prediction of distance in 3D space from a 2D image. This "ill-posed and inherently ambiguous problem", as stated in literally every paper on depth estimation, is a fundamental problem in computer vision and robotics. At the same time, foundation models dominate the scene in deep-learning-based NLP and computer vision. Wouldn’t it be awesome if we could leverage their success for depth estimation too?
In today’s paper walkthrough we’ll dive into Depth Anything, a foundation model for monocular depth estimation. We will discover its architecture, the tricks used to train it and how it is used for metric depth estimation.
Paper: Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data, Lihe Yang et al., 19 Jan. 2024
Resources: GitHub – Project Page – Demo – Checkpoints
Conference: CVPR2024
Category: Foundation Models, monocular depth estimation
Other Walkthroughs: [BYOL] – [CLIP] – [GLIP] – [Segment Anything] – [DINO] – [DDPM]
Outline
- Context & Background
- Method
- Qualitative Results
- Experiments & Ablations
- Further Readings & Resources
Context & Background
Why is depth such an important modality, and why use deep learning to estimate it?

Put simply: to navigate through 3D space, one needs to know where all the stuff is and at what distance. Classical applications include collision avoidance, drivable space detection, placing objects into virtual or augmented reality, creating 3D objects, navigating a robot to grab an object, and many more.
Depth can be captured with a variety of sensors featuring different measurement modalities, using active or passive measurement principles and different numbers of sensors. The problem is that every sensor has its pros and cons, but to sum it up: capturing accurate depth is expensive!
So, wouldn’t it be nice to use a sensor that is relatively cheap, does not need a complex setup, does not take up much space, is light in weight and is already available in many systems? That sensor is a camera. Using a single camera solves many issues but confronts us with new ones: predicting 3D information from 2D images is ill-posed, meaning we do not have enough information to unambiguously predict the depth: is the object small or simply far away? Is a surface concave or convex? Sounds like a job for deep learning, our general function approximator!
How has this problem been approached previously?
Over the years many deep learning approaches have been investigated to perform depth estimation from single images. Some tried regression-based approaches to directly predict a depth value, others discretized the depth range and performed a classification to predict which depth bin a pixel falls into.
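To make the distinction concrete, here is a minimal PyTorch sketch of the two flavors of prediction heads. The layer sizes and the `num_bins` parameter are made up for illustration and are not taken from any particular paper.

```python
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    """Predicts one continuous depth value per pixel."""
    def __init__(self, in_channels: int = 256):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, 1, H, W); softplus keeps the depth non-negative
        return nn.functional.softplus(self.conv(features))

class BinClassificationHead(nn.Module):
    """Predicts a distribution over discrete depth bins per pixel."""
    def __init__(self, in_channels: int = 256, num_bins: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, num_bins, kernel_size=1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, num_bins, H, W); softmax over the bin dimension
        return self.conv(features).softmax(dim=1)
```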
With the rise of generative deep learning many researchers also used GANs, VAEs or even diffusion-based models, including myself 🤓
RGB-D-Fusion: Image Conditioned Depth Diffusion of Humanoid Subjects by Sascha Kirch et.al., 2023
While many models showed promising performance, their usage has often been very application specific: e.g. indoor vs. outdoor scenes, predicting relative depth vs. metric depth, sparse vs. dense depth.
Pretrained foundation models have shown great zero-shot performance and can be fine-tuned to new applications with relatively few data samples. Since they are quite data-hungry, foundation models are usually only available for modalities that are abundant on the internet: images and text, but not depth data… until now!
Method
Having talked about why depth is important and that having a Foundation Model for depth data would be desirable, let’s dive into how the authors of Depth Anything managed to achieve this.
Depth Anything in a Nutshell
Depth Anything is an encoder-decoder model that takes an image as input and predicts a depth map. It is trained on a combination of labeled and unlabeled images.
Up to three stages are involved in the training process (sketched in code after the list):
- Train a teacher model on labeled images.
- Pretrain a student model (Depth Anything model) using the teacher to create pseudo labels for unlabeled data.
- (Optionally) Fine-tune the model to a specific task or dataset.
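Very roughly, the three stages could look like the following pseudo-Python, where `DepthModel`, `train_supervised`, `pseudo_label`, `mix`, `finetune` and the data loaders are all hypothetical placeholders standing in for the actual training code:

```python
# Stage 1: train a teacher model on the labeled images only.
teacher = DepthModel(encoder="dinov2_pretrained", decoder="random_init")
train_supervised(teacher, labeled_loader)

# Stage 2: the frozen teacher pseudo-labels the unlabeled images; the student
# (same architecture, fresh randomly initialized decoder) is then trained on
# the combination of labeled and pseudo-labeled data.
student = DepthModel(encoder="dinov2_pretrained", decoder="random_init")
pseudo_labeled_loader = pseudo_label(teacher, unlabeled_loader)
train_supervised(student, mix(labeled_loader, pseudo_labeled_loader))

# Stage 3 (optional): fine-tune the student on a downstream task,
# e.g. metric depth estimation on a specific dataset.
finetune(student, downstream_loader)
```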

Digging a bit deeper
Collecting large amounts of data with labels is expensive and time consuming. On the other hand, unlabeled images are available in vast amounts and are easy to collect, so naturally we’d like to utilize this data for training deep learning models. Some previous work had the same idea and used unlabeled data in combination with classical computer vision algorithms like stereo matching or structure from motion to obtain depth labels, but this is quite time consuming.
The authors of Depth Anything trained a strong teacher model on the labeled dataset of 1.5M samples, initializing it with a strong pre-trained DINOv2 encoder (a foundation model known for its semantically rich representations), and used it to generate pseudo labels for the 62M unlabeled samples that are then used to train the student.
In that way the student can be trained with supervised techniques, and dense depth maps for the unlabeled data are obtained quickly and easily. Note that the pseudo labels are not perfect though, and further tricks are required to train the student. The student has the same architecture as the teacher but does not reuse the teacher’s weights: its encoder is also initialized with DINOv2, but its decoder is initialized randomly.
Some Words on the Feature Alignment
Previous work has shown that depth estimation can be improved by training the model with an additional semantic segmentation task. The intuition behind it is that a model with a better understanding of the world (more meaningful feature embeddings) can also better resolve ambiguities in the depth estimation task.
In a failed attempt, the authors of Depth Anything tried to train a shared encoder with individual decoders for the depth estimation task and the semantic segmentation task on the unlabeled data. The unlabeled images had been labeled with a combination of foundation models.
Eventually they implemented a feature alignment constraint. During the training of the student model, the encoder of the student is constrained to produce features similar to those of the frozen DINOv2 encoder. Their similarity, as we will see later, is enforced via a cosine similarity.
However, since DINOv2 produces very similar features for subparts of an object (e.g. the front and rear of a car), while in depth estimation those sub-parts can have very different depth values, they also introduce a margin so as not to enforce the similarity too strongly, while still benefiting from it.
And What about those Perturbations?
It turns out that simply adding unlabeled data to the training pipeline is not sufficient to improve performance over the labeled-data-only baseline.
What the authors did is add strong perturbations to the unlabeled images consumed by the student. Note that these are added only to the student input, not to the teacher input and not to the DINOv2 encoder input. Similar approaches have already been used in other self-supervised frameworks like DINO and BYOL.
These perturbations include augmentation techniques like color jittering, noising the image and spatial distortions.
The intuition behind them is always similar: make the model learn similar features for the same things regardless of their appearance in the image. A car is a car regardless of the image being blurred or horizontally mirrored.
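As a rough illustration, such strong perturbations could be built from standard torchvision transforms as below; the exact augmentations and parameters used in the paper may differ from this sketch.

```python
import torchvision.transforms as T

# Strong perturbations applied ONLY to the unlabeled images fed to the student;
# the teacher and the frozen DINOv2 encoder see the clean images.
strong_perturbations = T.Compose([
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    T.RandomGrayscale(p=0.2),
    T.GaussianBlur(kernel_size=9, sigma=(0.1, 2.0)),
])
```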
The Data used to train the Model
The authors compose a dataset consisting of 1.5 million labeled images and 62 million unlabeled images of a variety of indoor and outdoor scenes.
The labels are dense depth maps, meaning each pixel has a depth value associated with it (which might require interpolation if the map is smaller than the image).
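For that interpolation, a depth map at a lower resolution can simply be resized to the image resolution, for instance with bilinear interpolation. This is a generic sketch, not the paper's exact resizing code:

```python
import torch
import torch.nn.functional as F

def upsample_depth(depth: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
    """Bilinearly resize a (B, 1, h, w) depth map to the (H, W) of the image."""
    return F.interpolate(depth, size=image.shape[-2:],
                         mode="bilinear", align_corners=False)
```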

As mentioned earlier, the labeled data is used to train a teacher model and the student model. The unlabeled data is used to train the student model only.
The Loss Function to Train the Model
The student model’s loss function is an average of 3 loss terms:
- Loss for labeled images between student predictions and ground truth.
- Loss for feature alignment between student encoder and DINOv2.
- Loss for unlabeled images between student prediction and pseudo label generated by the teacher from the unlabeled data.
Let’s start with the labeled images: it is a simple mean absolute error (MAE) between the predictions of the student on the images of the labeled dataset and their corresponding labels.
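In code, this term could look like the following minimal sketch; any normalization of the depth values before comparison is omitted here.

```python
import torch

def labeled_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean absolute error between student prediction and ground-truth depth."""
    return (pred - target).abs().mean()
```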

The feature alignment loss is calculated as the average cosine similarity between the encoder output of the student model and the DINOv2 encoder output. Since we want the loss to penalize dissimilarities, we subtract the cosine similarity from 1. Note that the cosine similarity measures the angle between two vectors, not their distance.

Note: The paper mentions a margin 𝛼 to not constrain the similarity too strongly, as described earlier. The paper is not really clear about this, but I guess they ignore per-pixel losses that are < 𝛼, and they use 𝛼 = 0.15 according to their ablation studies.
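Based on this description (and my reading of the margin), the feature alignment loss could be sketched as follows. The per-pixel masking with 𝛼 is an interpretation, not a confirmed detail of the official implementation.

```python
import torch
import torch.nn.functional as F

def feature_alignment_loss(student_feat: torch.Tensor,
                           dino_feat: torch.Tensor,
                           alpha: float = 0.15) -> torch.Tensor:
    """Penalize dissimilarity between student and frozen DINOv2 features.

    student_feat, dino_feat: (B, N, C) patch/pixel feature vectors.
    """
    cos = F.cosine_similarity(student_feat, dino_feat, dim=-1)  # (B, N)
    per_pixel_loss = 1.0 - cos
    # Margin: pixels that are already similar enough (loss < alpha) are ignored,
    # so the constraint is not enforced too strongly (interpretation, see text).
    mask = per_pixel_loss >= alpha
    if mask.sum() == 0:
        return per_pixel_loss.new_zeros(())
    return per_pixel_loss[mask].mean()
```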
Finally, let’s have a look at the unlabeled data. In addition to the strong perturbations applied to individual images, the authors also apply a CutMix augmentation to 50% of the samples. It basically combines two images using a binary mask.

The combined image is then fed into the student model, while the individual images are fed into the teacher model to obtain the pseudo labels. Then a mean absolute error is calculated. Note that the loss is calculated twice, once for each contributing image, as determined by the mask M.

Finally, the individual terms are added and the average over all pixels is taken. I guess there is no need to multiply by the mask again, since that has already been done in the equation above.
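Putting the unlabeled part together, a sketch of the CutMix-based loss might look like the code below. The mask generation and the 50% application rate are left out, the strong perturbations are only hinted at in a comment, and the combination of the two masked terms follows my reading of the paper described above.

```python
import torch

def cutmix(img_a: torch.Tensor, img_b: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Combine two images with a binary mask M (1 keeps img_a, 0 keeps img_b)."""
    return img_a * mask + img_b * (1.0 - mask)

def unlabeled_loss(student, teacher, img_a, img_b, mask):
    # Pseudo labels come from the teacher on the individual, clean images.
    with torch.no_grad():
        pseudo_a = teacher(img_a)
        pseudo_b = teacher(img_b)

    # The student sees the combined image (strong perturbations would be
    # applied to the student input before this call).
    pred = student(cutmix(img_a, img_b, mask))

    # One MAE term per contributing image, restricted to its region via the
    # mask; both terms are added and averaged over all pixels.
    per_pixel = (pred - pseudo_a).abs() * mask + (pred - pseudo_b).abs() * (1.0 - mask)
    return per_pixel.mean()
```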

Qualitative Results
Before diving into the experiments and ablations, let’s change the order of the paper and first inspect some qualitative results.
In a first test, the model was run on unseen images from different domains, including indoor and outdoor scenes with different lighting conditions.

Further, they compare its performance against MiDaS v3.1, showing that the Depth Anything model generally captures more details and has a better overall understanding of the scene, allowing it to resolve ambiguities much better.

In a final qualitative test, they used ControlNet, a diffusion-based generative model, to generate new images conditioned on the predicted depth map and an input prompt.

Experiments and Ablations
Let’s now take a closer look at the experiments and ablations that were performed to evaluate the effectiveness of their method.
Experiments
In summary, the authors show in their experiments that:
- the Depth Anything encoder outperforms the previous SOTA on zero-shot relative depth estimation and fine-tuned metric depth estimation, even with a smaller backbone.
- the feature alignment constraint is effective, and the Depth Anything encoder has semantically rich features that can be fine-tuned into a segmentation model.
Zero-Shot, Relative Depth Estimation
In this experiment, the Depth Anything model is pre-trained on the dataset composed by the authors and evaluated on other datasets without having seen a sample from those datasets (zero-shot). It is compared against the best checkpoint of MiDaS v3.1.

Fine-Tuned (In-Domain) Metric Depth Estimation
In this experiment the Depth Anything model pre-trained on relative depth estimation is fine-tuned for metric depth estimation. An encoder-decoder model is initialized with the pre-trained Depth Anything encoder and a randomly initialized decoder. It is then fine-tuned using the ZoeDepth framework on the training set of a given dataset and finally evaluated on the same dataset and compared to other models.

Fine-Tuned (Zero-Shot) Metric Depth Estimation
Here the same pre-trained Depth Anything model is fine-tuned on metric depth estimation, but this time evaluated on a different dataset than the one it was fine-tuned on.
The ZoeDepth framework is used to fine-tune two metric depth estimation models. The first one uses MiDaS v3.1 (shown as ZoeDepth in the table) as encoder and the second one uses the Depth Anything encoder.

Fine-Tuned Semantic Segmentation
In this experiment the authors test the semantic capability of the Depth Anything encoder. Recall that during training it was constrained, via the feature alignment loss, to have features similar to those of the DINOv2 encoder, a foundation model known for its rich semantic embeddings.

Ablations
General Ablation Study
In this general ablation study, the authors tested the effectiveness of their different loss terms and constraints, as well as of the strong perturbations on unlabeled data.

Different Encoders for Downstream Task
In this final ablation study the authors show that their encoder is superior to the DINOv2 and MiDaS encoders when fine-tuned on depth estimation and semantic segmentation tasks, suggesting that the feature space of the Depth Anything encoder captures more semantic information.

Conclusion
Depth Anything is a great step towards foundation models for domains other than images or text. The authors have successfully trained a strong and semantically rich encoder model that can either be used zero-shot or further fine-tuned on a custom dataset.
Further Readings & Resources
Here is a list of links with further resources on Depth Anything and foundation models.
Hugging Face Demo to play with Depth Anything:
Paper Walkthroughs
You might also like my other paper walkthroughs covering other foundation models: