The Data-centric AI Concepts in Segment Anything

Unpacking the data-centric AI concepts used in Segment Anything, the first foundation model for image segmentation

Henry Lai
Towards Data Science


Segment Anything dataset construction. Image from the paper https://arxiv.org/pdf/2304.02643.pdf

Artificial Intelligence (AI) has made remarkable progress, especially in the development of foundation models, which are trained on large quantities of data and can be adapted to a wide range of downstream tasks.

A notable success among foundation models is Large Language Models (LLMs). These models can perform complex tasks with great precision, such as language translation, text summarization, and question answering.

Foundation models are also starting to change the game in Computer Vision. Meta’s Segment Anything is a recent development that’s causing a stir.

The success of Segment Anything can be attributed to its large labeled dataset, which has played a crucial role in enabling its remarkable performance. The model architecture, as described in the Segment Anything paper, is surprisingly simple and lightweight.

In this article, drawing upon insights from our recent survey papers [1,2], we will take a closer look at Segment Anything through the lens of data-centric AI, a growing concept in the data science community.

What Can Segment Anything Do?

In a nutshell, image segmentation is the task of predicting a mask that separates the areas of interest in an image, such as an object or a person. Segmentation is a very important task in Computer Vision, making images more meaningful and easier to analyze.

The difference between Segment Anything and other image segmentation approaches lies in the introduction of prompts to specify where to segment. A prompt can be quite loose, such as a single point or a rough box.

The image above is a screenshot from https://segment-anything.com/, produced by uploading a photo taken by the author.
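To get a feel for this prompt-driven behavior in code, below is a minimal sketch using Meta's open-source segment-anything package, prompting the model with a single foreground point. The checkpoint path, image file, and point coordinates are placeholders you would replace with your own.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (placeholder path; download a checkpoint from the
# official segment-anything repository first).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_checkpoint.pth")
predictor = SamPredictor(sam)

# Load an image and hand it to the predictor (it expects RGB).
image = cv2.cvtColor(cv2.imread("your_image.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Prompt with a single (x, y) point; label 1 marks it as foreground.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return several candidate masks for the prompt
)
print(masks.shape, scores)  # (num_masks, H, W) boolean masks and their scores
```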

What is Data-centric AI?

Comparison between data-centric AI and model-centric AI. https://arxiv.org/abs/2301.04819 Image by the author.

Data-centric AI is a novel approach to AI system development, which has been gaining traction and is being promoted by AI pioneer Andrew Ng.

Data-centric AI is the discipline of systematically engineering the data used to build an AI system. — Andrew Ng

Previously, the primary focus was on developing better models while the data remained largely unchanged, an approach referred to as model-centric AI. However, this can be problematic in real-world scenarios because it fails to account for issues in the data itself, such as inaccurate labels, duplicates, and biases. Consequently, fitting a dataset more closely does not necessarily result in better model behavior.

Data-centric AI, on the other hand, prioritizes enhancing the quality and quantity of data utilized in creating AI systems. The focus is on the data itself, with relatively fixed models. Adopting a data-centric approach in developing AI systems has more promise in real-world applications since the maximum capability of a model is determined by the data used for training.

It’s crucial to distinguish between “data-centric” and “data-driven” approaches. “Data-driven” methods only rely on data to steer AI development, but the focus remains on creating models instead of engineering data, making it fundamentally different from “data-centric” approaches.

The data-centric AI framework encompasses three main objectives:

  • Training data development entails gathering and generating high-quality, diverse data to facilitate the training of machine learning models.
  • Inference data development involves constructing innovative evaluation sets that offer detailed insights into the model or unlock specific capabilities of the model through engineered data inputs, such as prompt engineering.
  • Data maintenance aims to ensure the quality and dependability of data in a constantly changing environment.

Data-centric AI framework. https://arxiv.org/abs/2303.10158. Image by the author.

The Model used in Segment Anything

Segment Anything Model. Image from the paper https://arxiv.org/pdf/2304.02643.pdf

The model design is surprisingly simple. The model mainly consists of three parts:

  1. Prompt encoder: This part is used to obtain the representation of the prompt, either through positional encoding or convolution.
  2. Image encoder: This part directly uses the Vision Transformer (ViT) without any special modifications.
  3. Lightweight mask decoder: This part mainly fuses prompt embedding and image embedding, using mechanisms such as attention. It is called lightweight because it has only a few layers.

The lightweight mask decoder is interesting, as it allows the model to be deployed easily, even with just CPUs. Below is a comment from the authors of Segment Anything:

Surprisingly, we find that a simple design satisfies all three constraints: a powerful image encoder computes an image embedding, a prompt encoder embeds prompts, and then the two information sources are combined in a lightweight mask decoder that predicts segmentation masks.
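To make this three-part structure concrete, here is a highly simplified, PyTorch-style sketch of how an image embedding and a prompt embedding could be fused in a small attention-based decoder. It is a conceptual illustration only, not the actual Segment Anything implementation; every module choice and dimension below is made up for the example.

```python
import torch
import torch.nn as nn

class TinySegmentAnything(nn.Module):
    """Conceptual sketch only: image encoder + prompt encoder + light decoder."""

    def __init__(self, embed_dim=256):
        super().__init__()
        # Stand-in for the heavy ViT image encoder used by the real model.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, embed_dim, kernel_size=16, stride=16),  # crude patchify
            nn.ReLU(),
        )
        # Stand-in for the prompt encoder: embeds a single (x, y) point prompt.
        self.prompt_encoder = nn.Linear(2, embed_dim)
        # "Lightweight mask decoder": fuses the two embeddings with attention.
        self.fuse = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        self.mask_head = nn.Linear(embed_dim, 1)

    def forward(self, image, point):
        img_emb = self.image_encoder(image)                      # (B, C, H', W')
        b, c, h, w = img_emb.shape
        img_tokens = img_emb.flatten(2).transpose(1, 2)          # (B, H'*W', C)
        prompt_tokens = self.prompt_encoder(point).unsqueeze(1)  # (B, 1, C)
        fused, _ = self.fuse(img_tokens, prompt_tokens, prompt_tokens)
        logits = self.mask_head(fused).transpose(1, 2).reshape(b, 1, h, w)
        return logits                                            # low-res mask logits


model = TinySegmentAnything()
out = model(torch.randn(1, 3, 224, 224), torch.rand(1, 2))
print(out.shape)  # torch.Size([1, 1, 14, 14])
```

Even in this toy form, the division of labor is clear: almost all of the capacity sits in the image encoder, while the decoder that actually produces the mask stays small.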

Therefore, the secret of Segment Anything’s strong performance is very likely not the model design, as it is very simple and lightweight.

Data-centric AI Concepts in Segment Anything

The core of training Segment Anything lies in its large annotated dataset, which contains over 1.1 billion masks, 400 times more than existing segmentation datasets. How did they achieve this? The authors used a data engine to perform the annotation, which can be broadly divided into three steps:

  1. Assisted-manual annotation: This step can be understood as an active learning process. First, an initial model is trained on public segmentation datasets. Next, annotators correct the masks predicted by the model. Finally, the model is retrained with the newly annotated data. This annotate-and-retrain cycle was repeated six times, ultimately resulting in 4.3 million mask annotations.
  2. Semi-automatic annotation: The goal of this step is to increase the diversity of masks, and it can also be understood as an active learning process. The idea is simple: if the model can already generate good masks on its own, human annotators don't need to label them, and human effort can focus on the masks the model is not confident about. The method used to find confident masks is quite interesting and involves object detection on the masks from the first step. For example, suppose an image contains 20 possible masks. The current model will likely segment only a portion of them well, so we need a way to identify the good (confident) masks automatically. The paper's approach is to run object detection on the predicted masks to check whether an object can be found in each one. If, say, eight masks are found to contain objects and are therefore considered confident, the annotators only need to label the remaining 12, saving human effort. This process was repeated five times, adding another 5.9 million mask annotations.
  3. Fully-automatic annotation: Simply put, this step uses the model trained in the previous step to annotate data. Some strategies were used to improve annotation quality, including:
    (1) filtering out less confident masks based on predicted Intersection over Union (IoU) values (the model has a head to predict IoU).
    (2) only considering stable masks, meaning masks that remain largely unchanged when the threshold is shifted slightly above or below 0.5. Specifically, the model outputs a value between 0 and 1 for each pixel, and 0.5 is typically used as the threshold to decide whether a pixel belongs to the mask. Stability means that when the threshold is moved around 0.5 (e.g., from 0.45 to 0.55), the resulting mask barely changes, indicating that the model's per-pixel predictions are confidently far from the decision boundary.
    (3) deduplication was performed with non-maximum suppression (NMS).
    This step annotated 1.1 billion masks (an increase of more than 100 times in quantity).
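To make the fully-automatic filtering more concrete, here is a minimal sketch of a stability check and confidence-based filtering in the spirit of the three rules above. It is illustrative only, not the authors' implementation: the function names, the assumption that the model outputs per-pixel probabilities in [0, 1], and the cutoff values are all assumptions made for this example.

```python
import numpy as np

def stability_score(probs, threshold=0.5, offset=0.05):
    """IoU between the masks obtained at (threshold + offset) and
    (threshold - offset); a score near 1.0 means the mask barely
    changes when the threshold moves, i.e., the mask is stable."""
    strict = probs > (threshold + offset)
    loose = probs > (threshold - offset)
    intersection = np.logical_and(strict, loose).sum()
    union = np.logical_or(strict, loose).sum()
    return intersection / max(union, 1)

def filter_masks(candidates, min_pred_iou=0.88, min_stability=0.95):
    """Keep only confident, stable masks. Each candidate is assumed to be a
    dict with 'probs' (an HxW array of per-pixel probabilities) and
    'pred_iou' (the IoU predicted by the model's IoU head). The cutoffs
    here are illustrative, not the paper's exact settings."""
    kept = []
    for mask in candidates:
        if mask["pred_iou"] < min_pred_iou:
            continue  # rule (1): drop less confident masks
        if stability_score(mask["probs"]) < min_stability:
            continue  # rule (2): drop unstable masks
        kept.append(mask)
    return kept  # rule (3): deduplicate the survivors with NMS afterwards
```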

Does this process sound familiar? That's right, the Reinforcement Learning from Human Feedback (RLHF) used in ChatGPT is quite similar to the process described above. The commonality between the two approaches is that instead of relying directly on humans to annotate data, a model is first trained with human input and then used to help annotate data. In RLHF, that model is a reward model that provides rewards for reinforcement learning, while in Segment Anything, the segmentation model itself is trained to annotate images directly.

Summary

The core contribution of Segment Anything lies in its large annotated dataset, which demonstrates the crucial importance of the data-centric AI concept. The arrival of foundation models in computer vision was arguably inevitable, but it is still surprising how quickly it happened. Going forward, I believe other AI subfields, and even non-AI and non-computer-related fields, will see the emergence of foundation models in due course.

No matter how technology evolves, improving data quality and quantity will always be an effective way to enhance AI performance, making the concept of data-centric AI increasingly important.

I hope this article can inspire you in your own work. You can learn more about the data-centric AI framework in the following papers/resources:

  • Data-centric AI: Perspectives and Challenges (https://arxiv.org/abs/2301.04819)
  • Data-centric Artificial Intelligence: A Survey (https://arxiv.org/abs/2303.10158)

If you found this article interesting, you may also want to check out my previous article: What Are the Data-Centric AI Concepts behind GPT Models?

Stay tuned!
