
Foundation models are flexible deep learning models designed for general use rather than for one specific task. Trained on large amounts of unlabeled data, they can be applied to a variety of downstream tasks with minimal fine-tuning. Foundation models are well known from natural language processing (BERT, GPT-x) and image generation (DALL-E).
In August 2023, NASA and IBM released the Geospatial AI Foundation Model for NASA Earth Observation Data. The model is available open source on Huggingface under the name Prithvi, after the Hindu goddess of the Earth. It has been trained on NASA satellite data; according to IBM, more than 250 petabytes of data are available.
In this blog post, we discuss
- The NASA Harmonized Landsat Sentinel-2 (HLS) dataset used for training,
- The architecture of the Prithvi-100M Geospatial AI Foundation Model,
- The training process on IBM’s Vela supercomputer,
- Example applications: flooding and crop type identification.
Training data
The Geospatial AI Foundation Model has been trained on NASA Harmonized Landsat Sentinel-2 (HLS) data.
Sentinel-2 is a satellite mission coordinated by the European Space Agency, with two satellites currently in orbit taking high-resolution images of the Earth. It focuses on land, coastal areas, and selected open waters. The Landsat satellites were launched by NASA to record surface reflectance. The harmonized data combine the input from both sensors, resulting in a spatial resolution of about 30 meters and an average revisit time of two to three days. This resolution is sufficient for agricultural monitoring, land use classification, and natural disaster detection.
Standard photographs are made up of three colors: red, green, and blue. The Sentinel-2 data is available in a total of 13 "colors", so-called bands, spanning the visible, near-infrared, and shortwave infrared range of the electromagnetic spectrum. Selected bands can be used to identify different things, e.g. the infrared bands contain information about the vegetation. For background, see this post on Sentinel-2 band combinations.
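As a small illustration of how bands are combined, here is a minimal sketch that computes the Normalized Difference Vegetation Index (NDVI) from the red and near-infrared bands; the reflectance values below are made up for illustration.

```python
import numpy as np

# Hypothetical surface reflectance values for a tiny 3x3 tile (made up).
red = np.array([[0.10, 0.12, 0.30],
                [0.09, 0.11, 0.28],
                [0.08, 0.10, 0.25]])
nir = np.array([[0.45, 0.50, 0.32],
                [0.48, 0.52, 0.30],
                [0.50, 0.55, 0.27]])

# NDVI = (NIR - Red) / (NIR + Red); high values indicate dense vegetation.
ndvi = (nir - red) / (nir + red)
print(ndvi.round(2))
```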

Clouds block the view of Earth observation satellites. To counteract this effect, Sentinel-2 provides a band that can be used to identify cloud cover. The affected pixels are masked so as not to confuse the image processing algorithms.
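To make this concrete, here is a minimal sketch of applying such a mask, assuming a boolean cloud mask has already been derived from the quality band; the arrays are made up.

```python
import numpy as np

# Hypothetical reflectance band and boolean cloud mask (True = cloudy pixel).
band = np.random.rand(4, 4).astype(np.float32)
cloud_mask = np.zeros((4, 4), dtype=bool)
cloud_mask[0, 1] = cloud_mask[2, 3] = True

# Replace cloudy pixels with NaN so downstream algorithms can ignore them.
masked_band = np.where(cloud_mask, np.nan, band)
print(np.nanmean(masked_band))  # statistics over clear pixels only
```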
By themselves, Sentinel-2 and Landsat data are unlabeled: it takes significant human effort and expertise to provide a pixel-wise classification into land use categories. Foundation models extract structure from data without requiring labels for the initial stage of the training procedure, which makes them very promising for Earth observation data.
Model Architecture
The Prithvi-100M Geospatial AI Foundation Model builds on a temporal vision transformer and on a masked autoencoder. The model card is shown on Huggingface:

The model accepts HLS images in a video-like format as input: images taken at the same location are stacked into a time series, while static images can be processed by setting the time series length to 1. The spectral bands correspond to the channels of the vision transformer.
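To illustrate the input format, here is a toy sketch assuming a (batch, bands, time, height, width) layout for the video-like input; the sizes are illustrative and may differ from the released configuration.

```python
import torch

# Hypothetical toy example: batch of 2 samples, 6 spectral bands,
# a time series of 3 acquisitions, 224x224 pixel tiles.
x_timeseries = torch.randn(2, 6, 3, 224, 224)   # (batch, bands, time, H, W)

# A single static image is just a time series of length 1.
x_static = torch.randn(2, 6, 1, 224, 224)
print(x_timeseries.shape, x_static.shape)
```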
Vision Transformer
In 2020, a team from Google Research showed that transformers could be applied not only to natural language processing, but also to images (Dosovitskiy et al., ICLR 2021). Until then, convolutional neural networks had been the de facto standard for image processing.
The vision transformer first cuts images into small patches, similar to the tokenization of sentences for a language processing transformer. Then, learnable embeddings and positional encodings are added. In the original paper, it was shown that with large amounts of training data, the vision transformer can outperform typical computer vision architectures such as ResNet.
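The following sketch shows the basic idea of patch embedding for an ordinary RGB image: cut the image into 16×16 patches, flatten them, and apply a learnable linear projection plus a (here learnable) positional embedding. Sizes are illustrative.

```python
import torch
import torch.nn as nn

patch_size, embed_dim = 16, 768
img = torch.randn(1, 3, 224, 224)                      # (batch, channels, H, W)

# Cut the image into 16x16 patches and flatten each patch into a vector.
patches = img.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size**2)  # (1, 196, 768)

# Learnable linear projection and positional embedding for each patch token.
proj = nn.Linear(3 * patch_size**2, embed_dim)
pos_embed = nn.Parameter(torch.zeros(1, patches.shape[1], embed_dim))
tokens = proj(patches) + pos_embed                      # (1, 196, 768)
print(tokens.shape)
```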
Masked Autoencoder
The Prithvi-100M masked autoencoder is based on the original implementation by He et al (2021), https://arxiv.org/pdf/2111.06377.pdf. The concept is straightforward:
Random patches in an image are masked. The autoencoder learns to predict the missing pixels. This is similar to the training of large language models, where the model learns to predict missing words from a sentence.
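A minimal sketch of this random masking step, in the spirit of the per-sample random shuffling used in MAE (simplified, with the paper's default masking ratio of 75%):

```python
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of patch tokens; return kept tokens and the binary mask."""
    batch, num_patches, dim = tokens.shape
    num_keep = int(num_patches * (1 - mask_ratio))

    # A random permutation per sample decides which patches are kept.
    noise = torch.rand(batch, num_patches)
    ids_shuffle = noise.argsort(dim=1)
    ids_keep = ids_shuffle[:, :num_keep]
    kept = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, dim))

    # Binary mask: 1 marks a masked (removed) patch, 0 a visible one.
    mask = torch.ones(batch, num_patches)
    mask.scatter_(1, ids_keep, 0.0)
    return kept, mask

kept, mask = random_masking(torch.randn(2, 196, 768))
print(kept.shape, mask.sum(dim=1))   # (2, 49, 768), 147 masked patches per sample
```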
In the original paper, 2D images with RGB (red, green, blue) color channels are considered. The difference between training on language data and image data is extensively discussed in the paper.
The encoder works only on the unmasked patches, which saves computation time. The patch embedding is a linear projection of the individual patches, with learnable parameters.
Position embedding is important so that the algorithm knows where a patch is in the original image. In the case of the Masked Autoencoder, position embedding is provided by a 2D sine-cosine function that is typically used in transformer models. It encodes the position of a patch in a 2D grid of the overall image. The positional embedding may contain learnable parameters, but this does not appear to be the case in the implementation in the MAE repository.
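The following is a simplified reimplementation of a fixed 2D sine-cosine positional embedding in the spirit of the MAE code, not the repository code itself:

```python
import numpy as np

def sincos_1d(embed_dim: int, positions: np.ndarray) -> np.ndarray:
    """Standard transformer sine-cosine encoding for a 1D list of positions."""
    omega = 1.0 / 10000 ** (np.arange(embed_dim // 2) / (embed_dim / 2.0))
    angles = np.outer(positions, omega)                   # (num_pos, embed_dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def sincos_2d(embed_dim: int, grid_size: int) -> np.ndarray:
    """Concatenate encodings of the row and column index of each patch."""
    rows, cols = np.meshgrid(np.arange(grid_size), np.arange(grid_size), indexing="ij")
    emb_h = sincos_1d(embed_dim // 2, rows.ravel())
    emb_w = sincos_1d(embed_dim // 2, cols.ravel())
    return np.concatenate([emb_h, emb_w], axis=1)          # (grid_size**2, embed_dim)

pos_embed = sincos_2d(768, 14)     # 14x14 patches of a 224x224 image
print(pos_embed.shape)             # (196, 768)
```

Because the encoding is a fixed function of the patch position, it adds no trainable parameters, which matches the behavior described above.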

Changes to the MAE Architecture
In order to process time series of satellite data with more spectral bands, the NASA and IBM team made several changes to the Masked Autoencoder architecture (a code sketch follows the list):
- The 2D patch embedding was changed to a 3D patch embedding
- The 2D positional embedding was changed to a 3D positional embedding
- The patch creation takes into account the 3D nature of the data
- Besides RGB colors, one near-infrared and two short-wave infrared bands were added
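To illustrate the first two points, here is a minimal sketch of a 3D patch embedding implemented as a 3D convolution whose kernel spans one time step and one spatial patch. The band count, tile size, and patch size are illustrative and may differ from the released model.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 6 bands, time series length 3, 224x224 tiles,
# patches of 1 time step x 16 x 16 pixels, 768-dimensional embeddings.
bands, t_len, patch, embed_dim = 6, 3, 16, 768
x = torch.randn(2, bands, t_len, 224, 224)                # (batch, bands, time, H, W)

# A 3D convolution with stride equal to the patch size yields one token per 3D patch.
patch_embed = nn.Conv3d(bands, embed_dim,
                        kernel_size=(1, patch, patch),
                        stride=(1, patch, patch))
tokens = patch_embed(x).flatten(2).transpose(1, 2)        # (batch, 3*14*14, embed_dim)
print(tokens.shape)                                       # torch.Size([2, 588, 768])
```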
Loss function
The mean squared error (MSE) loss is used for training, comparing the original and the reconstructed pixels; as in the original MAE, the loss is computed on the masked patches.
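Here is a minimal sketch of that loss under the MAE masking convention; the tensors and the 75% masking ratio are illustrative.

```python
import torch

# Hypothetical patch-level tensors: 2 samples, 196 patches, 768 pixel values per patch.
target = torch.randn(2, 196, 768)           # original pixel values per patch
pred = torch.randn(2, 196, 768)             # reconstruction from the decoder
mask = (torch.rand(2, 196) > 0.25).float()  # 1 = masked patch, 0 = visible patch

# Mean squared error per patch, averaged over the masked patches only.
loss_per_patch = ((pred - target) ** 2).mean(dim=-1)
loss = (loss_per_patch * mask).sum() / mask.sum()
print(loss)
```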
Model Training
The model training procedure is described on the IBM blog: https://research.ibm.com/blog/nasa-hugging-face-ibm. Unfortunately, not much detail is provided. IBM does mention, however, that the model was trained on its AI supercomputer Vela, a fully cloud-based machine used exclusively by IBM Research.
The supercomputer consists of 200 nodes, each equipped with 8 NVIDIA A100 GPUs with 80 GB of GPU memory. Each node has 1.5 TB of RAM and four 3.2 TB local drives, which accommodates the large datasets that need to be handled when training foundation models. The nodes are connected by a network that can transfer up to 100 GB/second.
Application
The Prithvi-100M Geospatial AI Foundation Model can be applied to a variety of downstream tasks. We focus on two tasks: Flooding and crop type identification.
Flooding
Keeping the original encoder of Prithvi-100M, the model is adapted to predict the extent of flooding in a satellite image. Details are described on Huggingface. The Sen1Floods11 dataset, covering 11 flood events on six continents, is used for fine-tuning.

In order to prepare Prithvi-100M for the downstream task, the patch embeddings need to be reshaped back to the original image layout. Then, a final 2D convolutional layer is added that performs the task-specific classification.
Each pixel in the image is classified as being either water or non-water (land). Since this is a classification problem, the binary cross-entropy loss is used. Only one image is processed at a time, so the time series functionality of Prithvi-100M is not used here.
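As a rough sketch of such a head (simplified compared to the released fine-tuning code, with illustrative layer sizes): the encoder's patch tokens are reshaped into a 2D feature map, a 1×1 convolution produces a per-patch water logit, and the result is upsampled to pixel resolution for the binary cross-entropy loss.

```python
import torch
import torch.nn as nn

embed_dim, grid = 768, 14
tokens = torch.randn(1, grid * grid, embed_dim)      # hypothetical encoder output

# Reshape the token sequence back into a 2D feature map ...
feat = tokens.transpose(1, 2).reshape(1, embed_dim, grid, grid)

# ... predict one water/non-water logit per patch and upsample to pixel resolution.
head = nn.Sequential(
    nn.Conv2d(embed_dim, 1, kernel_size=1),
    nn.Upsample(scale_factor=16, mode="bilinear", align_corners=False),
)
logits = head(feat)                                   # (1, 1, 224, 224)

labels = torch.randint(0, 2, (1, 1, 224, 224)).float()   # hypothetical flood mask
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
print(logits.shape, loss)
```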
The authors report a mean accuracy of 93% and mean intersection over union of 86% for a holdout flood event in Bolivia.
A demo page is provided where users can upload their own Sentinel-2 images and ask Prithvi-100M to identify flooding.

Crop type identification
To leverage the time series functionality, the authors provide a demo for crop type identification. The crop type ground truth is given by labeled images. This is a multi-class classification problem, and cross-entropy loss is used for training.
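The setup mirrors the flood example, but with multi-class outputs on top of the time-series encoder. A rough sketch of the loss computation, with an illustrative class count and made-up tensors:

```python
import torch
import torch.nn as nn

num_classes, height, width = 13, 224, 224      # hypothetical number of crop classes

# Per-pixel class scores from a segmentation head and labeled crop maps as ground truth.
logits = torch.randn(1, num_classes, height, width)
labels = torch.randint(0, num_classes, (1, height, width))

loss = nn.CrossEntropyLoss()(logits, labels)   # multi-class cross-entropy per pixel
print(loss)
```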

The authors report different accuracies for the different crop types. On average, the accuracy is 64%, and intersection over union is 46%. However, the authors note that the ground truth is noisy, and more accurate labels would help to improve this downstream task.

Summary
We have covered the Geospatial AI Foundation Model, currently (2023) the largest geospatial model on Huggingface, released under the name Prithvi-100M. The model was developed by IBM Research and NASA and was trained on the Harmonized Landsat Sentinel-2 (HLS) dataset.
We have discussed the training data, architecture, and training procedure of the Geospatial AI Foundation Model. The model is available as open source and can be fine-tuned for more specific tasks. The flood detection and crop type identification applications show the great potential of the Geospatial AI Foundation Model.
Since Sentinel-2 data is freely available, it would be possible for interested users to create their own model tuned to a specific downstream task. In a future post, I will show how to fine-tune the Geospatial AI Foundation Model for vegetation identification and super-resolution.
Further Reading
- Jakubik et al, Prithvi-100M, https://github.com/NASA-IMPACT/hls-foundation-os, 2023.
- Prithvi-100M on Huggingface: https://huggingface.co/ibm-nasa-geospatial
- Sentinel-2 satellite bands: https://gisgeography.com/sentinel-2-bands-combinations/
- He et al (2021), "Masked Autoencoders are Scalable Vision Learners", https://arxiv.org/pdf/2111.06377.pdf
- Dosovitskiy et al, "An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale", ICLR 2021, https://arxiv.org/abs/2010.11929