---
license: apache-2.0
tags:
- Pytorch
- Geospatial
- Temporal ViT
---

This repository contains the architecture of Prithvi, a first-of-its-kind temporal Vision Transformer foundation model pre-trained by the IBM and NASA team on Harmonized Landsat Sentinel-2 (HLS) data covering the continental US. The code is in the `hls-gfm` folder, along with instructions for obtaining the pre-trained weights through Hugging Face.

The repository also includes practical examples of fine-tuning Prithvi for two downstream applications, flood detection and fire scar detection. See the `fine-tuning-example` folder for details.

### Model and Input

The model expects remote sensing data in a video format with shape (B, C, T, H, W): batch, channels (spectral bands), time steps, height, and width. The temporal dimension T matters here and is absent from most other work on remote sensing modeling; handling a time series of images can help a variety of downstream tasks. The model can also handle static images, which are simply fed in with T=1.

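As a rough illustration of this layout, the sketch below stacks per-date frames into a single (B, C, T, H, W) sample. NumPy is used only to show shapes (the same axis order would apply to a PyTorch tensor). The helper name `frames_to_video` and the sizes (6 bands, 3 dates, 224 x 224 pixels) are assumptions made for the example, not values taken from this repository.

```python
import numpy as np


def frames_to_video(frames):
    """Stack T frames of shape (C, H, W) into one sample of shape (1, C, T, H, W).

    Hypothetical helper for illustration; it is not part of this repository.
    """
    video = np.stack(frames, axis=1)  # (C, T, H, W): time becomes the second axis
    return video[np.newaxis, ...]     # (1, C, T, H, W): add a batch axis of size 1


# Illustrative sizes only: 6 bands, 3 acquisition dates, 224 x 224 pixels.
frames = [np.zeros((6, 224, 224), dtype=np.float32) for _ in range(3)]

series = frames_to_video(frames)      # a time series, T=3
static = frames_to_video(frames[:1])  # a single static image, T=1

print(series.shape)  # (1, 6, 3, 224, 224)
print(static.shape)  # (1, 6, 1, 224, 224)
```

A static image goes through the same path as a time series; it is just a sequence of length one.
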
### Code

The model follows the [original MAE repo](https://github.com/facebookresearch/mae), with modifications that include, among others:

1. Replacing the 2D patch embedding with a 3D patch embedding
2. Replacing the 2D positional embedding with a 3D positional embedding
3. Replacing 2D patchify and unpatchify with their 3D counterparts

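To make the 3D patchify and unpatchify step more concrete, here is a minimal NumPy sketch of how a (B, C, T, H, W) array can be cut into non-overlapping t x p x p patches and reassembled. It is not the code in `hls-gfm`: the function names, the patch sizes, and the ordering of values inside each patch are assumptions chosen for illustration.

```python
import numpy as np


def patchify_3d(x, t, p):
    """Split (B, C, T, H, W) into (B, L, t*p*p*C) non-overlapping 3D patches."""
    B, C, T, H, W = x.shape
    if T % t or H % p or W % p:
        raise ValueError("T, H and W must be divisible by the patch sizes")
    gt, gh, gw = T // t, H // p, W // p       # patch grid along T, H, W
    x = x.reshape(B, C, gt, t, gh, p, gw, p)
    x = x.transpose(0, 2, 4, 6, 3, 5, 7, 1)   # (B, gt, gh, gw, t, p, p, C)
    return x.reshape(B, gt * gh * gw, t * p * p * C)


def unpatchify_3d(patches, shape, t, p):
    """Inverse of patchify_3d; `shape` is the original (B, C, T, H, W)."""
    B, C, T, H, W = shape
    gt, gh, gw = T // t, H // p, W // p
    x = patches.reshape(B, gt, gh, gw, t, p, p, C)
    x = x.transpose(0, 7, 1, 4, 2, 5, 3, 6)   # (B, C, gt, t, gh, p, gw, p)
    return x.reshape(B, C, T, H, W)


# Illustrative sizes only: batch 2, 6 bands, 3 time steps, 32 x 32 pixels,
# with patches spanning 1 time step and 16 x 16 pixels.
x = np.arange(2 * 6 * 3 * 32 * 32, dtype=np.float32).reshape(2, 6, 3, 32, 32)
patches = patchify_3d(x, t=1, p=16)
restored = unpatchify_3d(patches, x.shape, t=1, p=16)

print(patches.shape)                # (2, 12, 1536)
print(np.array_equal(restored, x))  # True
```

Compared with the 2D case, the only structural change is the extra time axis: the patch grid gains a third dimension, and each patch may span several time steps when t > 1.
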
### Pre-training

The model was pre-trained on NASA's Harmonized Landsat and Sentinel-2 (HLS) data using the following bands:

* Blue
* Green
* Red
* Narrow NIR
* SWIR 1
* SWIR 2

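When preparing inputs, it can help to fix the band order in one place. The sketch below stacks per-band arrays into a (C, H, W) image using the band names listed above. Treating the list order as the channel order is an assumption, as are the helper name `stack_bands` and the tiny 4 x 4 arrays; check the `hls-gfm` folder for the order the weights actually expect.

```python
import numpy as np

# Band names as listed above. Using this as the channel order is an assumption.
BANDS = ("Blue", "Green", "Red", "Narrow NIR", "SWIR 1", "SWIR 2")


def stack_bands(band_arrays):
    """Stack per-band (H, W) arrays into one (C, H, W) array in BANDS order.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    missing = [name for name in BANDS if name not in band_arrays]
    if missing:
        raise KeyError(f"missing bands: {missing}")
    return np.stack([band_arrays[name] for name in BANDS], axis=0)


# Dummy scene: each band is a constant 4 x 4 array holding its channel index.
scene = {name: np.full((4, 4), i, dtype=np.float32) for i, name in enumerate(BANDS)}
image = stack_bands(scene)

print(image.shape)  # (6, 4, 4)
```
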