Paolo-Fraccaro committed on
Commit
e4a3640
1 Parent(s): 66a6937

Update README.md

Files changed (1)
  1. README.md +10 -11
README.md CHANGED
@@ -4,25 +4,19 @@ tags:
  - Pytorch
  - Geospatial
  - Temporal ViT
  ---

- This repository includes the foundation model architecture of Prithvi, a first-of-its-kind temporal Vision transformer pretrained by the IBM and NASA team on continental US Harmonised Landsat Sentinel 2 (HLS) data. This is contained in the `hls-gfm` folder, alongside all the relevant info on how to obtain the pre-trained weights through Hugging Face.
- This repo also contains a practical implementation of finetuning Prithvi to flood detection and fire scars detection as an example of a specific downstream application. See the `fine-tuning-example` folder for more details.


- ### Model and Input
  The model expects remote sensing data in a video format (B, C, T, H, W). Note that the temporal dimension is very important here and not present in most
  other works around remote sensing modeling. Being able to handle a time series of remote sensing images can be very helpful for a variety of downstream tasks. The model can also handle static images, which can simply be fed into the model with T=1.

- ### Code
- The model follows [original mae repo](https://github.com/facebookresearch/mae) with modifications including:
- 1. replace 2D patch embed with 3D patch embed
- 2. replace 2D positional embed with 3D positional embed
- 3. replace 2D patchify and unpatchify with 3D
- 4. etc.
-
  ### Pre-training
- The model was pre-trained with Harmonised Landsat and Sentinel 2 data from NASA using the following bands:

  * Blue
  * Green
@@ -31,3 +25,8 @@ The model was pre-trained with Harmonised Landsat and Sentinel 2 data from NASA
  * SWIR 1
  * SWIR 2

  - Pytorch
  - Geospatial
  - Temporal ViT
+ - Vit
  ---

+ ### Model and Inputs
+ Prithvi is a first-of-its-kind temporal Vision transformer pretrained by the IBM and NASA team on continental US Harmonised Landsat Sentinel 2 (HLS) data. Specifically, the model adopts a self-supervised encoder with a ViT architecture and a Masked Autoencoder (MAE) learning strategy, using an MSE loss function. The model applies spatial attention across multiple patches as well as temporal attention for each patch.
+ ![](Prithvi_training.png)
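
A rough sketch of the masked-autoencoder objective described above: the reconstruction loss is a mean squared error computed only over the masked patch tokens. The shapes, mask ratio and variable names below are illustrative assumptions, not the actual pre-training code.

```python
import torch

# Illustrative shapes: patch tokens obtained by 3D-patchifying a (B, C, T, H, W) input.
batch, num_patches, patch_dim = 2, 588, 1536

target = torch.randn(batch, num_patches, patch_dim)          # original pixel patches
reconstruction = torch.randn(batch, num_patches, patch_dim)  # decoder output
mask = (torch.rand(batch, num_patches) < 0.75).float()       # 1 = patch was masked

# MSE per patch, averaged only over the masked patches; visible patches
# do not contribute to the loss in an MAE-style objective.
loss_per_patch = ((reconstruction - target) ** 2).mean(dim=-1)
loss = (loss_per_patch * mask).sum() / mask.sum()
```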
  The model expects remote sensing data in a video format (B, C, T, H, W). Note that the temporal dimension is very important here and not present in most
  other works around remote sensing modeling. Being able to handle a time series of remote sensing images can be very helpful for a variety of downstream tasks. The model can also handle static images, which can simply be fed into the model with T=1.
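
For example (the batch size, band count, number of timestamps and image size below are assumptions for illustration, not requirements stated in this model card):

```python
import torch

B, C, H, W = 1, 6, 224, 224  # assumed batch size, band count and image size

time_series = torch.randn(B, C, 3, H, W)   # a time series of three acquisitions: T = 3
single_image = torch.randn(B, C, 1, H, W)  # a static image is simply T = 1
```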
  ### Pre-training
+ The model was pre-trained with NASA's HLS2 L30 product (30m granularity) over the continental United States. The following bands were used:

  * Blue
  * Green

  * SWIR 1
  * SWIR 2

+ ### Code
+ The model follows the [original mae repo](https://github.com/facebookresearch/mae) with some modifications, including (a brief sketch of the first change follows the list):
+ 1. replace 2D patch embed with 3D patch embed;
+ 2. replace 2D positional embed with 3D positional embed;
+ 3. replace 2D patchify and unpatchify with 3D.
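
A minimal sketch of the first modification above, using a Conv3d-based 3D patch embedding. This mirrors the common approach of tokenizing a video with a stride equal to the kernel size; the class name, band count and patch size are illustrative and may differ from the actual code in this repository.

```python
import torch
from torch import nn

class PatchEmbed3D(nn.Module):
    """Split a (B, C, T, H, W) input into non-overlapping 3D patches and project them."""
    def __init__(self, in_chans=6, embed_dim=768, patch_size=(1, 16, 16)):
        super().__init__()
        # A Conv3d whose stride equals its kernel size tokenizes the input into 3D patches.
        self.proj = nn.Conv3d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                     # (B, embed_dim, T', H', W')
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)

tokens = PatchEmbed3D()(torch.randn(1, 6, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 588, 768])
```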