---
{}
---
# AM-RADIO: Reduce All Domains Into One

Mike Ranzinger, Greg Heinrich, Jan Kautz, Pavlo Molchanov

[NVIDIA Research](https://www.nvidia.com/en-us/research/)

\[[AM-RADIO Paper](https://arxiv.org/abs/2312.06709)\]
\[[PHI-S Paper](https://arxiv.org/abs/2410.01680)\]
\[[BibTeX](#citing-radio)\]\[[GitHub examples](https://github.com/NVlabs/RADIO)\]
\[[Tech report on v2.5](https://github.com/NVlabs/RADIO/blob/main/RADIOv2.5_tech_report.md)\]


### HuggingFace Hub

You can pull the model from a Python script:

```Python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

hf_repo = "nvidia/RADIO-L"

image_processor = CLIPImageProcessor.from_pretrained(hf_repo)
model = AutoModel.from_pretrained(hf_repo, trust_remote_code=True)
model.eval().cuda()

image = Image.open('./assets/radio.png').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt', do_resize=True).pixel_values
pixel_values = pixel_values.cuda()

# Inference only, so gradient tracking is disabled
with torch.no_grad():
    summary, spatial_features = model(pixel_values)
```

### Usage

RADIO returns a tuple of two tensors. The `summary` is similar to the `cls_token` in ViT and represents the general concept of the entire image. It has shape $(B,C)$, where $B$ is the batch dimension and $C$ is the number of channels. The `spatial_features` represent more localized content and are suitable for dense tasks such as semantic segmentation, or for integration into an LLM. They have shape $(B,T,D)$, where $T$ is the number of flattened spatial tokens and $D$ is the number of channels for spatial features. Note that $C \neq D$ in general.

To convert the spatial features to a spatial tensor layout, use the model's downsampling factor (its patch size) together with the input tensor shape. For 'radio_v1', the patch size is 14.
```Python
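from einops import rearrange
# x is the input tensor passed to the model; patch_size is the model's patch size (14 for 'radio_v1')
spatial_features = rearrange(spatial_features, 'b (h w) d -> b d h w', h=x.shape[-2] // patch_size, w=x.shape[-1] // patch_size)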
```

The resulting tensor will have shape $(B,D,H,W)$, as is typically seen with computer vision models.
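
The shape arithmetic behind this conversion can be worked through without loading the model. The following is a minimal sketch in plain Python; the helper names `token_grid` and `token_position` are hypothetical and not part of RADIO, and the sketch assumes the input height and width are exact multiples of the patch size, so that $T = H \cdot W$.

```Python
from typing import Tuple


def token_grid(height: int, width: int, patch_size: int) -> Tuple[int, int]:
    """Return (h, w): the number of patch rows and columns for an input image.

    Hypothetical helper. Assumes the input size is a multiple of patch_size.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("input size must be a multiple of patch_size")
    return height // patch_size, width // patch_size


def token_position(t: int, grid_w: int) -> Tuple[int, int]:
    """Map a flattened token index t to (row, col).

    This mirrors the '(h w)' grouping in the rearrange pattern, where the
    row index varies slowest and the column index varies fastest.
    """
    return divmod(t, grid_w)


# A 224x224 input with the 'radio_v1' patch size of 14
h, w = token_grid(224, 224, patch_size=14)
num_tokens = h * w  # T, the length of the flattened token axis
print(h, w, num_tokens)
print(token_position(17, w))
```

For other input sizes or patch sizes, the same two divisions give the `h` and `w` arguments passed to `rearrange`.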

### RADIOv2.5 Notes

See the [RADIOv2.5 technical report](https://github.com/NVlabs/RADIO/blob/main/RADIOv2.5_tech_report.md).

## License

RADIO code and weights are released under the [NSCLv1 License](LICENSE).

## Citing RADIO

If you find this repository useful, please consider starring it and citing our work:
```
@InProceedings{Ranzinger_2024_CVPR,
    author    = {Ranzinger, Mike and Heinrich, Greg and Kautz, Jan and Molchanov, Pavlo},
    title     = {AM-RADIO: Agglomerative Vision Foundation Model Reduce All Domains Into One},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {12490-12500}
}
```

```
@misc{ranzinger2024phisdistributionbalancinglabelfree,
      title={PHI-S: Distribution Balancing for Label-Free Multi-Teacher Distillation}, 
      author={Mike Ranzinger and Jon Barker and Greg Heinrich and Pavlo Molchanov and Bryan Catanzaro and Andrew Tao},
      year={2024},
      eprint={2410.01680},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2410.01680}, 
}
```