Model Card for ArborCLIP

ARBORCLIP is a new suite of vision-language foundation models for biodiversity. These CLIP-style foundation models were trained on ARBORETUM-40M, which is a large-scale dataset of 40 million images of 33K species of plants and animals. The models are evaluated on zero-shot image classification tasks.

Model type: Vision Transformer (ViT-B/16, ViT-L/14)
License: MIT
Fine-tuned from model: OpenAI CLIP, MetaCLIP, BioCLIP

These models were developed for the benefit of the AI community as an open-source product. Thus, we request that any derivative products also be open-source.

Model Description

ArborCLIP is based on OpenAI's CLIP model. We trained the following configurations on ARBORETUM-40M:

ARBORCLIP-O: A ViT-B/16 backbone initialized from the OpenCLIP checkpoint and trained for 40 epochs.
ARBORCLIP-B: A ViT-B/16 backbone initialized from the BioCLIP checkpoint and trained for 8 epochs.
ARBORCLIP-M: A ViT-L/14 backbone initialized from the MetaCLIP checkpoint and trained for 12 epochs.

To access the checkpoints of the above models, go to the Files and versions tab and download the weights. These weights can be used directly for zero-shot classification and fine-tuning. The filename for each model's weights is listed below:

ARBORCLIP-O: arborclip-vit-b-16-from-openai-epoch-40.pt
ARBORCLIP-B: arborclip-vit-b-16-from-bioclip-epoch-8.pt
ARBORCLIP-M: arborclip-vit-l-14-from-metaclip-epoch-12.pt
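
When scripting downloads or evaluations, it can help to keep the variant-to-file mapping in one place. The following is a minimal Python sketch: the CHECKPOINTS dictionary copies the filenames listed above, while parse_checkpoint_name and the field names it returns are hypothetical helpers. The regular expression assumes the naming pattern visible in those three filenames, which is an inference rather than a documented convention.

```python
import re
from typing import Dict

# Filenames exactly as listed in this model card.
CHECKPOINTS: Dict[str, str] = {
    "ARBORCLIP-O": "arborclip-vit-b-16-from-openai-epoch-40.pt",
    "ARBORCLIP-B": "arborclip-vit-b-16-from-bioclip-epoch-8.pt",
    "ARBORCLIP-M": "arborclip-vit-l-14-from-metaclip-epoch-12.pt",
}

# Pattern inferred from the three filenames above (an assumption).
_NAME_RE = re.compile(r"^arborclip-(vit-[a-z]-\d+)-from-([a-z]+)-epoch-(\d+)\.pt$")


def parse_checkpoint_name(filename: str) -> Dict[str, object]:
    """Split a checkpoint filename into backbone, initialization, and epochs."""
    match = _NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognized checkpoint name: {filename}")
    backbone, init, epochs = match.groups()
    return {"backbone": backbone, "initialized_from": init, "epochs": int(epochs)}


if __name__ == "__main__":
    for variant, filename in CHECKPOINTS.items():
        print(variant, parse_checkpoint_name(filename))
```

A filename that does not follow the assumed pattern raises ValueError instead of being silently misread.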

Model Training

See the Model Training section of the GitHub repository for examples of how to use ArborCLIP models in zero-shot image classification tasks.

We train three models using a modified version of the BioCLIP / OpenCLIP codebase. Each model is trained on ARBORETUM-40M, on 2 nodes, 8xH100 GPUs, on NYU's Greene high-performance compute cluster. We publicly release all code needed to reproduce our results on the GitHub page.

We tune our hyperparameters with Ray prior to training. Our standard training parameters are as follows:

--dataset-type webdataset 
--pretrained openai 
--text_type random 
--dataset-resampled 
--warmup 5000 
--batch-size 4096 
--accum-freq 1 
--epochs 40
--workers 8 
--model ViT-B-16 
--lr 0.0005 
--wd 0.0004 
--precision bf16 
--beta1 0.98 
--beta2 0.99 
--eps 1.0e-6 
--local-loss 
--gather-with-grad 
--ddp-static-graph 
--grad-checkpointing
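
To keep these settings versioned alongside experiment code, the flags can be stored as a dictionary and rendered into an argument list. This is a sketch rather than part of the released codebase: BASE_CONFIG copies the values listed above, to_cli_args is a hypothetical helper, and the training entry point that would consume the arguments is deliberately left out.

```python
from typing import Dict, List, Optional, Union

FlagValue = Union[str, int, float, bool]

# Values copied from the flag list above; True marks a switch without a value.
BASE_CONFIG: Dict[str, FlagValue] = {
    "dataset-type": "webdataset",
    "pretrained": "openai",
    "text_type": "random",
    "dataset-resampled": True,
    "warmup": 5000,
    "batch-size": 4096,
    "accum-freq": 1,
    "epochs": 40,
    "workers": 8,
    "model": "ViT-B-16",
    "lr": 0.0005,
    "wd": 0.0004,
    "precision": "bf16",
    "beta1": 0.98,
    "beta2": 0.99,
    "eps": "1.0e-6",  # kept as a string to preserve the original notation
    "local-loss": True,
    "gather-with-grad": True,
    "ddp-static-graph": True,
    "grad-checkpointing": True,
}


def to_cli_args(
    config: Dict[str, FlagValue],
    overrides: Optional[Dict[str, FlagValue]] = None,
) -> List[str]:
    """Render a config dictionary as a flat list of command-line arguments."""
    merged = {**config, **(overrides or {})}
    args: List[str] = []
    for name, value in merged.items():
        if isinstance(value, bool):
            if value:  # switches appear only when enabled
                args.append(f"--{name}")
        else:
            args.extend([f"--{name}", str(value)])
    return args


if __name__ == "__main__":
    # Example override: a shorter run of 8 epochs.
    print(" ".join(to_cli_args(BASE_CONFIG, {"epochs": 8})))
```

Only the epoch count is overridden in the usage example. Reproducing another variant would also require changing flags such as --model and --pretrained, whose exact values are not given here.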

For more extensive documentation of the training process and the significance of each hyperparameter, we recommend consulting the OpenCLIP and BioCLIP documentation.

Model Validation

To validate the zero-shot accuracy of our trained models and compare them to other benchmarks, we use the VLHub repository with some slight modifications.
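
As a reminder of what zero-shot accuracy measures, the sketch below scores each image embedding against one text embedding per class by cosine similarity, predicts the best-scoring class, and reports the fraction of correct predictions. It is a pure-Python illustration with toy vectors, not the VLHub implementation. The function names and the example embeddings are invented for this sketch; in practice the embeddings would come from the model's image and text encoders.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def predict(image_emb: Sequence[float], class_embs: Sequence[Sequence[float]]) -> int:
    """Return the index of the class whose text embedding is most similar."""
    scores = [cosine(image_emb, class_emb) for class_emb in class_embs]
    return max(range(len(scores)), key=scores.__getitem__)


def zero_shot_accuracy(
    image_embs: Sequence[Sequence[float]],
    labels: Sequence[int],
    class_embs: Sequence[Sequence[float]],
) -> float:
    """Top-1 accuracy: fraction of images whose predicted class equals the label."""
    correct = sum(predict(emb, class_embs) == y for emb, y in zip(image_embs, labels))
    return correct / len(labels)


if __name__ == "__main__":
    # Toy 2-D embeddings: two classes, three images (the last is misclassified).
    toy_class_embs = [[1.0, 0.0], [0.0, 1.0]]
    toy_image_embs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
    toy_labels = [0, 1, 1]
    print(zero_shot_accuracy(toy_image_embs, toy_labels, toy_class_embs))
```

No class is ever trained on directly: the class set is defined entirely by the text embeddings, which is why the same checkpoint can be evaluated on new label sets.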

Pre-Run

After cloning the GitHub repository and navigating to the Arboretum/model_validation directory, we recommend installing all the project requirements into a conda environment with pip install -r requirements.txt. Before executing a command in VLHub, also add Arboretum/model_validation/src to your PYTHONPATH:

export PYTHONPATH="$PYTHONPATH:$PWD/src";

Base Command

A basic Arboretum model evaluation command can be launched as follows. This example evaluates a CLIP-ResNet50 checkpoint, whose weights reside at the path passed via the --resume flag, on the ImageNet validation set and reports the results to Weights and Biases.

python src/training/main.py --batch-size=32 --workers=8 --imagenet-val "/imagenet/val/" --model="resnet50" --zeroshot-frequency=1 --image-size=224 --resume "/PATH/TO/WEIGHTS.pth" --report-to wandb

Training Dataset

Dataset Repository: Arboretum
Dataset Paper: Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity (arXiv)
HF Dataset card: Arboretum

Model Limitations

All the ArborCLIP models were evaluated on the challenging CONFOUNDING-SPECIES benchmark, and all of them performed at or below random chance. Improving performance on this benchmark is an interesting avenue for follow-up work that could further expand the models' capabilities.

In general, we found that models trained on web-scraped data performed better with common names, whereas models trained on specialist datasets performed better when using scientific names. Additionally, models trained on web-scraped data excel at classifying at the highest taxonomic level (kingdom), while models begin to benefit from specialist datasets like ARBORETUM-40M and Tree-of-Life-10M at the lower taxonomic levels (order and species). From a practical standpoint, ArborCLIP is highly accurate at the species level, and higher-level taxa can be deterministically derived from lower ones.
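
Because each species belongs to exactly one genus, family, order, and so on, a species-level prediction can be rolled up to any higher rank with a lookup rather than a second classifier. A minimal sketch follows. The TAXONOMY table is a tiny hand-written example, and roll_up and accuracy_at_rank are hypothetical helpers, not part of the ArborCLIP code.

```python
from typing import Dict, Sequence, Tuple

RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")

# Hand-written example lineages (kingdom through genus) for three species.
TAXONOMY: Dict[str, Tuple[str, ...]] = {
    "Danaus plexippus": ("Animalia", "Arthropoda", "Insecta", "Lepidoptera", "Nymphalidae", "Danaus"),
    "Apis mellifera": ("Animalia", "Arthropoda", "Insecta", "Hymenoptera", "Apidae", "Apis"),
    "Quercus alba": ("Plantae", "Tracheophyta", "Magnoliopsida", "Fagales", "Fagaceae", "Quercus"),
}


def roll_up(species: str, rank: str) -> str:
    """Map a species name to its taxon at the requested rank."""
    if rank == "species":
        return species
    return TAXONOMY[species][RANKS.index(rank)]


def accuracy_at_rank(predicted: Sequence[str], actual: Sequence[str], rank: str) -> float:
    """Fraction of predictions that match the true label once both are rolled up."""
    hits = sum(roll_up(p, rank) == roll_up(a, rank) for p, a in zip(predicted, actual))
    return hits / len(actual)


if __name__ == "__main__":
    predicted = ["Apis mellifera", "Quercus alba"]
    actual = ["Danaus plexippus", "Quercus alba"]
    for rank in ("species", "order", "class"):
        print(rank, accuracy_at_rank(predicted, actual, rank))
```

In the usage example, the honey bee prediction for a monarch butterfly is wrong at the species and order levels but correct at the class level, since both are insects.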

Addressing these limitations will further enhance the applicability of models like ArborCLIP in real-world biodiversity monitoring tasks.

Acknowledgements

This work was supported by the AI Research Institutes program supported by the NSF and USDA-NIFA under AI Institute: for Resilient Agriculture, Award No. 2021-67021-35329. This was also partly supported by the NSF under CPS Frontier grant CNS-1954556. Also, we gratefully acknowledge the support of NYU IT High Performance Computing resources, services, and staff expertise.

Citation

If you find the models and datasets useful in your research, please consider citing our paper:

@misc{yang2024arboretumlargemultimodaldataset,
        title={Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity}, 
        author={Chih-Hsuan Yang and Benjamin Feuer and Zaki Jubery and Zi K. Deng and Andre Nakkab and
            Md Zahid Hasan and Shivani Chiranjeevi and Kelly Marshall and Nirmal Baishnab and Asheesh K Singh and
            Arti Singh and Soumik Sarkar and Nirav Merchant and Chinmay Hegde and Baskar Ganapathysubramanian},
        year={2024},
        eprint={2406.17720},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2406.17720}, 
  }

For more details and access to the Arboretum dataset, please visit the Project Page.