---
license: apache-2.0
datasets:
- imagenet-1k
- ade20k
metrics:
- accuracy
- mIoU
pipeline_tag: image-classification
---

# VisionLLaMA-Base-MAE

VisionLLaMA-Base-MAE is pretrained on ImageNet-1K without labels under the Masked Autoencoder (MAE) paradigm. It shows substantial improvements on ImageNet-1K classification (supervised fine-tuning and linear probing) and on ADE20K semantic segmentation.

| Model | ImageNet Acc (SFT, %) | ImageNet Acc (Linear Probe, %) | ADE20K Segmentation (mIoU) |
| -- | -- | -- | -- |
| VisionLLaMA-Base-MAE (ep800) | 84.0 | 69.7 | 49.0 |
| VisionLLaMA-Base-MAE (ep1600) | 84.3 | 71.7 | 50.2 |
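
For context on the pretraining objective: MAE hides a large fraction of patch tokens (typically 75%), feeds only the visible subset to the encoder, and reconstructs the masked pixels with a lightweight decoder. The snippet below is a minimal, illustrative sketch of the per-sample random masking step in the style of the original MAE recipe; it is not the exact code used to train this checkpoint.

```python
import torch


def random_mask_patches(patch_tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Randomly hide a fraction of patch tokens, MAE-style.

    patch_tokens: (batch, num_patches, dim) embedded image patches.
    Returns the visible tokens, a binary mask (1 = masked), and the
    indices needed to restore the original patch order for the decoder.
    """
    b, n, d = patch_tokens.shape
    n_keep = int(n * (1 - mask_ratio))

    # Per-sample random scores decide which patches stay visible.
    noise = torch.rand(b, n, device=patch_tokens.device)
    ids_shuffle = torch.argsort(noise, dim=1)
    ids_restore = torch.argsort(ids_shuffle, dim=1)

    # Gather the kept (visible) tokens for the encoder.
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(patch_tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))

    # Binary mask in the original patch order: 0 = visible, 1 = masked.
    mask = torch.ones(b, n, device=patch_tokens.device)
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return visible, mask, ids_restore
```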

# How to Use

Please refer to the [GitHub](https://github.com/Meituan-AutoML/VisionLLaMA) page for usage.
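
As a rough, hypothetical sketch of how an MAE-pretrained checkpoint is typically loaded in PyTorch (the constructor name, checkpoint filename, and the `"model"` key below are assumptions for illustration; the repository linked above defines the actual entry points):

```python
import torch

# Hypothetical names: see the VisionLLaMA repository for the real constructor
# and checkpoint files; MAE-style checkpoints often nest weights under "model".
from models import visionllama_base  # placeholder import

model = visionllama_base()
state = torch.load("visionllama_base_mae_ep1600.pth", map_location="cpu")
msg = model.load_state_dict(state.get("model", state), strict=False)
print("missing:", msg.missing_keys, "unexpected:", msg.unexpected_keys)
model.eval()
```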

# Citation

```
@article{chu2024visionllama,
  title={VisionLLaMA: A Unified LLaMA Interface for Vision Tasks},
  author={Chu, Xiangxiang and Su, Jianlin and Zhang, Bo and Shen, Chunhua},
  journal={arXiv preprint arXiv:2403.00522},
  year={2024}
}
```