# ViTDet: Exploring Plain Vision Transformer Backbones for Object Detection

Yanghao Li, Hanzi Mao, Ross Girshick†, Kaiming He†

[[`arXiv`](https://arxiv.org/abs/2203.16527)] [[`BibTeX`](#CitingViTDet)]

In this repository, we provide configs and models in Detectron2 for ViTDet as well as for MViTv2 and Swin backbones, with our implementation and settings as described in the [ViTDet](https://arxiv.org/abs/2203.16527) paper.

## Pretrained Models

### COCO

#### Mask R-CNN
| Name | pre-train | train time (s/im) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
| :--- | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ViTDet, ViT-B | IN1K, MAE | 0.314 | 0.079 | 10.9 | 51.6 | 45.9 | 325346929 | model |
| ViTDet, ViT-L | IN1K, MAE | 0.603 | 0.125 | 20.9 | 55.5 | 49.2 | 325599698 | model |
| ViTDet, ViT-H | IN1K, MAE | 1.098 | 0.178 | 31.5 | 56.7 | 50.2 | 329145471 | model |
#### Cascade Mask R-CNN
| Name | pre-train | train time (s/im) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
| :--- | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Swin-B | IN21K, sup | 0.389 | 0.077 | 8.7 | 53.9 | 46.2 | 342979038 | model |
| Swin-L | IN21K, sup | 0.508 | 0.097 | 12.6 | 55.0 | 47.2 | 342979186 | model |
| MViTv2-B | IN21K, sup | 0.475 | 0.090 | 8.9 | 55.6 | 48.1 | 325820315 | model |
| MViTv2-L | IN21K, sup | 0.844 | 0.157 | 19.7 | 55.7 | 48.3 | 325607715 | model |
| MViTv2-H | IN21K, sup | 1.655 | 0.285 | 18.4* | 55.9 | 48.3 | 326187358 | model |
| ViTDet, ViT-B | IN1K, MAE | 0.362 | 0.089 | 12.3 | 54.0 | 46.7 | 325358525 | model |
| ViTDet, ViT-L | IN1K, MAE | 0.643 | 0.142 | 22.3 | 57.6 | 50.0 | 328021305 | model |
| ViTDet, ViT-H | IN1K, MAE | 1.137 | 0.196 | 32.9 | 58.7 | 51.0 | 328730692 | model |
### LVIS

#### Mask R-CNN
| Name | pre-train | train time (s/im) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
| :--- | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ViTDet, ViT-B | IN1K, MAE | 0.317 | 0.085 | 14.4 | 40.2 | 38.2 | 329225748 | model |
| ViTDet, ViT-L | IN1K, MAE | 0.576 | 0.137 | 24.7 | 46.1 | 43.6 | 329211570 | model |
| ViTDet, ViT-H | IN1K, MAE | 1.059 | 0.186 | 35.3 | 49.1 | 46.0 | 332434656 | model |
#### Cascade Mask R-CNN
| Name | pre-train | train time (s/im) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
| :--- | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Swin-B | IN21K, sup | 0.368 | 0.090 | 11.5 | 44.0 | 39.6 | 329222304 | model |
| Swin-L | IN21K, sup | 0.486 | 0.105 | 13.8 | 46.0 | 41.4 | 329222724 | model |
| MViTv2-B | IN21K, sup | 0.475 | 0.100 | 11.8 | 46.3 | 42.0 | 329477206 | model |
| MViTv2-L | IN21K, sup | 0.844 | 0.172 | 21.0 | 49.4 | 44.2 | 329661552 | model |
| MViTv2-H | IN21K, sup | 1.661 | 0.290 | 21.3* | 49.5 | 44.1 | 330445165 | model |
| ViTDet, ViT-B | IN1K, MAE | 0.356 | 0.099 | 15.2 | 43.0 | 38.9 | 329226874 | model |
| ViTDet, ViT-L | IN1K, MAE | 0.629 | 0.150 | 24.9 | 49.2 | 44.5 | 329042206 | model |
| ViTDet, ViT-H | IN1K, MAE | 1.100 | 0.204 | 35.5 | 51.5 | 46.6 | 332552778 | model |
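The checkpoints in the tables above can also be loaded programmatically. The snippet below is a minimal, unofficial sketch of building a model from one of this project's lazy configs and loading downloaded weights for inference; the config path is illustrative and the checkpoint path is a placeholder, so adjust both to the config/model pair you downloaded.

```python
# Minimal sketch (not an official script): build a detector from a lazy config
# and load one of the downloaded checkpoints for inference.
import torch

from detectron2.checkpoint import DetectionCheckpointer
from detectron2.config import LazyConfig, instantiate

# Illustrative config path; use the config that matches the checkpoint you downloaded.
cfg = LazyConfig.load("projects/ViTDet/configs/COCO/mask_rcnn_vitdet_b_100ep.py")

model = instantiate(cfg.model)  # build the model described by the config
model.eval()
DetectionCheckpointer(model).load("/path/to/model_checkpoint")  # downloaded weights

# Detectron2 detectors take a list of dicts, each with a CHW image tensor; the
# expected channel order and normalization are defined by the config itself.
image = torch.zeros(3, 1024, 1024)  # dummy input for illustration
with torch.no_grad():
    outputs = model([{"image": image, "height": 1024, "width": 1024}])
print(outputs[0]["instances"])
```

For standard training and evaluation, use the `lazyconfig_train_net.py` commands described in the sections below.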
Note: Unlike the system-level comparisons in the paper, the models in the tables above use a lower resolution (1024 instead of 1280) and standard NMS (instead of soft NMS). As a result, they have slightly lower box and mask AP.

We observed higher variance on LVIS evaluation results compared to COCO. For example, when we trained ViTDet, ViT-B five times with varying random seeds, the standard deviations of box AP and mask AP were 0.30% (compared to 0.10% on COCO).

The above models were trained and measured on 8 nodes with 64 NVIDIA A100 GPUs in total.

*: Activation checkpointing is used.

## Training

All configs can be trained with:

```
../../tools/lazyconfig_train_net.py --config-file configs/path/to/config.py
```

By default, we use 64 GPUs with a batch size of 64 for training.

## Evaluation

Model evaluation can be done similarly:

```
../../tools/lazyconfig_train_net.py --config-file configs/path/to/config.py --eval-only train.init_checkpoint=/path/to/model_checkpoint
```

## <a name="CitingViTDet"></a>Citing ViTDet

If you use ViTDet, please use the following BibTeX entry.

```BibTeX
@article{li2022exploring,
  title={Exploring plain vision transformer backbones for object detection},
  author={Li, Yanghao and Mao, Hanzi and Girshick, Ross and He, Kaiming},
  journal={arXiv preprint arXiv:2203.16527},
  year={2022}
}
```