# ViTDet: Exploring Plain Vision Transformer Backbones for Object Detection

Yanghao Li, Hanzi Mao, Ross Girshick†, Kaiming He†

[[`arXiv`](https://arxiv.org/abs/2203.16527)] [[`BibTeX`](#CitingViTDet)]

In this repository, we provide configs and models in Detectron2 for ViTDet, as well as for MViTv2 and Swin backbones, using our implementation and the settings described in the [ViTDet](https://arxiv.org/abs/2203.16527) paper.

## Pretrained Models

### COCO

#### Mask R-CNN

Name | pre-train | train time (s/im) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
---|---|---|---|---|---|---|---|---|
ViTDet, ViT-B | IN1K, MAE | 0.314 | 0.079 | 10.9 | 51.6 | 45.9 | 325346929 | model |
ViTDet, ViT-L | IN1K, MAE | 0.603 | 0.125 | 20.9 | 55.5 | 49.2 | 325599698 | model |
ViTDet, ViT-H | IN1K, MAE | 1.098 | 0.178 | 31.5 | 56.7 | 50.2 | 329145471 | model |

#### Cascade Mask R-CNN

Name | pre-train | train time (s/im) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
---|---|---|---|---|---|---|---|---|
Swin-B | IN21K, sup | 0.389 | 0.077 | 8.7 | 53.9 | 46.2 | 342979038 | model |
Swin-L | IN21K, sup | 0.508 | 0.097 | 12.6 | 55.0 | 47.2 | 342979186 | model |
MViTv2-B | IN21K, sup | 0.475 | 0.090 | 8.9 | 55.6 | 48.1 | 325820315 | model |
MViTv2-L | IN21K, sup | 0.844 | 0.157 | 19.7 | 55.7 | 48.3 | 325607715 | model |
MViTv2-H | IN21K, sup | 1.655 | 0.285 | 18.4* | 55.9 | 48.3 | 326187358 | model |
ViTDet, ViT-B | IN1K, MAE | 0.362 | 0.089 | 12.3 | 54.0 | 46.7 | 325358525 | model |
ViTDet, ViT-L | IN1K, MAE | 0.643 | 0.142 | 22.3 | 57.6 | 50.0 | 328021305 | model |
ViTDet, ViT-H | IN1K, MAE | 1.137 | 0.196 | 32.9 | 58.7 | 51.0 | 328730692 | model |

### LVIS

#### Mask R-CNN

Name | pre-train | train time (s/im) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
---|---|---|---|---|---|---|---|---|
ViTDet, ViT-B | IN1K, MAE | 0.317 | 0.085 | 14.4 | 40.2 | 38.2 | 329225748 | model |
ViTDet, ViT-L | IN1K, MAE | 0.576 | 0.137 | 24.7 | 46.1 | 43.6 | 329211570 | model |
ViTDet, ViT-H | IN1K, MAE | 1.059 | 0.186 | 35.3 | 49.1 | 46.0 | 332434656 | model |

#### Cascade Mask R-CNN

Name | pre-train | train time (s/im) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
---|---|---|---|---|---|---|---|---|
Swin-B | IN21K, sup | 0.368 | 0.090 | 11.5 | 44.0 | 39.6 | 329222304 | model |
Swin-L | IN21K, sup | 0.486 | 0.105 | 13.8 | 46.0 | 41.4 | 329222724 | model |
MViTv2-B | IN21K, sup | 0.475 | 0.100 | 11.8 | 46.3 | 42.0 | 329477206 | model |
MViTv2-L | IN21K, sup | 0.844 | 0.172 | 21.0 | 49.4 | 44.2 | 329661552 | model |
MViTv2-H | IN21K, sup | 1.661 | 0.290 | 21.3* | 49.5 | 44.1 | 330445165 | model |
ViTDet, ViT-B | IN1K, MAE | 0.356 | 0.099 | 15.2 | 43.0 | 38.9 | 329226874 | model |
ViTDet, ViT-L | IN1K, MAE | 0.629 | 0.150 | 24.9 | 49.2 | 44.5 | 329042206 | model |
ViTDet, ViT-H | IN1K, MAE | 1.100 | 0.204 | 35.5 | 51.5 | 46.6 | 332552778 | model |
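
The timing columns are reported in seconds per image (s/im), which can be easier to compare as throughput (images per second). The following is a minimal sketch of that conversion, using the three ViTDet rows at the end of the table above. The helper names `images_per_second` and `throughput_table` are hypothetical and are not part of Detectron2. Absolute timings depend on hardware and batch size, which this section does not state, so the numbers are best read as relative comparisons.

```python
# Rows copied from the last three ViTDet entries of the table above:
# (name, train time in s/im, inference time in s/im).
ROWS = [
    ("ViTDet, ViT-B", 0.356, 0.099),
    ("ViTDet, ViT-L", 0.629, 0.150),
    ("ViTDet, ViT-H", 1.100, 0.204),
]


def images_per_second(seconds_per_image: float) -> float:
    """Convert a per-image time (s/im) into throughput (im/s)."""
    if seconds_per_image <= 0:
        raise ValueError("seconds_per_image must be positive")
    return 1.0 / seconds_per_image


def throughput_table(rows):
    """Map each model name to (train im/s, inference im/s), rounded to 2 places."""
    return {
        name: (
            round(images_per_second(train_s), 2),
            round(images_per_second(infer_s), 2),
        )
        for name, train_s, infer_s in rows
    }


if __name__ == "__main__":
    for name, (train_ips, infer_ips) in throughput_table(ROWS).items():
        print(f"{name}: train {train_ips} im/s, inference {infer_ips} im/s")
```

The same conversion applies to any other row once its timing values are copied into `ROWS`.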