# Detectron2 Model Zoo and Baselines

## Introduction

This file documents a large collection of baselines trained with detectron2 in Sep-Oct, 2019.
All numbers were obtained on [Big Basin](https://engineering.fb.com/data-center-engineering/introducing-big-basin-our-next-generation-ai-hardware/)
servers with 8 NVIDIA V100 GPUs & NVLink. The speed numbers are periodically updated with the latest PyTorch/CUDA/cuDNN versions.
You can access these models from code using the [detectron2.model_zoo](https://detectron2.readthedocs.io/modules/model_zoo.html) APIs.

In addition to these official baseline models, you can find more models in [projects/](projects/).

#### How to Read the Tables

* The "Name" column contains a link to the config file. Models can be reproduced using `tools/train_net.py` with the corresponding yaml config file,
  or `tools/lazyconfig_train_net.py` for python config files.
* Training speed is averaged across the entire training.
  We keep updating the speed with the latest version of detectron2/pytorch/etc.,
  so it might differ from the `metrics` file.
  Training speed for multi-machine jobs is not provided.
* Inference speed is measured by `tools/train_net.py --eval-only`, or [inference_on_dataset()](https://detectron2.readthedocs.io/modules/evaluation.html#detectron2.evaluation.inference_on_dataset),
  with batch size 1 in detectron2 directly.
  Measuring it with custom code may introduce other overhead.
  Actual deployment in production should in general be faster than the given inference
  speed due to more optimizations.
* The *model id* column is provided for ease of reference.
  To let you check the integrity of a downloaded file, every model on this page contains its md5 prefix in its file name.
* Training curves and other statistics can be found in `metrics` for each model.

#### Common Settings for COCO Models

* All COCO models were trained on `train2017` and evaluated on `val2017`.
* The default settings are __not directly comparable__ with Detectron's standard settings.
  For example, our default training data augmentation uses scale jittering in addition to horizontal flipping.

  To make fair comparisons with Detectron's settings, see
  [Detectron1-Comparisons](configs/Detectron1-Comparisons/) for accuracy comparison,
  and [benchmarks](https://detectron2.readthedocs.io/notes/benchmarks.html)
  for speed comparison.
* For Faster/Mask R-CNN, we provide baselines based on __3 different backbone combinations__:
  * __FPN__: Use a ResNet+FPN backbone with standard conv and FC heads for mask and box prediction,
    respectively. It obtains the best
    speed/accuracy tradeoff, but the other two are still useful for research.
  * __C4__: Use a ResNet conv4 backbone with conv5 head. The original baseline in the Faster R-CNN paper.
  * __DC5__ (Dilated-C5): Use a ResNet conv5 backbone with dilations in conv5, and standard conv and FC heads
    for mask and box prediction, respectively.
    This is used by the Deformable ConvNet paper.
* Most models are trained with the 3x schedule (~37 COCO epochs).
  Although 1x models are heavily under-trained, we provide some ResNet-50 models with the 1x (~12 COCO epochs)
  training schedule for comparison when doing quick research iteration.

#### ImageNet Pretrained Models

It's common to initialize from backbone models pre-trained on ImageNet classification tasks. The following backbone models are available:

* [R-50.pkl](https://dl.fbaipublicfiles.com/detectron2/ImageNetPretrained/MSRA/R-50.pkl): converted copy of [MSRA's original ResNet-50](https://github.com/KaimingHe/deep-residual-networks) model.
* [R-101.pkl](https://dl.fbaipublicfiles.com/detectron2/ImageNetPretrained/MSRA/R-101.pkl): converted copy of [MSRA's original ResNet-101](https://github.com/KaimingHe/deep-residual-networks) model.
* [X-101-32x8d.pkl](https://dl.fbaipublicfiles.com/detectron2/ImageNetPretrained/FAIR/X-101-32x8d.pkl): ResNeXt-101-32x8d model trained with Caffe2 at FB.
* [R-50.pkl (torchvision)](https://dl.fbaipublicfiles.com/detectron2/ImageNetPretrained/torchvision/R-50.pkl): converted copy of [torchvision's ResNet-50](https://pytorch.org/docs/stable/torchvision/models.html#torchvision.models.resnet50) model.
  More details can be found in [the conversion script](tools/convert-torchvision-to-d2.py).

Note that the above models have a __different__ format from those provided in Detectron: we do not fuse BatchNorm into an affine layer.
Pretrained models in Detectron's format can still be used. For example:

* [X-152-32x8d-IN5k.pkl](https://dl.fbaipublicfiles.com/detectron/ImageNetPretrained/25093814/X-152-32x8d-IN5k.pkl):
  ResNeXt-152-32x8d model trained on ImageNet-5k with Caffe2 at FB (see ResNeXt paper for details on ImageNet-5k).
* [R-50-GN.pkl](https://dl.fbaipublicfiles.com/detectron/ImageNetPretrained/47261647/R-50-GN.pkl):
  ResNet-50 with Group Normalization.
* [R-101-GN.pkl](https://dl.fbaipublicfiles.com/detectron/ImageNetPretrained/47592356/R-101-GN.pkl):
  ResNet-101 with Group Normalization.

These models require slightly different settings regarding normalization and architecture. See the model zoo configs for reference.

#### License

All models available for download through this document are licensed under the
[Creative Commons Attribution-ShareAlike 3.0 license](https://creativecommons.org/licenses/by-sa/3.0/).

### COCO Object Detection Baselines

#### Faster R-CNN:

| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | model id | download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| R50-C4 | 1x | 0.551 | 0.102 | 4.8 | 35.7 | 137257644 | model \| metrics |
| R50-DC5 | 1x | 0.380 | 0.068 | 5.0 | 37.3 | 137847829 | model \| metrics |
| R50-FPN | 1x | 0.210 | 0.038 | 3.0 | 37.9 | 137257794 | model \| metrics |
| R50-C4 | 3x | 0.543 | 0.104 | 4.8 | 38.4 | 137849393 | model \| metrics |
| R50-DC5 | 3x | 0.378 | 0.070 | 5.0 | 39.0 | 137849425 | model \| metrics |
| R50-FPN | 3x | 0.209 | 0.038 | 3.0 | 40.2 | 137849458 | model \| metrics |
| R101-C4 | 3x | 0.619 | 0.139 | 5.9 | 41.1 | 138204752 | model \| metrics |
| R101-DC5 | 3x | 0.452 | 0.086 | 6.1 | 40.6 | 138204841 | model \| metrics |
| R101-FPN | 3x | 0.286 | 0.051 | 4.1 | 42.0 | 137851257 | model \| metrics |
| X101-FPN | 3x | 0.638 | 0.098 | 6.7 | 43.0 | 139173657 | model \| metrics |

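
The introduction mentions that these baselines can be accessed from code through the `detectron2.model_zoo` APIs. The following is a minimal sketch of that usage, not a definitive recipe. The config path below (Faster R-CNN R50-FPN, 3x schedule) is an assumption chosen for illustration, and `load_baseline` is a hypothetical helper name. It requires detectron2 to be installed and downloads the checkpoint on first use.

```python
# Config path relative to detectron2's configs/ directory. This particular
# path is an assumption picked for illustration (Faster R-CNN R50-FPN, 3x).
CONFIG_PATH = "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"


def load_baseline(config_path: str = CONFIG_PATH, trained: bool = True):
    """Build a model zoo baseline and return (model, checkpoint_url).

    `load_baseline` is a hypothetical helper, not part of detectron2.
    """
    # Imported lazily so this module can be imported without detectron2.
    from detectron2 import model_zoo

    # URL of the trained checkpoint that belongs to this config file.
    checkpoint_url = model_zoo.get_checkpoint_url(config_path)
    # Build the model; with trained=True the zoo weights are loaded as well.
    model = model_zoo.get(config_path, trained=trained)
    return model, checkpoint_url
```

Calling `load_baseline()` returns a model together with the URL of its checkpoint; passing `trained=False` builds the same architecture without loading the trained zoo checkpoint.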
#### RetinaNet:

| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | model id | download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| R50 | 1x | 0.205 | 0.041 | 4.1 | 37.4 | 190397773 | model \| metrics |
| R50 | 3x | 0.205 | 0.041 | 4.1 | 38.7 | 190397829 | model \| metrics |
| R101 | 3x | 0.291 | 0.054 | 5.2 | 40.4 | 190397697 | model \| metrics |

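
The "lr sched" and "train time" columns can be turned into approximate epoch counts and wall-clock training time. The sketch below assumes a 1x schedule of 90,000 iterations with a total batch size of 16 images; neither value is stated on this page, so treat both as assumptions. With the rounded figure of 118,000 training images they give roughly 12 and 37 epochs, in line with the schedule description in the common settings.

```python
COCO_TRAIN_IMAGES = 118_000  # rounded train2017 size, as used on this page
BASE_ITERATIONS = 90_000     # ASSUMPTION: iterations in a 1x schedule
IMAGES_PER_BATCH = 16        # ASSUMPTION: total batch size across all GPUs


def schedule_epochs(multiplier: int, dataset_size: int = COCO_TRAIN_IMAGES) -> float:
    """Approximate number of passes over the dataset for an Nx schedule."""
    images_seen = multiplier * BASE_ITERATIONS * IMAGES_PER_BATCH
    return images_seen / dataset_size


def training_hours(multiplier: int, seconds_per_iter: float) -> float:
    """Approximate wall-clock hours, using the averaged s/iter from a table."""
    return multiplier * BASE_ITERATIONS * seconds_per_iter / 3600.0


# RetinaNet R50 rows from the table above (0.205 s/iter for both schedules).
print(round(schedule_epochs(1), 1), round(schedule_epochs(3), 1))
print(training_hours(1, 0.205), training_hours(3, 0.205))
```

Because training speed is averaged over the whole run and is refreshed with newer software versions, the resulting hours are only an estimate for the 8-GPU setup described in the introduction.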
#### RPN & Fast R-CNN:

| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | prop. AR | model id | download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RPN R50-C4 | 1x | 0.130 | 0.034 | 1.5 | | 51.6 | 137258005 | model \| metrics |
| RPN R50-FPN | 1x | 0.186 | 0.032 | 2.7 | | 58.0 | 137258492 | model \| metrics |
| Fast R-CNN R50-FPN | 1x | 0.140 | 0.029 | 2.6 | 37.8 | | 137635226 | model \| metrics |

### COCO Instance Segmentation Baselines with Mask R-CNN

| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R50-C4 | 1x | 0.584 | 0.110 | 5.2 | 36.8 | 32.2 | 137259246 | model \| metrics |
| R50-DC5 | 1x | 0.471 | 0.076 | 6.5 | 38.3 | 34.2 | 137260150 | model \| metrics |
| R50-FPN | 1x | 0.261 | 0.043 | 3.4 | 38.6 | 35.2 | 137260431 | model \| metrics |
| R50-C4 | 3x | 0.575 | 0.111 | 5.2 | 39.8 | 34.4 | 137849525 | model \| metrics |
| R50-DC5 | 3x | 0.470 | 0.076 | 6.5 | 40.0 | 35.9 | 137849551 | model \| metrics |
| R50-FPN | 3x | 0.261 | 0.043 | 3.4 | 41.0 | 37.2 | 137849600 | model \| metrics |
| R101-C4 | 3x | 0.652 | 0.145 | 6.3 | 42.6 | 36.7 | 138363239 | model \| metrics |
| R101-DC5 | 3x | 0.545 | 0.092 | 7.6 | 41.9 | 37.3 | 138363294 | model \| metrics |
| R101-FPN | 3x | 0.340 | 0.056 | 4.6 | 42.9 | 38.6 | 138205316 | model \| metrics |
| X101-FPN | 3x | 0.690 | 0.103 | 7.2 | 44.3 | 39.5 | 139653917 | model \| metrics |

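
The common settings state that FPN obtains the best speed/accuracy tradeoff among the three backbone combinations. One way to check such a claim against the table is a Pareto-dominance test, sketched below on the three R50 3x Mask R-CNN rows. The `Row` type and function names are hypothetical, and "tradeoff" is interpreted here as inference time versus box AP, which is only one possible reading.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    name: str
    inference_s_per_im: float  # lower is better
    box_ap: float              # higher is better


# R50 rows with the 3x schedule, copied from the Mask R-CNN table above.
ROWS_3X = [
    Row("R50-C4", 0.111, 39.8),
    Row("R50-DC5", 0.076, 40.0),
    Row("R50-FPN", 0.043, 41.0),
]


def images_per_second(row: Row) -> float:
    """Convert the s/im column into throughput at batch size 1."""
    return 1.0 / row.inference_s_per_im


def dominates(a: Row, b: Row) -> bool:
    """True if `a` is at least as good as `b` on both axes and better on one."""
    no_worse = a.inference_s_per_im <= b.inference_s_per_im and a.box_ap >= b.box_ap
    better = a.inference_s_per_im < b.inference_s_per_im or a.box_ap > b.box_ap
    return no_worse and better


def pareto_front(rows: List[Row]) -> List[Row]:
    """Rows that no other row dominates."""
    return [r for r in rows if not any(dominates(o, r) for o in rows)]


print([r.name for r in pareto_front(ROWS_3X)])
```

On these three rows only R50-FPN remains on the front, since it is both faster and more accurate than the C4 and DC5 variants.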
#### New baselines using Large-Scale Jitter and Longer Training Schedule

The following baselines of COCO Instance Segmentation with Mask R-CNN are generated
using a longer training schedule and large-scale jitter as described in Google's
[Simple Copy-Paste Data Augmentation](https://arxiv.org/pdf/2012.07177.pdf) paper. These
models are trained from scratch using random initialization. These baselines exceed the
previous Mask R-CNN baselines.

In the following table, one epoch consists of training on 118000 COCO images.

| Name | epochs | train time (s/im) | inference time (s/im) | box AP | mask AP | model id | download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| R50-FPN | 100 | 0.376 | 0.069 | 44.6 | 40.3 | 42047764 | model \| metrics |
| R50-FPN | 200 | 0.376 | 0.069 | 46.3 | 41.7 | 42047638 | model \| metrics |
| R50-FPN | 400 | 0.376 | 0.069 | 47.4 | 42.5 | 42019571 | model \| metrics |
| R101-FPN | 100 | 0.518 | 0.073 | 46.4 | 41.6 | 42025812 | model \| metrics |
| R101-FPN | 200 | 0.518 | 0.073 | 48.0 | 43.1 | 42131867 | model \| metrics |
| R101-FPN | 400 | 0.518 | 0.073 | 48.9 | 43.7 | 42073830 | model \| metrics |
| regnetx_4gf_dds_FPN | 100 | 0.474 | 0.071 | 46.0 | 41.3 | 42047771 | model \| metrics |
| regnetx_4gf_dds_FPN | 200 | 0.474 | 0.071 | 48.1 | 43.1 | 42132721 | model \| metrics |
| regnetx_4gf_dds_FPN | 400 | 0.474 | 0.071 | 48.6 | 43.5 | 42025447 | model \| metrics |
| regnety_4gf_dds_FPN | 100 | 0.487 | 0.073 | 46.1 | 41.6 | 42047784 | model \| metrics |
| regnety_4gf_dds_FPN | 200 | 0.487 | 0.072 | 47.8 | 43.0 | 42047642 | model \| metrics |
| regnety_4gf_dds_FPN | 400 | 0.487 | 0.072 | 48.2 | 43.3 | 42045954 | model \| metrics |

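
Since this table reports epochs rather than an "lr sched" multiplier, a small calculation helps relate the rows to each other. The sketch below uses the stated 118000 images per epoch and the R50-FPN rows to show how many images each schedule processes and how much AP each doubling adds. The helper names are hypothetical.

```python
IMAGES_PER_EPOCH = 118_000  # as stated for the table above

# (box AP, mask AP) per epoch count, copied from the R50-FPN rows.
R50_FPN = {100: (44.6, 40.3), 200: (46.3, 41.7), 400: (47.4, 42.5)}


def images_seen(epochs: int) -> int:
    """Total number of training images processed in a schedule."""
    return epochs * IMAGES_PER_EPOCH


def ap_gain(results: dict, start: int, end: int) -> tuple:
    """(box AP gain, mask AP gain) between two schedules, rounded to 0.1."""
    (box0, mask0), (box1, mask1) = results[start], results[end]
    return round(box1 - box0, 1), round(mask1 - mask0, 1)


for epochs in sorted(R50_FPN):
    print(epochs, images_seen(epochs))
print(ap_gain(R50_FPN, 100, 200), ap_gain(R50_FPN, 200, 400))
```

For R50-FPN the second doubling of the schedule yields a smaller gain than the first, which is worth weighing against the extra training cost.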
### COCO Person Keypoint Detection Baselines with Keypoint R-CNN

| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | kp. AP | model id | download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R50-FPN | 1x | 0.315 | 0.072 | 5.0 | 53.6 | 64.0 | 137261548 | model \| metrics |
| R50-FPN | 3x | 0.316 | 0.066 | 5.0 | 55.4 | 65.5 | 137849621 | model \| metrics |
| R101-FPN | 3x | 0.390 | 0.076 | 6.1 | 56.4 | 66.1 | 138363331 | model \| metrics |
| X101-FPN | 3x | 0.738 | 0.121 | 8.7 | 57.3 | 66.0 | 139686956 | model \| metrics |

### COCO Panoptic Segmentation Baselines with Panoptic FPN

| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | mask AP | PQ | model id | download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R50-FPN | 1x | 0.304 | 0.053 | 4.8 | 37.6 | 34.7 | 39.4 | 139514544 | model \| metrics |
| R50-FPN | 3x | 0.302 | 0.053 | 4.8 | 40.0 | 36.5 | 41.5 | 139514569 | model \| metrics |
| R101-FPN | 3x | 0.392 | 0.066 | 6.0 | 42.4 | 38.5 | 43.0 | 139514519 | model \| metrics |

### LVIS Instance Segmentation Baselines with Mask R-CNN

Mask R-CNN baselines on the [LVIS dataset](https://lvisdataset.org), v0.5.
These baselines are described in Table 3(c) of the [LVIS paper](https://arxiv.org/abs/1908.03195).

NOTE: the 1x schedule here has the same number of __iterations__ as the COCO 1x baselines.
This corresponds to roughly 24 epochs of LVIS v0.5 data.
The final results of these configs have large variance across different runs.

| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R50-FPN | 1x | 0.292 | 0.107 | 7.1 | 23.6 | 24.4 | 144219072 | model \| metrics |
| R101-FPN | 1x | 0.371 | 0.114 | 7.8 | 25.6 | 25.9 | 144219035 | model \| metrics |
| X101-FPN | 1x | 0.712 | 0.151 | 10.2 | 26.7 | 27.1 | 144219108 | model \| metrics |

### Cityscapes & Pascal VOC Baselines

Simple baselines for

* Mask R-CNN on Cityscapes instance segmentation (initialized from COCO pre-training, then trained on Cityscapes fine annotations only)
* Faster R-CNN on PASCAL VOC object detection (trained on VOC 2007 train+val + VOC 2012 train+val, tested on VOC 2007 using 11-point interpolated AP)

| Name | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | box AP50 | mask AP | model id | download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R50-FPN, Cityscapes | 0.240 | 0.078 | 4.4 | | | 36.5 | 142423278 | model \| metrics |
| R50-C4, VOC | 0.537 | 0.081 | 4.8 | 51.9 | 80.3 | | 142202221 | model \| metrics |

### Other Settings

Ablations for Deformable Conv and Cascade R-CNN:

| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline R50-FPN | 1x | 0.261 | 0.043 | 3.4 | 38.6 | 35.2 | 137260431 | model \| metrics |
| Deformable Conv | 1x | 0.342 | 0.048 | 3.5 | 41.5 | 37.5 | 138602867 | model \| metrics |
| Cascade R-CNN | 1x | 0.317 | 0.052 | 4.0 | 42.1 | 36.4 | 138602847 | model \| metrics |
| Baseline R50-FPN | 3x | 0.261 | 0.043 | 3.4 | 41.0 | 37.2 | 137849600 | model \| metrics |
| Deformable Conv | 3x | 0.349 | 0.047 | 3.5 | 42.7 | 38.5 | 144998336 | model \| metrics |
| Cascade R-CNN | 3x | 0.328 | 0.053 | 4.0 | 44.3 | 38.5 | 144998488 | model \| metrics |


Ablations for normalization methods, and a few models trained from scratch following [Rethinking ImageNet Pre-training](https://arxiv.org/abs/1811.08883).
(Note: The baseline uses `2fc` head while the others use [`4conv1fc` head](https://arxiv.org/abs/1803.08494).)

| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline R50-FPN | 3x | 0.261 | 0.043 | 3.4 | 41.0 | 37.2 | 137849600 | model \| metrics |
| GN | 3x | 0.309 | 0.060 | 5.6 | 42.6 | 38.6 | 138602888 | model \| metrics |
| SyncBN | 3x | 0.345 | 0.053 | 5.5 | 41.9 | 37.8 | 169527823 | model \| metrics |
| GN (from scratch) | 3x | 0.338 | 0.061 | 7.2 | 39.9 | 36.6 | 138602908 | model \| metrics |
| GN (from scratch) | 9x | N/A | 0.061 | 7.2 | 43.7 | 39.6 | 183808979 | model \| metrics |
| SyncBN (from scratch) | 9x | N/A | 0.055 | 7.2 | 43.6 | 39.3 | 184226666 | model \| metrics |


A few very large models trained for a long time, for demo purposes. They are trained using multiple machines:

| Name | inference time (s/im) | train mem (GB) | box AP | mask AP | PQ | model id | download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Panoptic FPN R101 | 0.098 | 11.4 | 47.4 | 41.3 | 46.1 | 139797668 | model \| metrics |
| Mask R-CNN X152 | 0.234 | 15.1 | 50.2 | 44.0 | | 18131413 | model \| metrics |
| above + test-time aug. | | | 51.9 | 45.9 | | | |

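
As noted in "How to Read the Tables", every model file name contains a prefix of the file's md5 hash, which can be used to check a download. The sketch below shows one way to do that check. It assumes the prefix is the last underscore-separated hexadecimal token of the file stem (for example a hypothetical `model_final_ab12cd.pkl`); this naming pattern and the function names are assumptions, not something this page specifies.

```python
import hashlib
import re
from pathlib import Path

# ASSUMPTION: the md5 prefix is the trailing "_<hex>" token of the file stem.
_PREFIX_RE = re.compile(r"_([0-9a-f]{4,32})$")


def md5_hex(path, chunk_size: int = 1 << 20) -> str:
    """Full md5 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_prefix_matches(path) -> bool:
    """Compare the hex prefix embedded in the file name with the real md5."""
    match = _PREFIX_RE.search(Path(path).stem)
    if match is None:
        raise ValueError(f"no md5 prefix found in file name: {path}")
    return md5_hex(path).startswith(match.group(1))
```

A `False` result suggests a truncated or corrupted download, in which case the file should be fetched again.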