RTMDet: An Empirical Study of Designing Real-Time Object Detectors

Abstract

In this paper, we aim to design an efficient real-time object detector that exceeds the YOLO series and is easily extensible for many object recognition tasks such as instance segmentation and rotated object detection. To obtain a more efficient model architecture, we explore an architecture that has compatible capacities in the backbone and neck, constructed by a basic building block that consists of large-kernel depth-wise convolutions. We further introduce soft labels when calculating matching costs in the dynamic label assignment to improve accuracy. Together with better training techniques, the resulting object detector, named RTMDet, achieves 52.8% AP on COCO with 300+ FPS on an NVIDIA 3090 GPU, outperforming the current mainstream industrial detectors. RTMDet achieves the best parameter-accuracy trade-off with tiny/small/medium/large/extra-large model sizes for various application scenarios, and obtains new state-of-the-art performance on real-time instance segmentation and rotated object detection. We hope the experimental results can provide new insights into designing versatile real-time object detectors for many object recognition tasks.

RTMDet-l model structure

Results and Models

Object Detection

Model	size	Params(M)	FLOPs(G)	TRT-FP16-Latency(ms)	box AP	TTA box AP	Config	Download
RTMDet-tiny	640	4.8	8.1	0.98	41.0	42.7	config	model \| log
RTMDet-tiny *	640	4.8	8.1	0.98	41.8 (+0.8)	43.2 (+0.5)	config	model \| log
RTMDet-s	640	8.89	14.8	1.22	44.6	45.8	config	model \| log
RTMDet-s *	640	8.89	14.8	1.22	45.7 (+1.1)	47.3 (+1.5)	config	model \| log
RTMDet-m	640	24.71	39.27	1.62	49.3	50.9	config	model \| log
RTMDet-m *	640	24.71	39.27	1.62	50.2 (+0.9)	51.9 (+1.0)	config	model \| log
RTMDet-l	640	52.3	80.23	2.44	51.4	53.1	config	model \| log
RTMDet-l *	640	52.3	80.23	2.44	52.3 (+0.9)	53.7 (+0.6)	config	model \| log
RTMDet-x	640	94.86	141.67	3.10	52.8	54.2	config	model \| log
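The latency column can be turned into throughput with simple arithmetic, which is a quick way to check the "300+ FPS" figure quoted in the abstract against the RTMDet-x row. The snippet below is only a sketch: the helper name `latency_to_fps` is made up for illustration, the latencies are copied from the table above, and the result is an upper bound because the reported latency is measured at batch size 1 without NMS.

```python
# Latencies (ms) copied from the TRT-FP16-Latency column of the table above.
LATENCY_MS = {
    "RTMDet-tiny": 0.98,
    "RTMDet-s": 1.22,
    "RTMDet-m": 1.62,
    "RTMDet-l": 2.44,
    "RTMDet-x": 3.10,
}


def latency_to_fps(latency_ms: float) -> float:
    """Convert a per-image latency in milliseconds to frames per second.

    Hypothetical helper; assumes batch size 1, so FPS = 1000 / latency.
    """
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


FPS = {name: latency_to_fps(ms) for name, ms in LATENCY_MS.items()}

if __name__ == "__main__":
    for name, fps in FPS.items():
        print(f"{name}: {fps:.0f} FPS")
```

Under these assumptions RTMDet-x (3.10 ms) works out to roughly 322 FPS, which is consistent with the abstract. End-to-end throughput with NMS and data loading will be lower.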

Note:

The inference speed of RTMDet is measured on an NVIDIA 3090 GPU with TensorRT 8.4.3, cuDNN 8.2.0, FP16, batch size=1, and without NMS.
For a fair comparison, the config of bbox postprocessing is changed to be consistent with YOLOv5/6/7 after PR#9494, bringing about 0.1~0.3% AP improvement.
TTA means Test Time Augmentation. It performs 3 multi-scale transformations on the image, each followed by 2 flipping transformations (flipped and not flipped). To enable it, specify --tta when testing. See TTA for details.
* means checkpoints are trained with knowledge distillation. More details can be found in RTMDet distillation.
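To make the TTA note concrete, the following sketch enumerates the augmented views that 3 scales combined with 2 flip states produce, and shows how a box predicted on a flipped image could be mapped back. It is not the MMDetection implementation: the scale values, the choice of a horizontal flip, and the names `tta_views` and `unflip_boxes` are all assumptions for illustration, since this README does not list the actual scales or the flip direction.

```python
from itertools import product

# Hypothetical scales: the note only says "3 multi-scale transformations".
SCALES = [(320, 320), (640, 640), (960, 960)]
# The two flip states from the note: not flipped and flipped.
FLIPS = [False, True]


def tta_views(scales=SCALES, flips=FLIPS):
    """Return one dict per augmented view (every scale with every flip state)."""
    return [{"scale": s, "flip": f} for s, f in product(scales, flips)]


def unflip_boxes(boxes, image_width):
    """Map (x1, y1, x2, y2) boxes from a horizontally flipped image back.

    Assumes a horizontal flip, which the note does not specify.
    """
    return [(image_width - x2, y1, image_width - x1, y2)
            for x1, y1, x2, y2 in boxes]


if __name__ == "__main__":
    views = tta_views()
    print(len(views), "views per image")
    print(unflip_boxes([(10, 20, 110, 220)], image_width=640))
```

Each image is therefore run through the detector six times under these assumptions. The per-view predictions still have to be rescaled to the original resolution and merged, a step this sketch leaves out.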

Rotated Object Detection

RTMDet-R achieves state-of-the-art results on various remote sensing datasets.

Backbone	Pretrain	Epoch	Batch Size	Aug	mmAP	mAP50	mAP75	Mem (GB)	Params(M)	FLOPs(G)	TRT-FP16-Latency(ms)	Config	Download
RTMDet-tiny	IN	36	1xb8	RR	46.94	75.07	50.11	12.7	4.88	20.45	4.40	config	model \| log
RTMDet-s	IN	36	1xb8	RR	48.99	77.33	52.65	16.6	8.86	37.62	4.86	config	model \| log
RTMDet-m	IN	36	2xb4	RR	50.38	78.43	54.28	10.9	24.67	99.76	7.82	config	model \| log
RTMDet-l	IN	36	2xb4	RR	50.61	78.66	54.95	16.1	52.27	204.21	10.82	config	model \| log
RTMDet-tiny	IN	36	1xb8	MS+RR	-	-	-		4.88	20.45	4.40	config	\|
RTMDet-s	IN	36	1xb8	MS+RR	-	-	-		8.86	37.62	4.86	config	\|
RTMDet-m	IN	36	2xb4	MS+RR	-	-	-		24.67	99.76	7.82	config	\|
RTMDet-l	IN	36	2xb4	MS+RR	-	-	-		52.27	204.21	10.82	config	\|
RTMDet-l	COCO	36	2xb4	MS+RR	-	-	-		52.27	204.21	10.82	config	\|
RTMDet-l	IN	100	2xb4	Mixup+Mosaic+RR	55.05	80.14	61.32	19.6	52.27	204.21	10.82	config	model \| log

Note:

Please follow the doc to get started with rotated detection: Rotated Object Detection.
We follow the latest metrics from the DOTA evaluation server; the original VOC-format mAP is now reported as mAP50.
All models are trained with an image size of 1024×1024.
IN means ImageNet pretrain, COCO means COCO pretrain.
For Aug, RR means RandomRotate, and MS means multi-scale augmentation during data preparation.
The inference speed here is measured on an NVIDIA 2080Ti GPU with TensorRT 8.4.3, cuDNN 8.2.0, FP16, batch size=1, and with NMS.
Currently, the training process of RTMDet-R tiny is unstable and may show a 1% accuracy fluctuation; we will continue to investigate the cause.
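The Batch Size column uses tags such as 1xb8 and 2xb4, which this README does not explain. Assuming they follow the usual OpenMMLab config-naming convention (number of GPUs, then samples per GPU), a small parser shows how to read them. The function name `parse_batch_tag` is hypothetical and the interpretation is an assumption, not something stated above.

```python
import re


def parse_batch_tag(tag: str) -> tuple[int, int, int]:
    """Split a tag like '2xb4' into (gpus, samples_per_gpu, total_batch).

    Assumes the '<gpus>xb<samples_per_gpu>' naming convention.
    """
    match = re.fullmatch(r"(\d+)xb(\d+)", tag)
    if match is None:
        raise ValueError(f"unrecognized batch tag: {tag!r}")
    gpus, per_gpu = int(match.group(1)), int(match.group(2))
    return gpus, per_gpu, gpus * per_gpu


if __name__ == "__main__":
    for tag in ("1xb8", "2xb4"):
        gpus, per_gpu, total = parse_batch_tag(tag)
        print(f"{tag}: {gpus} GPU(s) x {per_gpu} per GPU = {total} total")
```

Read this way, every row in the table trains with a total batch size of 8, so the rows differ in how the batch is split across GPUs rather than in its overall size.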

Citation

@misc{lyu2022rtmdet,
      title={RTMDet: An Empirical Study of Designing Real-Time Object Detectors},
      author={Chengqi Lyu and Wenwei Zhang and Haian Huang and Yue Zhou and Yudong Wang and Yanyi Liu and Shilong Zhang and Kai Chen},
      year={2022},
      eprint={2212.07784},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}