# Prompt-Segment-Anything
This is an implementation of zero-shot instance segmentation using [Segment Anything](https://github.com/facebookresearch/segment-anything). Thanks to the authors of Segment Anything for their wonderful work!
This repository is based on [MMDetection](https://github.com/open-mmlab/mmdetection) and includes some code from [H-Deformable-DETR](https://github.com/HDETR/H-Deformable-DETR) and [FocalNet-DINO](https://github.com/FocalNet/FocalNet-DINO).
![example1](assets/example1.jpg)
## News
**2023.04.12** Multimask output mode and cascade prompt mode are available now.
**2023.04.11** Our [demo](https://huggingface.co/spaces/rockeycoss/Prompt-Segment-Anything-Demo) is available now. Please feel free to check it out.
**2023.04.11** [Swin-L+H-Deformable-DETR + SAM](https://github.com/RockeyCoss/Instance-Segment-Anything/blob/master/projects/configs/hdetr/swin-l-hdetr_sam-vit-h.py) / [FocalNet-L+DINO + SAM](https://github.com/RockeyCoss/Instance-Segment-Anything/blob/master/projects/configs/hdetr/swin-l-hdetr_sam-vit-h.py) achieve strong COCO instance segmentation results: mask AP = 46.8 / 49.1, obtained by simply prompting SAM with boxes predicted by Swin-L+H-Deformable-DETR / FocalNet-L+DINO. (For comparison, ViTDet achieves mask AP = 46.5.)
## Catalog
- [x] Support Swin-L+H-Deformable-DETR+SAM
- [x] Support FocalNet-L+DINO+SAM
- [x] Support R50+H-Deformable-DETR+SAM / Swin-T+H-Deformable-DETR+SAM
- [x] Support HuggingFace Gradio demo
- [x] Support cascade prompts (box prompt + mask prompt)
## Box-as-Prompt Results
| Detector | SAM | multimask output | Detector's Box AP | Mask AP | Config |
| :----------------------: | :-------: | :----------------: | :---------------: | :-----: | :----: |
| R50+H-Deformable-DETR | sam-vit-b | :x: | 50.0 | 38.2 | [config](https://github.com/RockeyCoss/Instance-Segment-Anything/blob/master/projects/configs/hdetr/r50-hdetr_sam-vit-b.py) |
| R50+H-Deformable-DETR | sam-vit-b | :heavy_check_mark: | 50.0 | 39.9 | [config](https://github.com/RockeyCoss/Instance-Segment-Anything/blob/master/projects/configs/hdetr/r50-hdetr_sam-vit-b_best-in-multi.py) |
| R50+H-Deformable-DETR | sam-vit-l | :x: | 50.0 | 41.5 | [config](https://github.com/RockeyCoss/Instance-Segment-Anything/blob/master/projects/configs/hdetr/r50-hdetr_sam-vit-l.py) |
| Swin-T+H-Deformable-DETR | sam-vit-b | :x: | 53.2 | 40.0 | [config](https://github.com/RockeyCoss/Instance-Segment-Anything/blob/master/projects/configs/hdetr/swin-t-hdetr_sam-vit-b.py) |
| Swin-T+H-Deformable-DETR | sam-vit-l | :x: | 53.2 | 43.5 | [config](https://github.com/RockeyCoss/Instance-Segment-Anything/blob/master/projects/configs/hdetr/swin-t-hdetr_sam-vit-l.py) |
| Swin-L+H-Deformable-DETR | sam-vit-b | :x: | 58.0 | 42.5 | [config](https://github.com/RockeyCoss/Instance-Segment-Anything/blob/master/projects/configs/hdetr/swin-l-hdetr_sam-vit-b.py) |
| Swin-L+H-Deformable-DETR | sam-vit-l | :x: | 58.0 | 46.3 | [config](https://github.com/RockeyCoss/Instance-Segment-Anything/blob/master/projects/configs/hdetr/swin-l-hdetr_sam-vit-l.py) |
| Swin-L+H-Deformable-DETR | sam-vit-h | :x: | 58.0 | 46.8 | [config](https://github.com/RockeyCoss/Instance-Segment-Anything/blob/master/projects/configs/hdetr/swin-l-hdetr_sam-vit-h.py) |
| FocalNet-L+DINO | sam-vit-b | :x: | 63.2 | 44.5 | [config](https://github.com/RockeyCoss/Instance-Segment-Anything/blob/master/projects/configs/hdetr/swin-l-hdetr_sam-vit-b.py) |
| FocalNet-L+DINO | sam-vit-l | :x: | 63.2 | 48.6 | [config](https://github.com/RockeyCoss/Instance-Segment-Anything/blob/master/projects/configs/hdetr/swin-l-hdetr_sam-vit-l.py) |
| FocalNet-L+DINO | sam-vit-h | :x: | 63.2 | 49.1 | [config](https://github.com/RockeyCoss/Instance-Segment-Anything/blob/master/projects/configs/hdetr/swin-l-hdetr_sam-vit-h.py) |
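To make the box-as-prompt idea concrete, here is a minimal sketch of prompting SAM with a single detector box through the `segment_anything` API. This is an illustration, independent of this repository's configs and pipeline code; the image path and box coordinates are placeholders.
```python
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load a SAM backbone (checkpoint path matches the "Prepare Checkpoints" step below).
sam = sam_model_registry["vit_b"](checkpoint="ckpt/sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# Any RGB image; the file name is a placeholder.
image = cv2.cvtColor(cv2.imread("demo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A detector-predicted box in XYXY pixel coordinates (placeholder values).
box = np.array([100, 150, 400, 500])

# multimask_output=False corresponds to the ":x:" rows in the table above.
masks, scores, low_res_logits = predictor.predict(box=box, multimask_output=False)
print(masks.shape, scores)  # (1, H, W) boolean mask and its predicted IoU
```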
## Cascade-Prompt Results
| Detector | SAM | multimask output | Detector's Box AP | Mask AP | Config |
| :-------------------: | :-------: | :----------------: | :---------------: | :-----: | :----: |
| R50+H-Deformable-DETR | sam-vit-b | :x: | 50.0 | 38.8 | [config](https://github.com/RockeyCoss/Instance-Segment-Anything/blob/master/projects/configs/hdetr/r50-hdetr_sam-vit-b_cascade.py) |
| R50+H-Deformable-DETR | sam-vit-b | :heavy_check_mark: | 50.0 | 40.5 | [config](https://github.com/RockeyCoss/Instance-Segment-Anything/blob/master/projects/configs/hdetr/r50-hdetr_sam-vit-b_best-in-multi_cascade.py) |
***Note***
**multimask output**: If multimask output is :heavy_check_mark:, SAM predicts three masks for each prompt, and the segmentation result is the one with the highest predicted IoU. If multimask output is :x:, SAM returns a single mask for each prompt, which is used as the segmentation result.
**cascade-prompt**: In the cascade-prompt setting, segmentation involves two stages. In the first stage, a coarse mask is predicted from a bounding-box prompt. The second stage then uses both the bounding box and the coarse mask as prompts to predict the final segmentation result. Note that if multimask output is :heavy_check_mark:, the first stage predicts three coarse masks, and the second stage uses the one with the highest predicted IoU as the mask prompt.
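At the SAM API level, the cascade-prompt behaviour described above can be sketched roughly as follows. This is an illustration rather than the repository's actual pipeline code; `predictor` and `box` are assumed to be set up as in the earlier box-prompt snippet.
```python
import numpy as np

# Stage 1: box prompt with multimask output; SAM returns three candidate masks.
masks, scores, low_res_logits = predictor.predict(
    box=box,
    multimask_output=True,
)
best = int(np.argmax(scores))  # keep the candidate with the highest predicted IoU

# Stage 2: prompt with both the box and the best coarse mask (as low-res logits).
final_masks, final_scores, _ = predictor.predict(
    box=box,
    mask_input=low_res_logits[best][None, :, :],  # shape (1, 256, 256)
    multimask_output=False,
)
```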
## Installation
A prebuilt Docker image is available on Docker Hub:
```
docker pull kxqt/prompt-sam-torch1.12-cuda11.6:20230410
nvidia-docker run -it --shm-size=4096m -v {your_path}:{path_in_docker} kxqt/prompt-sam-torch1.12-cuda11.6:20230410
```
We test the models under `python=3.7.10, pytorch=1.10.2, cuda=10.2`. Other versions may work as well.
1. Clone this repository
```
git clone https://github.com/RockeyCoss/Instance-Segment-Anything
cd Instance-Segment-Anything
```
2. Install PyTorch
```bash
# an example
pip install torch torchvision
```
3. Install MMCV
```
pip install -U openmim
mim install "mmcv>=2.0.0"
```
4. Install MMDetection's requirements
```
pip install -r requirements.txt
```
5. Compile CUDA operators
```bash
cd projects/instance_segment_anything/ops
python setup.py build install
cd ../../..
```
## Prepare COCO Dataset
Please refer to [data preparation](https://mmdetection.readthedocs.io/en/latest/user_guides/dataset_prepare.html).
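For reference, MMDetection's default configs expect roughly the following layout under the repository root (this is the standard MMDetection convention, not something specific to this project):
```
Instance-Segment-Anything
├── data
│   └── coco
│       ├── annotations
│       │   ├── instances_train2017.json
│       │   └── instances_val2017.json
│       ├── train2017
│       └── val2017
```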
## Prepare Checkpoints
1. Install wget
```
pip install wget
```
2. SAM checkpoints
```bash
mkdir ckpt
cd ckpt
python -m wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth
python -m wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_l_0b3195.pth
python -m wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
cd ..
```
3. Here are the checkpoints for the detection models. You can download only the checkpoints you need.
```bash
# R50+H-Deformable-DETR
cd ckpt
python -m wget https://github.com/HDETR/H-Deformable-DETR/releases/download/v0.1/r50_hybrid_branch_lambda1_group6_t1500_dp0_mqs_lft_deformable_detr_plus_iterative_bbox_refinement_plus_plus_two_stage_36eps.pth -o r50_hdetr.pth
cd ..
python tools/convert_ckpt.py ckpt/r50_hdetr.pth ckpt/r50_hdetr.pth
# Swin-T+H-Deformable-DETR
cd ckpt
python -m wget https://github.com/HDETR/H-Deformable-DETR/releases/download/v0.1/swin_tiny_hybrid_branch_lambda1_group6_t1500_dp0_mqs_lft_deformable_detr_plus_iterative_bbox_refinement_plus_plus_two_stage_36eps.pth -o swin_t_hdetr.pth
cd ..
python tools/convert_ckpt.py ckpt/swin_t_hdetr.pth ckpt/swin_t_hdetr.pth
# Swin-L+H-Deformable-DETR
cd ckpt
python -m wget https://github.com/HDETR/H-Deformable-DETR/releases/download/v0.1/decay0.05_drop_path0.5_swin_large_hybrid_branch_lambda1_group6_t1500_n900_dp0_mqs_lft_deformable_detr_plus_iterative_bbox_refinement_plus_plus_two_stage_36eps.pth -o swin_l_hdetr.pth
cd ..
python tools/convert_ckpt.py ckpt/swin_l_hdetr.pth ckpt/swin_l_hdetr.pth
# FocalNet-L+DINO
cd ckpt
python -m wget https://projects4jw.blob.core.windows.net/focalnet/release/detection/focalnet_large_fl4_o365_finetuned_on_coco.pth -o focalnet_l_dino.pth
cd ..
python tools/convert_ckpt.py ckpt/focalnet_l_dino.pth ckpt/focalnet_l_dino.pth
```
## Run Evaluation
1. Evaluate Metrics (a concrete example is given after this list)
```bash
# single GPU
python tools/test.py path/to/the/config/file --eval segm
# multiple GPUs
bash tools/dist_test.sh path/to/the/config/file num_gpus --eval segm
```
2. Visualize Segmentation Results
```bash
python tools/test.py path/to/the/config/file --show-dir path/to/the/visualization/results
```
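For example, to evaluate the R50+H-Deformable-DETR + sam-vit-b configuration from the table above on 8 GPUs (the config path is taken from the linked config file; the GPU count is just an example):
```bash
bash tools/dist_test.sh projects/configs/hdetr/r50-hdetr_sam-vit-b.py 8 --eval segm
```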
## Gradio Demo
We also provide a Gradio-based UI for displaying the segmentation results. To launch the demo, simply run the following commands in a terminal:
```bash
pip install gradio
python app.py
```
This demo is also hosted on HuggingFace [here](https://huggingface.co/spaces/rockeycoss/Prompt-Segment-Anything-Demo).
## More Segmentation Examples
![example2](assets/example2.jpg)
![example3](assets/example3.jpg)
![example4](assets/example4.jpg)
![example5](assets/example5.jpg)
## Citation
**Segment Anything**
```latex
@article{kirillov2023segany,
  title={Segment Anything},
  author={Kirillov, Alexander and Mintun, Eric and Ravi, Nikhila and Mao, Hanzi and Rolland, Chloe and Gustafson, Laura and Xiao, Tete and Whitehead, Spencer and Berg, Alexander C. and Lo, Wan-Yen and Doll{\'a}r, Piotr and Girshick, Ross},
  journal={arXiv preprint arXiv:2304.02643},
  year={2023}
}
```
**H-Deformable-DETR**
```latex
@article{jia2022detrs,
  title={DETRs with Hybrid Matching},
  author={Jia, Ding and Yuan, Yuhui and He, Haodi and Wu, Xiaopei and Yu, Haojun and Lin, Weihong and Sun, Lei and Zhang, Chao and Hu, Han},
  journal={arXiv preprint arXiv:2207.13080},
  year={2022}
}
```
**Swin Transformer**
```latex
@inproceedings{liu2021Swin,
  title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
  author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2021}
}
```
**DINO**
```latex
@misc{zhang2022dino,
  title={DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection},
  author={Hao Zhang and Feng Li and Shilong Liu and Lei Zhang and Hang Su and Jun Zhu and Lionel M. Ni and Heung-Yeung Shum},
  year={2022},
  eprint={2203.03605},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
**FocalNet**
```latex
@misc{yang2022focalnet,
  author={Yang, Jianwei and Li, Chunyuan and Dai, Xiyang and Yuan, Lu and Gao, Jianfeng},
  title={Focal Modulation Networks},
  publisher={arXiv},
  year={2022}
}
```