Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

Changes to Original

This is a modified version of Florence-2-large-ft. The original failed the conversion to safetensors using the Huggingface Space Safetensors/convert due to inconsistencies with some of the data pointers. The convert_to_safetensors.py script does the minimal steps to enable conversion to safetensors. It does check that the resulting tensors are equal and validates against the single image listed below to ensure both .bin and .safetensors provide the same output.

Only the modeling_florence2.py file has been modified:

  • Added Florence2LanguageForConditionalGeneration._tie_weights() which was missing
  • Added GenerationMixin as a parent class to Florence2LanguageForConditionalGeneration and Florence2ForConditionalGeneration to stop the deprecation warning that PreTrainedModel will NOT inherit from GenerationMixin from v4.50 onwards.
  • Added @torch.no_grad() decorator to the generate() function to follow standard transformers usage of turning off gradient accumulation. Without this, VRAM usage when up significantly each time the batch size would increase by 1.
  • Fixed batch attention masking following pawlowskipawel's PR
  • Changed importing of timm following deprecation warnings.

Model Summary

This Hub repository contains a HuggingFace's transformers implementation of Florence-2 model from Microsoft.

Florence-2 is an advanced vision foundation model that uses a prompt-based approach to handle a wide range of vision and vision-language tasks. Florence-2 can interpret simple text prompts to perform tasks like captioning, object detection, and segmentation. It leverages our FLD-5B dataset, containing 5.4 billion annotations across 126 million images, to master multi-task learning. The model's sequence-to-sequence architecture enables it to excel in both zero-shot and fine-tuned settings, proving to be a competitive vision foundation model.

Resources and Technical Documentation:

Model Model size Model Description
Florence-2-base[HF] 0.23B Pretrained model with FLD-5B
Florence-2-large[HF] 0.77B Pretrained model with FLD-5B
Florence-2-base-ft[HF] 0.23B Finetuned model on a colletion of downstream tasks
Florence-2-large-ft[HF] 0.77B Finetuned model on a colletion of downstream tasks

How to Get Started with the Model

Use the code below to get started with the model. All models are trained with float16.

import requests

from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForCausalLM 


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForCausalLM.from_pretrained(
  "mrhendrey/Florence-2-large-ft-safetensors",
  torch_dtype=torch_dtype,
  trust_remote_code=True,
  use_safetensors=True
).to(device)
processor = AutoProcessor.from_pretrained(
  "mrhendrey/Florence-2-large-ft-safetensors", trust_remote_code=True
)

prompt = "<OD>"

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(
  text=prompt, images=image, return_tensors="pt"
).to(device, torch_dtype)

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    do_sample=False,
    num_beams=3
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

parsed_answer = processor.post_process_generation(
  generated_text, task="<OD>", image_size=(image.width, image.height)
)

print(parsed_answer)

To take advantage of batching the code changes slightly

import requests

from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForCausalLM 


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForCausalLM.from_pretrained(
  "mrhendrey/Florence-2-large-ft-safetensors",
  torch_dtype=torch_dtype,
  trust_remote_code=True,
  use_safetensors=True
).to(device)
processor = AutoProcessor.from_pretrained(
  "mrhendrey/Florence-2-large-ft-safetensors", trust_remote_code=True
)

prompts = ["<OD>", "<CAPTION>"]

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(
  text=prompts, images=[image]*2, return_tensors="pt", padding=True
).to(device, torch_dtype)

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=1024,
    do_sample=False,
    num_beams=3
)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=False)

parsed_answers = [
  processor.post_process_generation(
    gen_text, task=prompt, image_size=(image.width, image.height)
  ) for gen_text, prompt in zip(generated_texts, prompts)
]

print(parsed_answers[0])
print(parsed_answers[1])

Tasks

This model is capable of performing different tasks through changing the prompts.

First, let's define a function to run a prompt.

Click to expand
import requests

from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForCausalLM 


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForCausalLM.from_pretrained(
  "mrhendrey/Florence-2-large-ft-safetensors",
  torch_dtype=torch_dtype,
  trust_remote_code=True,
  use_safetensors=True
).to(device)
processor = AutoProcessor.from_pretrained(
  "mrhendrey/Florence-2-large-ft-safetensors", trust_remote_code=True
)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

def run_example(task_prompt, text_input=None):
    if text_input is None:
        prompt = task_prompt
    else:
        prompt = task_prompt + text_input
    inputs = processor(
      text=prompt, images=image, return_tensors="pt"
    ).to(device, torch_dtype)
    generated_ids = model.generate(
      input_ids=inputs["input_ids"],
      pixel_values=inputs["pixel_values"],
      max_new_tokens=1024,
      num_beams=3
    )
    generated_text = processor.batch_decode(
      generated_ids, skip_special_tokens=False
    )[0]

    parsed_answer = processor.post_process_generation(
      generated_text, task=task_prompt, image_size=(image.width, image.height)
    )

    print(parsed_answer)

Here are the tasks Florence-2 could perform:

Click to expand

Caption

prompt = "<CAPTION>"
run_example(prompt)

Detailed Caption

prompt = "<DETAILED_CAPTION>"
run_example(prompt)

More Detailed Caption

prompt = "<MORE_DETAILED_CAPTION>"
run_example(prompt)

Caption to Phrase Grounding

caption to phrase grounding task requires additional text input, i.e. caption.

Caption to phrase grounding results format: {'<CAPTION_TO_PHRASE_GROUNDING>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['', '', ...]}}

task_prompt = "<CAPTION_TO_PHRASE_GROUNDING>"
results = run_example(
  task_prompt, text_input="A green car parked in front of a yellow building."
)

Object Detection

OD results format: {'<OD>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['label1', 'label2', ...]} }

prompt = "<OD>"
run_example(prompt)

Dense Region Caption

Dense region caption results format: {'<DENSE_REGION_CAPTION>' : {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['label1', 'label2', ...]} }

prompt = "<DENSE_REGION_CAPTION>"
run_example(prompt)

Region proposal

Dense region caption results format: {'<REGION_PROPOSAL>': {'bboxes': [[x1, y1, x2, y2], ...], 'labels': ['', '', ...]}}

prompt = "<REGION_PROPOSAL>"
run_example(prompt)

OCR

prompt = "<OCR>"
run_example(prompt)

OCR with Region

OCR with region output format: {'<OCR_WITH_REGION>': {'quad_boxes': [[x1, y1, x2, y2, x3, y3, x4, y4], ...], 'labels': ['text1', ...]}}

prompt = "<OCR_WITH_REGION>"
run_example(prompt)

for More detailed examples, please refer to notebook

Benchmarks

Florence-2 Zero-shot performance

The following table presents the zero-shot performance of generalist vision foundation models on image captioning and object detection evaluation tasks. These models have not been exposed to the training data of the evaluation tasks during their training phase.

Method #params COCO Cap. test CIDEr NoCaps val CIDEr TextCaps val CIDEr COCO Det. val2017 mAP
Flamingo 80B 84.3 - - -
Florence-2-base 0.23B 133.0 118.7 70.1 34.7
Florence-2-large 0.77B 135.6 120.8 72.8 37.5

The following table continues the comparison with performance on other vision-language evaluation tasks.

Method Flickr30k test R@1 Refcoco val Accuracy Refcoco test-A Accuracy Refcoco test-B Accuracy Refcoco+ val Accuracy Refcoco+ test-A Accuracy Refcoco+ test-B Accuracy Refcocog val Accuracy Refcocog test Accuracy Refcoco RES val mIoU
Kosmos-2 78.7 52.3 57.4 47.3 45.5 50.7 42.2 60.6 61.7 -
Florence-2-base 83.6 53.9 58.4 49.7 51.5 56.4 47.9 66.3 65.1 34.6
Florence-2-large 84.4 56.3 61.6 51.4 53.6 57.9 49.9 68.0 67.0 35.8

Florence-2 finetuned performance

We finetune Florence-2 models with a collection of downstream tasks, resulting two generalist models Florence-2-base-ft and Florence-2-large-ft that can conduct a wide range of downstream tasks.

The table below compares the performance of specialist and generalist models on various captioning and Visual Question Answering (VQA) tasks. Specialist models are fine-tuned specifically for each task, whereas generalist models are fine-tuned in a task-agnostic manner across all tasks. The symbol "â–²" indicates the usage of external OCR as input.

Method # Params COCO Caption Karpathy test CIDEr NoCaps val CIDEr TextCaps val CIDEr VQAv2 test-dev Acc TextVQA test-dev Acc VizWiz VQA test-dev Acc
Specialist Models
CoCa 2.1B 143.6 122.4 - 82.3 - -
BLIP-2 7.8B 144.5 121.6 - 82.2 - -
GIT2 5.1B 145.0 126.9 148.6 81.7 67.3 71.0
Flamingo 80B 138.1 - - 82.0 54.1 65.7
PaLI 17B 149.1 127.0 160.0â–² 84.3 58.8 / 73.1â–² 71.6 / 74.4â–²
PaLI-X 55B 149.2 126.3 147.0 / 163.7â–² 86.0 71.4 / 80.8â–² 70.9 / 74.6â–²
Generalist Models
Unified-IO 2.9B - 100.0 - 77.9 - 57.4
Florence-2-base-ft 0.23B 140.0 116.7 143.9 79.7 63.6 63.6
Florence-2-large-ft 0.77B 143.3 124.9 151.1 81.7 73.5 72.6
Method # Params COCO Det. val2017 mAP Flickr30k test R@1 RefCOCO val Accuracy RefCOCO test-A Accuracy RefCOCO test-B Accuracy RefCOCO+ val Accuracy RefCOCO+ test-A Accuracy RefCOCO+ test-B Accuracy RefCOCOg val Accuracy RefCOCOg test Accuracy RefCOCO RES val mIoU
Specialist Models
SeqTR - - - 83.7 86.5 81.2 71.5 76.3 64.9 74.9 74.2 -
PolyFormer - - - 90.4 92.9 87.2 85.0 89.8 78.0 85.8 85.9 76.9
UNINEXT 0.74B 60.6 - 92.6 94.3 91.5 85.2 89.6 79.8 88.7 89.4 -
Ferret 13B - - 89.5 92.4 84.4 82.8 88.1 75.2 85.8 86.3 -
Generalist Models
UniTAB - - - 88.6 91.1 83.8 81.0 85.4 71.6 84.6 84.7 -
Florence-2-base-ft 0.23B 41.4 84.0 92.6 94.8 91.5 86.8 91.7 82.2 89.8 82.2 78.0
Florence-2-large-ft 0.77B 43.4 85.2 93.4 95.3 92.0 88.3 92.9 83.6 91.2 91.7 80.5

BibTex and citation info

@article{xiao2023florence,
  title={Florence-2: Advancing a unified representation for a variety of vision tasks},
  author={Xiao, Bin and Wu, Haiping and Xu, Weijian and Dai, Xiyang and Hu, Houdong and Lu, Yumao and Zeng, Michael and Liu, Ce and Yuan, Lu},
  journal={arXiv preprint arXiv:2311.06242},
  year={2023}
}
Downloads last month
12
Safetensors
Model size
770M params
Tensor type
FP16
·
Inference API
Unable to determine this model's library. Check the docs .