@ucsahin on Hugging Face: "Florence-2 has a great capability of detecting various objects in a zero-shot…"

ucsahin

posted an update Jun 25

Florence-2 is very capable of detecting various objects in a zero-shot setting with the task prompt "<OD>". However, if you want to detect specific objects that the base model cannot detect in its current form, you can easily finetune it for that task. Below I show how to finetune the model to detect tables in a given image, but a similar process can be applied to detect other objects. Thanks to @andito, @merve, and @SkalskiP for sharing the fix for finetuning the Florence-2 model. Please also check their great blog post at https://huggingface.co/blog/finetune-florence2.

Colab notebook: https://colab.research.google.com/drive/1Y8GVjwzBIgfmfD3ZypDX5H1JA_VG0YDL?usp=sharing
Finetuned model: ucsahin/Florence-2-large-TableDetection

danelcsb

Jun 26

Hi @ucsahin, I think it would be great to add the multi-class scenario. The code currently accepts only one class, which is table.

To enable multi-class detection, you can simply change the loop so that it looks up each box's class name:

    for (cat, bbox) in zip(categories, bboxes):
        bbox_str += f"{class_list[cat]}"
        bbox = bbox.copy()
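
For context, here is a minimal, self-contained sketch of what the full label-building step might look like with multiple classes. It is not the notebook's exact code: the `class_list` contents, the `build_label` and `quantize` helpers, and the `[x_min, y_min, x_max, y_max]` pixel box layout are assumptions made for illustration. It relies on Florence-2 expressing box coordinates as location tokens from `<loc_0>` to `<loc_999>`.

```python
# Hypothetical class names; the list index is the category id in the dataset.
class_list = ["table", "figure"]

NUM_BINS = 1000  # Florence-2 location tokens run from <loc_0> to <loc_999>


def quantize(value, size):
    """Map a pixel coordinate to a location bin in [0, NUM_BINS - 1]."""
    return min(max(int(value * NUM_BINS / size), 0), NUM_BINS - 1)


def build_label(categories, bboxes, width, height):
    """Build the target string for one image.

    Assumes each bbox is [x_min, y_min, x_max, y_max] in pixels.
    """
    label = ""
    for cat, bbox in zip(categories, bboxes):
        x1, y1, x2, y2 = bbox
        # Class name first, then the four quantized corner coordinates.
        label += class_list[cat]
        label += (
            f"<loc_{quantize(x1, width)}><loc_{quantize(y1, height)}>"
            f"<loc_{quantize(x2, width)}><loc_{quantize(y2, height)}>"
        )
    return label


example = build_label(
    categories=[0, 1],
    bboxes=[[100, 50, 400, 300], [0, 0, 800, 600]],
    width=800,
    height=600,
)
print(example)
# table<loc_125><loc_83><loc_500><loc_500>figure<loc_0><loc_0><loc_999><loc_999>
```

If your dataset stores boxes as `[x, y, width, height]`, convert them to corner coordinates before calling a helper like this.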

ucsahin

Jun 26

Thank you for clarifying and sharing the update to the code. I have also added a discussion to the Colab notebook for multi-class object detection.

maddosaientisuto

Jul 23

Hi @ucsahin,
Thank you for sharing this. I have a question: I am trying to use this model to detect tables in a document, and I have observed that when a page contains no tables, the model still predicts tables on it. How can I work around this? Is there a confidence threshold that I can set?

ucsahin

Jul 23

Thanks for your comment. Did you check if the model prediction actually resembles a table area (such as text and figure regions that are separated from the dense text area)? I cannot really tell without seeing what kind of documents you are working with. Please also note that although the fine-tuned model's performance is good in table detection (in my own experiments), it can still be further improved by training with a more comprehensive table detection dataset. What I suggest is as follows:

1. Try to add negative samples during fine-tuning. What I mean by negative samples is document images without tables in them. Prepare the intended labels accordingly, such as "no table on page," empty model response, etc.
2. As far as the model outputs go, I don't think there is one single parameter that can control the confidence of the output. Instead, try to change the generation parameters such as temperature, top_p, top_k. You can also use beam search instead of the standard greedy decoding.
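
To make the generation suggestion concrete, here is a small sketch of how the decoding settings could be switched. The `generation_settings` helper and the specific values are only illustrative assumptions, not tuned recommendations. The keys themselves (`num_beams`, `do_sample`, `temperature`, `top_p`, `top_k`, `max_new_tokens`, `early_stopping`) are standard arguments of `generate` in transformers.

```python
def generation_settings(strategy="beam"):
    """Return keyword arguments for model.generate (hypothetical helper).

    The values below are starting points to experiment with, not tuned settings.
    """
    common = {"max_new_tokens": 1024}
    if strategy == "greedy":
        # Standard greedy decoding: always pick the most likely next token.
        return {**common, "do_sample": False, "num_beams": 1}
    if strategy == "beam":
        # Beam search: keep several candidate sequences and return the best one.
        return {**common, "do_sample": False, "num_beams": 3, "early_stopping": True}
    if strategy == "sample":
        # Sampling: temperature, top_p and top_k shape the token distribution.
        return {**common, "do_sample": True, "temperature": 0.7, "top_p": 0.9, "top_k": 50}
    raise ValueError(f"unknown strategy: {strategy}")


# Usage, assuming `model` and `inputs` were created as in the notebook:
# generated_ids = model.generate(
#     input_ids=inputs["input_ids"],
#     pixel_values=inputs["pixel_values"],
#     **generation_settings("beam"),
# )
print(generation_settings("beam"))
```

None of these settings acts as a confidence threshold. They only change which output sequence is chosen, so it is worth comparing them on a few pages that you know contain no tables.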

Also, if your primary concern is to detect table regions (without doing anything with the table content like VQA, OCR, or information extraction), I suggest you check out table transformer models, which can detect table bounding boxes and recognize table structures. They also generate confidence scores for each of their predictions so that you can have more control over the desired output.
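
As a rough sketch of that alternative, the snippet below runs a Table Transformer detection model and keeps only predictions above a score threshold. The checkpoint name `microsoft/table-transformer-detection`, the helper names, and the threshold values are assumptions chosen for illustration. The heavy imports are placed inside the function so the filtering helper can be tried without downloading a model.

```python
def detect_tables(image, threshold=0.5):
    """Return (score, label, box) tuples for a PIL image.

    Assumes the `microsoft/table-transformer-detection` checkpoint.
    """
    import torch
    from transformers import AutoImageProcessor, TableTransformerForObjectDetection

    checkpoint = "microsoft/table-transformer-detection"
    processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = TableTransformerForObjectDetection.from_pretrained(checkpoint)

    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # PIL gives (width, height); post-processing expects (height, width).
    target_sizes = torch.tensor([image.size[::-1]])
    results = processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes
    )[0]
    return [
        (score.item(), model.config.id2label[label.item()], box.tolist())
        for score, label, box in zip(results["scores"], results["labels"], results["boxes"])
    ]


def filter_detections(detections, min_score):
    """Apply a stricter score cutoff without re-running the model."""
    return [d for d in detections if d[0] >= min_score]


# Usage, assuming `page` is a PIL.Image of a document page:
# detections = detect_tables(page, threshold=0.5)
# confident = filter_detections(detections, min_score=0.9)
# An empty `confident` list then means "no table on this page".
```

A higher cutoff reduces false positives on pages without tables at the cost of possibly missing faint or unusual tables, so the value is best chosen on a small labeled sample of your own documents.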
