Best option for DocVQA->JSON

#11
by Truc95 - opened

Hello,

First, thank you very much to the team for the work and the model provided.
The accuracy for the size, the inference speed and the ease of use are great! It will be a great help for workflows where a huge number of images need to be processed.

I would like to seek your expertise on a specific use case I am looking at (and I am sure I am not the only one trying to solve this problem).
I am looking to use SmolVLM for OCRVQA and DocVQA on a wide variety of text-heavy documents such as quality reports, forms, bills, etc. (typical documents found in industry, sometimes with their own terminology), and I would like to get JSON output from the model for further processing down the line.

The model follows the prompt quite well, and the prompt already includes the expected schema. However, it sometimes modifies the schema at its convenience, or sometimes just doesn't get the right value for a key (lack of understanding).
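To make the schema drift measurable, a small check like the following can flag outputs whose keys or types deviate from what the prompt asked for. This is only a minimal sketch using the Python standard library: the schema, the key names and the helper functions `extract_json` and `schema_problems` are hypothetical and not part of SmolVLM or any library.

```python
import json

# Hypothetical schema for illustration only: key -> accepted Python type(s).
EXPECTED_SCHEMA = {
    "invoice_number": str,
    "total": (int, float),
    "items": list,
}


def extract_json(text):
    """Return the first JSON object found in raw model text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores any trailing text after it.
            obj, _ = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        start = text.find("{", start + 1)
    return None


def schema_problems(obj, schema):
    """List missing keys, wrongly typed values and unexpected keys."""
    problems = []
    for key, expected_type in schema.items():
        if key not in obj:
            problems.append(f"missing key: {key}")
        elif not isinstance(obj[key], expected_type):
            problems.append(f"wrong type for {key}: {type(obj[key]).__name__}")
    for key in obj:
        if key not in schema:
            problems.append(f"unexpected key: {key}")
    return problems
```

Running this over a batch of generations gives a rough count of how often the schema is altered, which can also serve as a before/after metric for fine-tuning.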

I am currently looking at fine-tuning on CORD, re-using https://github.com/NielsRogge/Transformers-Tutorials/blob/master/PaliGemma/Fine_tune_PaliGemma_for_image_%3EJSON.ipynb

But I wanted to check whether you have any advice for going further, or any alternative ideas? (I am also thinking of distillation / teacher-student training, using Pixtral for example.)

Cheers !

Hugging Face TB Research org

Hi! For this problem you might want to try using higher resolutions (the default is 1.5k pixels per side, which might be a bit pixelated for some documents).
Also, for OCR, the base model is better than this one; a little fine-tuning on it should bring you super far. For DocVQA, you might want to look at the synthetic model! This sounds like a cool project :D
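For the resolution suggestion above, here is a minimal sketch of how the input size could be raised. It assumes, going by the SmolVLM model card as I understand it, that the processor accepts `size={"longest_edge": N*384}` and that the default of N=4 gives the roughly 1.5k pixels per side mentioned above. The helper names `longest_edge` and `load_processor` are hypothetical, and the model id is left as a parameter.

```python
# Assumption: SmolVLM resolution is set in multiples of 384 pixels,
# with N=4 (1536 px) as the default.
PATCH_SIZE = 384


def longest_edge(n):
    """Longest image side in pixels for a given multiplier n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return n * PATCH_SIZE


def load_processor(model_id, n):
    """Load a processor with a larger input resolution (assumed API usage)."""
    # Imported here so the helper above works without transformers installed.
    from transformers import AutoProcessor

    return AutoProcessor.from_pretrained(
        model_id, size={"longest_edge": longest_edge(n)}
    )
```

For example, `load_processor(model_id, 5)` would request images up to 1920 pixels on the longest side, at the cost of more image tokens and memory.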
