HuggingFaceTB/SmolVLM-Instruct · Best option for DocVQA->JSON

Hello,

First, thank you very much to the team for the work and the model provided.
The accuracy for the size, the inference speed and the ease of use are great! It will be of great help for workflows where a huge number of images needs to be processed.

I would like to seek your expertise on a specific use case I am looking at (and I am sure I am not the only one trying to solve this problem).
I would like to use SmolVLM for OCR-VQA and DocVQA on a wide variety of text-heavy documents such as quality reports, forms, bills, etc. (typical industry documents, sometimes with their own terminology), and I need the model to produce JSON output for further processing down the line.

The model follows the prompt quite well (the prompt already includes the expected schema). However, it sometimes modifies the schema at its convenience, and sometimes it simply does not get the right value for a key (lack of understanding).
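
To make these two failure modes concrete, here is a minimal sketch of how the raw model output could be checked against the expected schema. The schema keys and the `check_output` helper are hypothetical and only for illustration; they are not part of SmolVLM or any library. The sketch uses only the Python standard library.

```python
import json

# Hypothetical schema for illustration: key -> accepted Python type(s).
EXPECTED_SCHEMA = {
    "invoice_number": str,
    "date": str,
    "total": (int, float),
}


def check_output(raw_text, schema=EXPECTED_SCHEMA):
    """Return a list of problems found in the model's raw text output."""
    # Keep only the outermost {...} span, in case the model wraps the
    # JSON in extra text.
    start, end = raw_text.find("{"), raw_text.rfind("}")
    if start == -1 or end <= start:
        return ["no JSON object found"]
    try:
        data = json.loads(raw_text[start:end + 1])
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    problems = []
    # Schema drift: keys the model dropped or invented.
    for key in sorted(schema.keys() - data.keys()):
        problems.append(f"missing key: {key}")
    for key in sorted(data.keys() - schema.keys()):
        problems.append(f"unexpected key: {key}")
    # Type check on the keys that are present in both.
    for key in sorted(schema.keys() & data.keys()):
        if not isinstance(data[key], schema[key]):
            problems.append(f"wrong type for key: {key}")
    return problems
```

A check like this only catches schema drift (missing, extra or wrongly typed keys). It cannot tell whether a value is the right one for its key, so the second failure mode still needs labeled examples to measure.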

I am currently looking at fine-tuning on CORD, reusing this notebook: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/PaliGemma/Fine_tune_PaliGemma_for_image_%3EJSON.ipynb

But I wanted to check whether you have any advice on how to go further, or any alternative ideas? (I am also thinking about distillation / teacher-student training, using Pixtral for example.)

Cheers !