---
license: gpl-3.0
tags:
- DocVQA
- Document Question Answering
- Document Visual Question Answering
datasets:
- rubentito/sp-docvqa
language:
- en
---

# VT5 base fine-tuned on SP-DocVQA

This is VT5 base fine-tuned on the [Single-Page DocVQA](https://arxiv.org/abs/2007.00398) (SP-DocVQA) dataset using the [MP-DocVQA framework](https://github.com/rubenpt91/MP-DocVQA-Framework).

VT5 is a version of the Hi-VT5 model described in the [MP-DocVQA paper](https://arxiv.org/abs/2212.05935), arranged in a non-hierarchical paradigm (using only one page per question-answer pair). Before fine-tuning, we start from the pre-trained [t5-base](https://huggingface.co/t5-base) for the language backbone and the pre-trained [DiT-base](https://huggingface.co/microsoft/dit-base-finetuned-rvlcdip) to embed visual features, which we keep frozen during the fine-tuning phase.

Please note that VT5 is not integrated into Hugging Face, so you must use the [MP-DocVQA framework](https://github.com/rubenpt91/MP-DocVQA-Framework) (WIP) or the [PFL-DocVQA competition framework](https://github.com/rubenpt91/PFL-DocVQA-Competition) to run it.

This method is the base architecture for the PFL-DocVQA Competition, which will take place from the 1st of July to the 1st of November, 2023. If you are interested in Federated Learning and Differential Privacy, we invite you to have a look at the [PFL-DocVQA Challenge and Competition](https://github.com/rubenpt91/PFL-DocVQA-Competition) held on these topics.
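The backbone setup described above (a frozen visual encoder feeding a trainable language model) can be sketched as follows. This is an illustrative sketch only, not the framework's actual code: the small `nn.Sequential` module is a hypothetical stand-in for DiT-base, which in practice would be loaded with `transformers`' `AutoModel.from_pretrained("microsoft/dit-base-finetuned-rvlcdip")` alongside `t5-base`.

```python
import torch
from torch import nn

# Hypothetical stand-in for the DiT-base visual encoder; in the real framework
# this is a pre-trained transformer loaded from the Hugging Face Hub.
visual_encoder = nn.Sequential(
    nn.Linear(768, 768),  # 768 matches the hidden size of both DiT-base and t5-base
    nn.GELU(),
    nn.Linear(768, 768),
)

# Keep the visual backbone frozen during fine-tuning, as the model card describes;
# only the language backbone receives gradient updates.
for param in visual_encoder.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in visual_encoder.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in visual_encoder.parameters() if not p.requires_grad)
print(f"trainable visual params: {trainable}, frozen visual params: {frozen}")
```

With this pattern, an optimizer built over the full model simply skips the frozen visual parameters, so the visual features stay fixed while the language backbone is fine-tuned.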