Unifying Vision, Text, and Layout for Universal Document Processing (CVPR 2023 Highlight)
Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal
Open Source Checklist:
- Release Model (Encoder + Text decoder)
- Release Most Scripts
- Vision Decoder / Weights (due to ethical concerns around fake document generation, we plan to release this functionality as an Azure API)
- Demo
Introduction
UDOP unifies vision, text, and layout through a Vision-Text-Layout Transformer and unified generative pretraining tasks, including vision, text, layout, and mixed tasks. We show the task prompts (left) and task targets (right) for all self-supervised objectives (joint text-layout reconstruction, visual text recognition, layout modeling, and masked autoencoding) and two example supervised objectives (question answering and layout analysis).
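To make the prompt/target format concrete, here is a minimal sketch of how a layout-modeling example could be composed. It is illustrative only: the sentinel token `<layout_0>`, the `<loc_*>` tokens, and the quantization scheme are assumptions, not the repo's actual preprocessing code.

```python
# Illustrative sketch (not the repo's preprocessing code): composing a
# prompt/target pair for the layout-modeling objective. Token names and the
# number of location bins are assumptions.

def quantize_bbox(bbox, size, n_bins=500):
    """Map a pixel bounding box (x0, y0, x1, y1) to discrete location tokens."""
    w, h = size
    x0, y0, x1, y1 = bbox
    return [
        f"<loc_{int(n_bins * x0 / w)}>",
        f"<loc_{int(n_bins * y0 / h)}>",
        f"<loc_{int(n_bins * x1 / w)}>",
        f"<loc_{int(n_bins * y1 / h)}>",
    ]

words = ["Invoice", "Date:", "2022-12-06"]
boxes = [(20, 14, 90, 30), (20, 40, 62, 56), (70, 40, 160, 56)]
page_size = (600, 800)

# Mask the layout of the second word and ask the model to reconstruct it.
prompt = "Layout Modeling. " + " ".join(
    w if i != 1 else f"<layout_0> {w} </layout_0>" for i, w in enumerate(words)
)
target = "<layout_0> " + " ".join(quantize_bbox(boxes[1], page_size))

print(prompt)
print(target)
```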
Install
Set up the Python environment
conda create -n UDOP python=3.8  # You can also use a different environment setup.
Install other dependencies
pip install -r requirements.txt
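After installation, a quick sanity check such as the sketch below can confirm the core dependencies import correctly; it assumes PyTorch and Hugging Face Transformers are among the packages in requirements.txt.

```python
# Minimal post-install sanity check; assumes PyTorch and Hugging Face
# Transformers are listed in requirements.txt.
import torch
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
```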
Run Scripts
Switch the model type by passing one of the following flags:
--model_type "UdopDual"
--model_type "UdopUnimodel"
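A hypothetical sketch of how a training script could map this flag to a model class is shown below; the class names are placeholders, not confirmed identifiers from this repository.

```python
# Hypothetical sketch: mapping --model_type to a model class inside a training
# script. The class names below are placeholders, not confirmed identifiers.
import argparse

MODEL_REGISTRY = {
    "UdopDual": "UdopDualForConditionalGeneration",
    "UdopUnimodel": "UdopUnimodelForConditionalGeneration",
}

parser = argparse.ArgumentParser()
parser.add_argument("--model_type", choices=sorted(MODEL_REGISTRY), default="UdopUnimodel")
args = parser.parse_args()
print("Selected model class:", MODEL_REGISTRY[args.model_type])
```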
Finetuning on RVL-CDIP
Download RVL-CDIP first and change the dataset path in the script. For OCR, you might need to customize your own code; see the sketch after the command below.
bash scripts/finetune_rvlcdip.sh # Finetuning on RVLCDIP
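One possible way to produce word-level text and bounding boxes for RVL-CDIP page images is Tesseract via pytesseract, sketched below. This is not the repo's bundled OCR pipeline; any OCR engine that yields word-level boxes works, and the example path is a placeholder.

```python
# Sketch: extract words and pixel bounding boxes from a document image with
# pytesseract (one OCR option among many; not the repo's bundled pipeline).
from PIL import Image
import pytesseract
from pytesseract import Output

def ocr_words_and_boxes(image_path):
    image = Image.open(image_path).convert("RGB")
    data = pytesseract.image_to_data(image, output_type=Output.DICT)
    words, boxes = [], []
    for text, x, y, w, h in zip(
        data["text"], data["left"], data["top"], data["width"], data["height"]
    ):
        if text.strip():  # skip empty OCR cells
            words.append(text)
            boxes.append((x, y, x + w, y + h))  # (x0, y0, x1, y1) in pixels
    return words, boxes, image.size

words, boxes, size = ocr_words_and_boxes("path/to/rvlcdip_page.png")  # placeholder path
```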
Finetuning on the DUE Benchmark
Download the DUE Benchmark and follow its procedure to preprocess the data.
The training code adapted to our framework is hosted in benchmarker; run it with:
bash scripts/finetune_duebenchmark.sh # Finetuning on the DUE Benchmark; switch tasks by changing the dataset path
The generated outputs can be evaluated with the DUE Benchmark's due_evaluator.
Model Checkpoints
The model checkpoints are hosted on the Hugging Face Hub.
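A hedged sketch for fetching the released checkpoints with the huggingface_hub library is below; the repo_id is a placeholder, so substitute the actual Hub repository linked above.

```python
# Sketch: download the checkpoint files with huggingface_hub.
# The repo_id below is a placeholder; replace it with the actual Hub repository.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="<org-or-user>/<udop-checkpoint-repo>")
print("Checkpoints downloaded to:", local_dir)
```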
Citation
@article{tang2022unifying,
title={Unifying Vision, Text, and Layout for Universal Document Processing},
author={Tang, Zineng and Yang, Ziyi and Wang, Guoxin and Fang, Yuwei and Liu, Yang and Zhu, Chenguang and Zeng, Michael and Zhang, Cha and Bansal, Mohit},
journal={arXiv preprint arXiv:2212.02623},
year={2022}
}
Contact
Zineng Tang (zn.tang.terran@gmail.com)