pix2struct-base-table2html

Turn table images into HTML!

Demo app

Try the demo app, which combines table detection and table recognition!

About

This model takes an image of a table and outputs HTML. It parses the image, performing both optical character recognition (OCR) and table structure recognition, and emits the result as an HTML table.

The model expects an image containing only a table. If the table is embedded in a document, first use a table detection model to extract it (e.g. Microsoft's Table Transformer model).
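The cropping step can be sketched as follows. This is a minimal sketch, not part of the model: it assumes the detector returns a pixel bounding box as (x_min, y_min, x_max, y_max), and both the helper name `pad_box` and the default padding of 10 pixels are hypothetical choices made for illustration.

```python
import math


def pad_box(box, image_size, padding=10):
    """Expand a detected table box by `padding` pixels, clamped to the image.

    box:        (x_min, y_min, x_max, y_max) in pixels (assumed detector output)
    image_size: (width, height) of the full page image
    """
    x_min, y_min, x_max, y_max = box
    width, height = image_size
    left = max(0, math.floor(x_min) - padding)
    top = max(0, math.floor(y_min) - padding)
    right = min(width, math.ceil(x_max) + padding)
    bottom = min(height, math.ceil(y_max) + padding)
    if right <= left or bottom <= top:
        raise ValueError("box does not overlap the image")
    return left, top, right, bottom


print(pad_box((50, 20, 150, 80), (200, 100)))
```

With Pillow, the returned tuple can be passed directly to `Image.crop`, for example `table_image = page_image.crop(pad_box(box, page_image.size))`, where `page_image` and `box` are placeholders for the document image and the detected table box.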

The model is fine-tuned from the Pix2Struct base model using a max_patch_length of 1024 (passed to the processor as max_patches=1024 in the example below) and a maximum generation length of 1024 tokens. The patch limit should probably stay at 1024 for inference, while the generation length (max_new_tokens) can be changed.
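To give a feel for what a 1024-patch budget means, here is a rough sketch of how an aspect-ratio-preserving patch grid can be chosen under such a budget. It approximates the resizing rule used by Pix2Struct-style image processors and is not the library's own code; the function name `patch_grid` and the 16-pixel patch size are assumptions.

```python
import math


def patch_grid(image_width, image_height, max_patches=1024, patch_size=16):
    """Return (rows, cols) of a patch grid that fits within `max_patches`
    while roughly preserving the image's aspect ratio."""
    # Scale factor such that rows * cols is approximately max_patches
    scale = math.sqrt(
        max_patches * (patch_size / image_height) * (patch_size / image_width)
    )
    rows = max(min(math.floor(scale * image_height / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * image_width / patch_size), max_patches), 1)
    return rows, cols


rows, cols = patch_grid(640, 480)
print(rows, cols, rows * cols)  # 27 36 972
```

Under these assumptions the image is resized to fit the grid, so a very large or very dense table is effectively downscaled to the same patch budget, which may make small text harder to read.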

The model has been trained using two datasets: MMTab and PubTabNet.

Usage

Below is a complete example of loading the model and performing inference on an example table image (example from the MMTab dataset):

import torch
from transformers import AutoProcessor, Pix2StructForConditionalGeneration
from PIL import Image
import requests
from io import BytesIO

# Load model and processor
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained("KennethTM/pix2struct-base-table2html")
model = Pix2StructForConditionalGeneration.from_pretrained("KennethTM/pix2struct-base-table2html")
model.to(device)
model.eval()

# Load example image from URL
url = "https://huggingface.co/KennethTM/pix2struct-base-table2html/resolve/main/example_recog_1.jpg"
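response = requests.get(url, timeout=30)
response.raise_for_status()  # Fail early if the download did not succeed
image = Image.open(BytesIO(response.content)).convert("RGB")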

# Run model inference
encoding = processor(image, return_tensors="pt", max_patches=1024)
with torch.inference_mode():
    flattened_patches = encoding.pop("flattened_patches").to(device)
    attention_mask = encoding.pop("attention_mask").to(device)
    predictions = model.generate(flattened_patches=flattened_patches, attention_mask=attention_mask, max_new_tokens=1024)

predictions_decoded = processor.tokenizer.batch_decode(predictions, skip_special_tokens=True)

# Show predictions as text
print(predictions_decoded[0])

Example image:

Model HTML output for example image:

<table border="1" cellspacing="0">
 <tr>
  <th>
   Rank
  </th>
  <th>
   Lane
  </th>
  <th>
   Name
  </th>
  <th>
   Nationality
  </th>
  <th>
   Time
  </th>
  <th>
   Notes
  </th>
 </tr>
 <tr>
  <td>
  </td>
  <td>
   4
  </td>
  <td>
   Michael Phelps
  </td>
  <td>
   United States
  </td>
  <td>
   51.25
  </td>
  <td>
   OR
  </td>
 </tr>
 <tr>
  <td>
  </td>
  <td>
   3
  </td>
  <td>
   Ian Crocker
  </td>
  <td>
   United States
  </td>
  <td>
   51.29
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
  </td>
  <td>
   5
  </td>
  <td>
   Andriy Serdinov
  </td>
  <td>
   Ukraine
  </td>
  <td>
   51.36
  </td>
  <td>
   EU
  </td>
 </tr>
 <tr>
  <td>
   4
  </td>
  <td>
   1
  </td>
  <td>
   Thomas Rupprath
  </td>
  <td>
   Germany
  </td>
  <td>
   52.27
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   5
  </td>
  <td>
   6
  </td>
  <td>
   Igor Marchenko
  </td>
  <td>
   Russia
  </td>
  <td>
   52.32
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   6
  </td>
  <td>
   2
  </td>
  <td>
   Gabriel Mangabeira
  </td>
  <td>
   Brazil
  </td>
  <td>
   52.34
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   7
  </td>
  <td>
   8
  </td>
  <td>
   Duje Draganja
  </td>
  <td>
   Croatia
  </td>
  <td>
   52.46
  </td>
  <td>
  </td>
 </tr>
 <tr>
  <td>
   8
  </td>
  <td>
   7
  </td>
  <td>
   Geoff Huegill
  </td>
  <td>
   Australia
  </td>
  <td>
   52.56
  </td>
  <td>
  </td>
 </tr>
</table>

And the rendered HTML table:

Rank	Lane	Name	Nationality	Time	Notes
	4	Michael Phelps	United States	51.25	OR
	3	Ian Crocker	United States	51.29
	5	Andriy Serdinov	Ukraine	51.36	EU
4	1	Thomas Rupprath	Germany	52.27
5	6	Igor Marchenko	Russia	52.32
6	2	Gabriel Mangabeira	Brazil	52.34
7	8	Duje Draganja	Croatia	52.46
8	7	Geoff Huegill	Australia	52.56
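For downstream processing, the generated HTML can be converted into plain rows of cell text. The following is a minimal sketch using only the Python standard library; the names `TableRows` and `html_table_to_rows` are hypothetical, and the sketch ignores colspan and rowspan attributes and assumes a single, non-nested table.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # list of text fragments while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse the whitespace and newlines the model emits around text
            text = " ".join("".join(self._cell).split())
            self.rows[-1].append(text)
            self._cell = None


def html_table_to_rows(html):
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = "<table><tr><th>\n Rank\n</th><th>Lane</th></tr><tr><td>\n</td><td>4</td></tr></table>"
print(html_table_to_rows(sample))  # [['Rank', 'Lane'], ['', '4']]
```

Empty cells are kept as empty strings so that every row has the same number of columns as the header, which matters for rows like the first three in the example above where the Rank cell is blank.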
