metadata

license: apache-2.0
base_model:
  - bertin-project/filiberto-124M
library_name: transformers
language:
  - es
pipeline_tag: text-generation
tags:
  - OCR
  - text-correction
  - ocr-correction
  - archives
  - GPT2
  - history
  - SLM
  - pre-train
  - drama

Filiberto 124M OCR is a small, specialized model for correcting the OCR output of Spanish Golden Age dramas, based on the Filiberto 124M foundation model for Spanish Golden Age drama.

Filiberto 124M OCR has only 124 million parameters. It runs easily on a CPU and can provide correction at scale on GPUs (>10k tokens/second).

Training

The training data is a collection of individual verses paired with their corrections, taken from the TEXORO corpus through a collaboration with ETSO, totalling ~5 million tokens.

Pre-training ran for 5 epochs with levanter (500 steps in total, each processing 1024 sequences of 512 tokens) on a TPUv4-32 and took 15 minutes.
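
The step and sequence figures above imply a token budget that can be worked out directly. The sketch below only multiplies the numbers stated in this card. It counts token positions, which may include padding or repeated data, so it should be read as an upper bound rather than a measured figure.

```python
# Figures as stated in this model card
steps = 500
sequences_per_step = 1024
tokens_per_sequence = 512
minutes = 15

# Token positions seen in one optimizer step and over the whole run
tokens_per_step = sequences_per_step * tokens_per_sequence
total_positions = steps * tokens_per_step

# Average rate over the stated wall-clock time
positions_per_second = total_positions / (minutes * 60)

print(f"{tokens_per_step:,} token positions per step")
print(f"{total_positions:,} token positions over the run")
print(f"{positions_per_second:,.0f} token positions per second")
```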

Tokenization is currently done with the GPT-2 tokenizer.

Example of OCR correction

Filiberto 124M OCR has been pre-trained on an instruction dataset with a hard-coded structure: ### Text ### introduces the submitted OCRized text and ### Correction ### introduces the generated correction.

Filiberto 124M OCR can be loaded like any GPT-2-like model:
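
As a minimal sketch of that structure, the two markers can be kept in one place and reused for both building the prompt and extracting the answer. The helper names `build_prompt` and `extract_correction` are hypothetical and are not part of the model or of transformers. The whitespace between the markers follows the inference snippet in this card and is assumed to match the training format.

```python
# Marker strings as described in this model card
TEXT_HEADER = "### Text ###"
CORRECTION_HEADER = "### Correction ###"


def build_prompt(ocr_text: str) -> str:
    """Wrap OCRized text in the prompt structure (hypothetical helper)."""
    # Spacing mirrors the card's inference snippet (assumption: same as training)
    return f"{TEXT_HEADER}\n{ocr_text}\n\n\n{CORRECTION_HEADER}\n"


def extract_correction(generated: str) -> str:
    """Keep only the text after the last correction marker (hypothetical helper)."""
    return generated.split(CORRECTION_HEADER)[-1].strip()


prompt = build_prompt("Como venis?")
print(repr(prompt))
```

Keeping the markers as constants avoids a mismatch between the prompt that is sent and the string used to split the generated text.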

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained model and tokenizer
model_name = "bertin-project/filiberto-124M-ocr"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Set the device to GPU if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

Inference can then be run like this:

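# Function to generate a correction for an OCRized text
def ocr_correction(text, max_new_tokens=600):
    # Wrap the OCRized text in the prompt structure used during training
    prompt = f"### Text ###\n{text}\n\n\n### Correction ###\n"
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

    # Generate the correction (no gradients are needed for inference)
    # Note: top_k only affects decoding when sampling is enabled (do_sample=True)
    with torch.no_grad():
        output = model.generate(input_ids,
                                max_new_tokens=max_new_tokens,
                                pad_token_id=tokenizer.eos_token_id,
                                top_k=50)

    # Decode and keep only the text after the last "### Correction ###" marker
    decoded = tokenizer.decode(output[0], skip_special_tokens=True)
    return decoded.split("### Correction ###")[-1].strip()

# OCRized text to correct
ocr_text = "Otra vez, Don Iuan, me dad,\ny otras mil vezes los braços."
ocr_result = ocr_correction(ocr_text)
print(ocr_result)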
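
The training data consisted of individual verses in sequences of 512 tokens, so a full play probably needs to be split into short passages before correction. The sketch below groups non-empty lines into small chunks. The helper name `chunk_verses` and the default of 12 lines per chunk are assumptions chosen for illustration, not values taken from the training setup.

```python
from typing import List


def chunk_verses(text: str, max_lines: int = 12) -> List[str]:
    """Split text into chunks of at most max_lines non-empty lines (hypothetical helper)."""
    # Drop blank lines so each chunk contains only verse lines
    lines = [line for line in text.splitlines() if line.strip()]
    # Slice the line list into consecutive groups and rejoin each group
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]


sample = "Otra vez, Don Iuan, me dad,\ny otras mil vezes los braços.\n\nComo venis?"
for chunk in chunk_verses(sample, max_lines=2):
    print(chunk)
    print("---")
```

Each chunk could then be passed to a correction function such as the `ocr_correction` function shown in this card, and the results joined with newlines.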

The following excerpt from an OCRized drama:

Otra vez, Don Iuan, me dad,
y otras mil vezes los braços.
Otra, y otras mil sean lazos
de nuestra antigua amistad.
Como venis?
Yo me siento
tan alegre, tan vfano,
tan venturoso, tan vano,
que no podrà el pensamiento
encareceros jamàs
las venturas que posseo,
porque el pensamiento creo

would yield this result:

Otra vez, Don Iuan, me dad,
y otras mil vezes los braços.
Otra, y otras mil sean lazos
de nuestra antigua amistad.
Como venis?
Yo me siento
tan alegre, tan vfano,
tan venturoso, tan vano,
que no podrà el pensamiento
encareceros jamàs
las venturas que posseo,
porque el pensamiento creo
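
A correction can differ from its input by only a few characters, or not at all as in the excerpt above, so it helps to list exactly which lines changed. The sketch below compares the two texts line by line. The helper name `changed_lines` is hypothetical, the comparison assumes the model keeps the same line breaks as the input, and the "corrected" string in the usage example is a made-up illustration, not actual model output.

```python
from typing import List, Tuple


def changed_lines(ocr_text: str, corrected_text: str) -> List[Tuple[int, str, str]]:
    """Return (line number, before, after) for each line that differs (hypothetical helper)."""
    ocr_lines = ocr_text.splitlines()
    corrected_lines = corrected_text.splitlines()
    changes = []
    # zip stops at the shorter text, so extra trailing lines are ignored
    for number, (before, after) in enumerate(zip(ocr_lines, corrected_lines), start=1):
        if before != after:
            changes.append((number, before, after))
    return changes


ocr = "tan alegre, tan vfano,\ntan venturoso, tan vano,"
# Hypothetical correction written by hand for illustration, NOT model output
hypothetical = "tan alegre, tan ufano,\ntan venturoso, tan vano,"

for number, before, after in changed_lines(ocr, hypothetical):
    print(f"line {number}: {before!r} -> {after!r}")
```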