The presented model can be used for text de-noising, for example when text extracted from PDF files contains noise.
The model was trained on Polish texts. The training dataset was noised automatically. allegro/plt5-base was used as the base model.
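The exact noising procedure used to build the training data is not described in this card. Purely as an illustration, automatic character-level noising along the lines of the example below (random case flips, inserted junk characters and extra spaces) could look like this hypothetical sketch:

import random

# Hypothetical illustration only: the actual noising procedure used for this
# model is not described here.
NOISE_CHARS = "|!@#$%^&*-"

def add_noise(text: str, p: float = 0.15, seed: int = 0) -> str:
    """Randomly flip letter case and insert junk characters or extra spaces."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        # randomly flip the case of letters
        if ch.isalpha() and rng.random() < p:
            ch = ch.swapcase()
        out.append(ch)
        # randomly insert a junk character or an extra space after it
        if rng.random() < p:
            out.append(rng.choice(NOISE_CHARS + " "))
    return "".join(out)

print(add_noise("Astronomia jest jedną z najstarszych nauk."))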
Model input
The model input must be preceded by the tag denoise:
For example, if you have the text:
As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u k.
then input to the model must be constructed as follows:
denoise: As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u k.
Sample model usage
from transformers import T5ForConditionalGeneration, T5Tokenizer


def do_inference(text, model, tokenizer):
    # Prepend the required "denoise:" tag to the raw input text
    input_text = f"denoise: {text}"
    inputs = tokenizer.encode(
        input_text,
        return_tensors="pt",
        max_length=256,
        padding="max_length",
        truncation=True,
    )
    # Generate the de-noised text with beam search
    corrected_ids = model.generate(
        inputs,
        max_length=256,
        num_beams=5,
        early_stopping=True,
    )
    corrected_sentence = tokenizer.decode(corrected_ids[0], skip_special_tokens=True)
    return corrected_sentence


model = T5ForConditionalGeneration.from_pretrained("radlab/polish-denoiser-t5-base")
tokenizer = T5Tokenizer.from_pretrained("radlab/polish-denoiser-t5-base")

text_str = "As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u k."

print(do_inference(text_str, model, tokenizer))
The model response for the input:
denoise: As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u k.
is:
Astronomia jest jedną z najstarszych nauk.
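The same result can also be reproduced with the generic transformers text2text-generation pipeline instead of calling the model directly. This is an alternative sketch, assuming the checkpoint works with the default pipeline settings; the "denoise:" prefix is still required.

from transformers import pipeline

# Alternative sketch using the generic text2text-generation pipeline;
# the prompt still has to carry the "denoise:" prefix.
denoiser = pipeline("text2text-generation", model="radlab/polish-denoiser-t5-base")

noisy = "As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u k."
result = denoiser(f"denoise: {noisy}", max_length=256, num_beams=5)
print(result[0]["generated_text"])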
Evaluation
More information (in Polish) is available on our blog.