# ukr-t5-small
A compact model based on mT5-small, fine-tuned for Ukrainian language tasks while retaining baseline English understanding from its multilingual pretraining.
## Model Description
- Base Model: mT5-small
- Fine-tuning Data: Leipzig Corpora Collection (English & Ukrainian news from 2023)
- Tasks:
  - Text summarization (Ukrainian)
  - Text generation (Ukrainian)
  - Other Ukrainian-centric NLP tasks
## Technical Details
- Model Size: 300 MB
- Framework: Transformers (Hugging Face)
## Usage

### Installation

```bash
pip install transformers
```
### Loading the Model

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("path/to/ukr-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("path/to/ukr-t5-small")
```
### Example: Summarization

```python
text = "(Text in Ukrainian here)"

# Tokenize the input with the "summarize:" task prefix
inputs = tokenizer("summarize: " + text, return_tensors="pt", max_length=512, truncation=True)

# Generate a summary with beam search
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=128)

# Decode the generated token IDs back into text
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)
```
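### Example: Text Generation (sketch)

Text generation is also listed among the supported tasks. Below is a minimal, hedged sketch of open-ended generation; the prompt format is an assumption, since this card does not document a generation prefix:

```python
# Hedged sketch: the bare-prompt format is an assumption, as the card
# does not specify how generation inputs were formatted during fine-tuning.
prompt = "(Ukrainian prompt here)"

inputs = tokenizer(prompt, return_tensors="pt", truncation=True)

# Sampling tends to suit open-ended generation better than beam search
output_ids = model.generate(
    inputs["input_ids"],
    max_length=128,
    do_sample=True,
    top_p=0.95,
    temperature=0.8,
)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```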
## Limitations
- The model's focus is on Ukrainian text processing, so performance on purely English tasks may be below that of general T5-small models.
- Further fine-tuning may be required for optimal results on specific NLP tasks; a minimal fine-tuning sketch follows this list.
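If you do need further fine-tuning, a minimal sketch using Hugging Face's `Seq2SeqTrainer` might look like the following. The dataset file, column names, task prefix, and hyperparameters are all illustrative assumptions rather than values from this card, and the sketch additionally requires the `datasets` package:

```python
# Hypothetical fine-tuning sketch; the dataset path, column names, and
# hyperparameters below are illustrative assumptions, not part of this card.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("path/to/ukr-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("path/to/ukr-t5-small")

# Assumes a JSON file with "text" and "summary" fields per record
dataset = load_dataset("json", data_files="train.json")["train"]

def preprocess(batch):
    model_inputs = tokenizer(
        ["summarize: " + t for t in batch["text"]],
        max_length=512, truncation=True,
    )
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="ukr-t5-small-finetuned",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=3e-4,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    # Pads inputs and labels dynamically per batch
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```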
## Dataset Credits

This model was fine-tuned on the Leipzig Corpora Collection (English and Ukrainian news corpora, 2023). For full licensing and usage information for the original dataset, please refer to the Leipzig Corpora Collection website.
## Ethical Considerations
- NLP models can reflect biases present in their training data. Be mindful of this when using this model for applications that have real-world impact.
- It's important to test this model thoroughly across a variety of Ukrainian language samples to evaluate its reliability and fairness; a minimal evaluation sketch is shown below.
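As one way to do this, here is a hedged evaluation sketch using the Hugging Face `evaluate` library with the chrF metric, whose character-level matching behaves reasonably for morphologically rich languages such as Ukrainian. The test data shown is a placeholder, and the sketch assumes the `evaluate` and `sacrebleu` packages are installed:

```python
# Hedged evaluation sketch; the predictions and references below are
# placeholders to be replaced with a real held-out Ukrainian test set.
import evaluate

chrf = evaluate.load("chrf")

predictions = ["(model output here)"]
references = [["(reference summary here)"]]  # one list of references per prediction

results = chrf.compute(predictions=predictions, references=references)
print(results["score"])
```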