GreekLegalRoBERTa_v3
A Greek legal version of the RoBERTa pre-trained language model.
Pre-training corpora
The pre-training corpora of GreekLegalRoBERTa_v3 include:
- The entire corpus of Greek legislation, as published by the National Publication Office.
- The Greek Parliament Proceedings (Greekparl).
- The entire corpus of EU legislation (Greek translation), as published in Eur-Lex.
- The Greek part of Wikipedia.
- The Greek part of European Parliament Proceedings Parallel Corpus.
- The Greek part of OSCAR, a cleansed version of Common Crawl.
- The Raptarchis.
Pre-training details
- We developed the code with Hugging Face's Transformers. It is published in the AI-team-UoA GitHub repository (https://github.com/AI-team-UoA/GreekLegalRoBERTa).
- We release a model for Greek legislative applications that is similar to the English FacebookAI/roberta-base model (12-layer, 768-hidden, 12-heads, 125M parameters).
- We trained for 100k steps with a batch size of 4096 sequences of length 512 and an initial learning rate of 6e-4.
- We pretrained our models using 4 V100 GPUs provided by the Cyprus Research Institute. We would like to express our sincere gratitude to the Cyprus Research Institute for providing us with access to Cyclone. Without its support, this work would not have been possible.
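As a rough sanity check on the training budget listed above, the step count, batch size, and sequence length can be multiplied out. This is only a back-of-the-envelope sketch: it assumes every sequence is filled to the full 512 tokens, so the result is an upper bound, and the variable names are illustrative rather than taken from the training code.

```python
# Figures taken from the pre-training details above
training_steps = 100_000
batch_size = 4096        # sequences per step
sequence_length = 512    # tokens per sequence (assumed fully packed, no padding)

tokens_per_step = batch_size * sequence_length
total_tokens = training_steps * tokens_per_step

print(f"Tokens per step: {tokens_per_step:,}")
print(f"Upper bound on tokens processed during training: {total_tokens:,}")
```

Under that assumption each step covers about 2.1 million tokens, and the full run about 210 billion.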
Requirements
pip install torch
pip install tokenizers
pip install transformers[torch]
pip install datasets
Load Pretrained Model
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("AI-team-UoA/GreekLegalRoBERTa_v3")
model = AutoModel.from_pretrained("AI-team-UoA/GreekLegalRoBERTa_v3")
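The bare encoder loaded this way returns one hidden vector per token. A common way to turn those into a single vector per sentence is to average them while ignoring padding. The sketch below assumes mean pooling, which this model card does not prescribe, and the helper names `mean_pool` and `embed` are hypothetical.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sentence, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


def embed(texts, tokenizer, model):
    """Hypothetical helper: encode a list of strings into one vector each."""
    batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**batch)
    return mean_pool(outputs.last_hidden_state, batch["attention_mask"])
```

With the `tokenizer` and `model` objects loaded above, `embed(["Είναι ένας άνθρωπος."], tokenizer, model)` should return a tensor of shape `(1, 768)`, given the 768-dimensional hidden size listed in the pre-training details.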
Use Pretrained Model as a Language Model
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline
# Load the tokenizer and the model with its masked-language-modeling head (once)
tokenizer_greek = AutoTokenizer.from_pretrained('AI-team-UoA/GreekLegalRoBERTa_v3')
lm_model_greek = AutoModelForMaskedLM.from_pretrained('AI-team-UoA/GreekLegalRoBERTa_v3')
# Wrap both in a fill-mask pipeline
unmasker = pipeline("fill-mask", model=lm_model_greek, tokenizer=tokenizer_greek)
# ================ EXAMPLE 1 ================
print("================ EXAMPLE 1 ================")
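text_1 = ' O Δικηγορος κατεθεσε ένα <mask> .'
# EN: 'The lawyer submitted a <mask>.'
# Optional direct forward pass: logits of shape (1, sequence_length, vocab_size)
input_ids = tokenizer_greek.encode(text_1)
outputs = lm_model_greek(torch.tensor([input_ids]))[0]
# Query the pipeline once and print its top-5 predictions for the masked token
predictions = unmasker(text_1, top_k=5)
for i, prediction in enumerate(predictions):
    print("Model's answer " + str(i + 1) + " : " + prediction['token_str'])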
#================ EXAMPLE 1 ================
#Model's answer 1 : letter
#Model's answer 2 : copy
#Model's answer 3 : record
#Model's answer 4 : memorandum
#Model's answer 5 : diagram
# ================ EXAMPLE 2 ================
print("================ EXAMPLE 2 ================")
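text_2 = 'Είναι ένας <mask> άνθρωπος.'
# EN: 'He is a <mask> person.'
# Optional direct forward pass: logits of shape (1, sequence_length, vocab_size)
input_ids = tokenizer_greek.encode(text_2)
outputs = lm_model_greek(torch.tensor([input_ids]))[0]
# Query the pipeline once and print its top-5 predictions for the masked token
predictions = unmasker(text_2, top_k=5)
for i, prediction in enumerate(predictions):
    print("Model's answer " + str(i + 1) + " : " + prediction['token_str'])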
#================ EXAMPLE 2 ================
#Model's answer 1 : new
#Model's answer 2 : capable
#Model's answer 3 : simple
#Model's answer 4 : serious
#Model's answer 5 : small
# ================ EXAMPLE 3 ================
print("================ EXAMPLE 3 ================")
text_3 = 'Είναι ένας <mask> άνθρωπος και κάνει συχνά <mask>.'
# EN: 'He is a <mask> person and he frequently does <mask>.'
# With two <mask> tokens the pipeline returns one list of predictions per mask
first_mask, second_mask = unmasker(text_3, top_k=5)
for i in range(5):
    print("Model's answer " + str(i + 1) + " : " + first_mask[i]['token_str'] + " , " + second_mask[i]['token_str'])
#================ EXAMPLE 3 ================
#Model's answer 1 : simple, trips
#Model's answer 2 : new, vacations
#Model's answer 3 : small, visits
#Model's answer 4 : good, mistakes
#Model's answer 5 : serious, actions
# the most plausible prediction for the second <mask> is "trips"
# ================ EXAMPLE 4 ================
print("================ EXAMPLE 4 ================")
text_4 = ' Kαθορισμός τρόπου αξιολόγησης της επιμελείς των υπαλλήλων που παρακολουθούν προγράμματα επιμόρφωσης και <mask> .'
# EN: 'Determining how to evaluate the diligence of employees attending edification and <mask> programs.'
# Query the pipeline once and print its top-5 predictions for the masked token
predictions = unmasker(text_4, top_k=5)
for i, prediction in enumerate(predictions):
    print("Model's answer " + str(i + 1) + " : " + prediction['token_str'])
#================ EXAMPLE 4 ================
#Model's answer 1 : retraining
#Model's answer 2 : specialization
#Model's answer 3 : training
#Model's answer 4 : education
#Model's answer 5 : Retraining
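The examples above index the pipeline output differently depending on how many `<mask>` tokens the input contains: a single mask yields a flat list of candidate dictionaries, while several masks yield one such list per mask. A small helper can hide that difference. The function `top_tokens` below is a hypothetical convenience, not part of Transformers, and the sample data is made up to mirror the shape of the pipeline output.

```python
def top_tokens(result):
    """Return one list of predicted token strings per <mask>.

    `result` is what a fill-mask pipeline call returns: a flat list of dicts
    for a single <mask>, or a list of such lists when the input has several.
    """
    if result and isinstance(result[0], dict):
        result = [result]  # single mask: wrap so both cases look alike
    # Strip the leading space that RoBERTa-style tokenizers often attach to tokens
    return [[candidate["token_str"].strip() for candidate in per_mask] for per_mask in result]


# Stand-in data shaped like pipeline output (tokens and scores are invented)
single = [{"token_str": " a", "score": 0.6}, {"token_str": " b", "score": 0.3}]
double = [single, [{"token_str": " c", "score": 0.9}, {"token_str": " d", "score": 0.05}]]

print(top_tokens(single))  # [['a', 'b']]
print(top_tokens(double))  # [['a', 'b'], ['c', 'd']]
```

With the `unmasker` pipeline defined earlier, `top_tokens(unmasker(text, top_k=5))` would then work for inputs with one or several masks.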
Evaluation on downstream tasks
For detailed results read the article:
TODO
Author