GreekLegalRoBERTa_v3

A Greek legal version of the RoBERTa pre-trained language model.

Pre-training corpora

The pre-training corpora of GreekLegalRoBERTa_v3 include:

The entire corpus of Greek legislation, as published by the National Publication Office.
The Greek Parliament Proceedings (Greekparl).
The entire corpus of EU legislation (Greek translation), as published in Eur-Lex.
The Greek part of Wikipedia.
The Greek part of European Parliament Proceedings Parallel Corpus.
The Greek part of OSCAR, a cleansed version of Common Crawl.
The Raptarchis.

Pre-training details

We developed the code with Hugging Face's Transformers library and published it in the AI-team-UoA GitHub repository (https://github.com/AI-team-UoA/GreekLegalRoBERTa).
We released a model similar to the English FacebookAI/roberta-base (12-layer, 768-hidden, 12-heads, 125M parameters), intended for Greek legislative applications.
We trained for 100k steps with a batch size of 4096 sequences of length 512 and an initial learning rate of 6e-4.
We pre-trained our models on 4 V100 GPUs provided by the Cyprus Research Institute. We would like to express our sincere gratitude to the Cyprus Research Institute for providing us with access to Cyclone. Without their support, this work would not have been possible.
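
To put the schedule above in perspective, the following sketch works out the token budget it implies. It uses only the figures quoted above (100k steps, 4096 sequences per batch, length 512, 4 GPUs). The function name `training_budget` is made up for illustration, and the totals are upper bounds that assume every sequence is filled to the full 512 tokens, which the text does not state.

```python
# Figures quoted in the pre-training details above.
TRAINING_STEPS = 100_000
BATCH_SIZE = 4096      # sequences per optimizer step
MAX_SEQ_LEN = 512      # tokens per sequence
NUM_GPUS = 4


def training_budget(steps, batch_size, seq_len, num_gpus):
    """Return upper-bound counts implied by the schedule.

    Assumption: every sequence is packed to the full `seq_len` tokens.
    """
    tokens_per_step = batch_size * seq_len
    return {
        "tokens_per_step": tokens_per_step,
        "total_tokens": tokens_per_step * steps,
        "sequences_per_gpu_per_step": batch_size // num_gpus,
    }


budget = training_budget(TRAINING_STEPS, BATCH_SIZE, MAX_SEQ_LEN, NUM_GPUS)
for name, value in budget.items():
    print(f"{name}: {value:,}")
```

Under these assumptions the model sees at most about 2.1 million tokens per step and about 210 billion tokens over the whole run, with 1024 sequences per GPU per step. How those 1024 sequences were scheduled on each GPU (for example with gradient accumulation) is not described in the text.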

Requirements

pip install torch
pip install tokenizers
pip install transformers[torch]
pip install datasets

Load Pretrained Model

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("AI-team-UoA/GreekLegalRoBERTa_v3")
model = AutoModel.from_pretrained("AI-team-UoA/GreekLegalRoBERTa_v3")

Use Pretrained Model as a Language Model

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Load the model and tokenizer once, then build a fill-mask pipeline
tokenizer_greek = AutoTokenizer.from_pretrained('AI-team-UoA/GreekLegalRoBERTa_v3')
lm_model_greek = AutoModelForMaskedLM.from_pretrained('AI-team-UoA/GreekLegalRoBERTa_v3')
unmasker = pipeline("fill-mask", model=lm_model_greek, tokenizer=tokenizer_greek)
# ================ EXAMPLE 1 ================
print("================ EXAMPLE 1 ================")
text_1 = ' O Δικηγορος κατεθεσε ένα <mask> .'
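# EN: 'The lawyer submitted a <mask>.'
# Optional raw forward pass (logits); the pipeline below does not need it.
input_ids = tokenizer_greek.encode(text_1)
outputs = lm_model_greek(torch.tensor([input_ids]))[0]
# Query the pipeline once and reuse the ranked predictions.
predictions_1 = unmasker(text_1, top_k=5)
for i, prediction in enumerate(predictions_1, start=1):
  print(f"Model's answer {i} : {prediction['token_str']}")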
#================ EXAMPLE 1 ================
#Model's answer 1 : letter
#Model's answer 2 : copy
#Model's answer 3 : record
#Model's answer 4 : memorandum
#Model's answer 5 : diagram


# ================ EXAMPLE 2 ================
print("================ EXAMPLE 2 ================")

text_2 = 'Είναι ένας <mask> άνθρωπος.'
# EN: 'He is a <mask> person.'
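# Optional raw forward pass (logits); the pipeline below does not need it.
input_ids = tokenizer_greek.encode(text_2)
outputs = lm_model_greek(torch.tensor([input_ids]))[0]
# Query the pipeline once and reuse the ranked predictions.
predictions_2 = unmasker(text_2, top_k=5)
for i, prediction in enumerate(predictions_2, start=1):
  print(f"Model's answer {i} : {prediction['token_str']}")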

#================ EXAMPLE 2 ================
#Model's answer 1 : new
#Model's answer 2 : capable
#Model's answer 3 : simple
#Model's answer 4 : serious
#Model's answer 5 : small


# ================ EXAMPLE 3 ================
print("================ EXAMPLE 3 ================")

text_3 = 'Είναι ένας <mask> άνθρωπος και κάνει συχνά <mask>.'
# EN: 'He is a <mask> person and he frequently does <mask>.'
# With two <mask> tokens the pipeline returns one ranked list per mask.
predictions_3 = unmasker(text_3, top_k=5)
for i in range(5):
  print(f"Model's answer {i+1} : {predictions_3[0][i]['token_str']} , {predictions_3[1][i]['token_str']}")

#================ EXAMPLE 3 ================
#Model's answer 1 : simple, trips
#Model's answer 2 : new, vacations
#Model's answer 3 : small, visits
#Model's answer 4 : good, mistakes
#Model's answer 5 : serious, actions

# the most plausible prediction for the second <mask> is "trips"
# ================ EXAMPLE 4 ================
print("================ EXAMPLE 4 ================")

text_4 = ' Kαθορισμός τρόπου αξιολόγησης της επιμελείς των υπαλλήλων που παρακολουθούν προγράμματα επιμόρφωσης και <mask> .'
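# EN: 'Determining how to evaluate the diligence of employees attending further training and <mask> programs.'
# Query the pipeline once and reuse the ranked predictions.
predictions_4 = unmasker(text_4, top_k=5)
for i, prediction in enumerate(predictions_4, start=1):
  print(f"Model's answer {i} : {prediction['token_str']}")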

#================ EXAMPLE 4 ================
#Model's answer 1 : retraining
#Model's answer 2 : specialization
#Model's answer 3 : training
#Model's answer 4 : education
#Model's answer 5 : Retraining
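
The examples above index the pipeline output in two different ways: with one <mask> the fill-mask pipeline returns a flat list of prediction dictionaries, while with several <mask> tokens it returns one such list per mask. A small helper can hide that difference. The sketch below is only an illustration: `top_tokens` is a hypothetical name, and the sample inputs are hand-written stand-ins with the same shape as the pipeline output, not real model predictions.

```python
def top_tokens(results, k=5):
    """Return one list of predicted token strings per <mask>.

    `results` is expected to look like the output of a fill-mask pipeline:
    a flat list of dicts for a single mask, or a list of such lists when
    the input contains several masks.
    """
    # Single mask: wrap the flat list so both cases share one code path.
    if results and isinstance(results[0], dict):
        results = [results]
    return [
        [prediction["token_str"].strip() for prediction in per_mask[:k]]
        for per_mask in results
    ]


# Hand-written stand-ins that only mimic the output shape.
single_mask = [{"token_str": " a"}, {"token_str": " b"}]
two_masks = [[{"token_str": " a"}], [{"token_str": " b"}]]

print(top_tokens(single_mask))
print(top_tokens(two_masks))
```

With a loaded pipeline, the same helper could be called as `top_tokens(unmasker(text_3, top_k=5))` to get one list of candidates per mask.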

Evaluation on downstream tasks

For detailed results read the article:

TODO
