(BERT large) Language modeling in the legal domain in Portuguese (LeNER-Br)

bert-large-cased-pt-lenerbr is a Portuguese language model for the legal domain. It was fine-tuned on 20/12/2021 in Google Colab from the BERTimbau large model on the LeNER-Br language modeling dataset, using a masked language modeling (MASK) objective.
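
To make the MASK objective concrete, here is a minimal pure-Python sketch of BERT-style token masking. It assumes the recipe from the original BERT setup (15% of tokens selected; of those, 80% replaced by [MASK], 10% by a random token, 10% left unchanged). The exact settings used for this model are not stated here, and the function name `mask_tokens`, the toy vocabulary, and the example sentence are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"
IGNORE_INDEX = -100  # label value conventionally ignored by the loss


def mask_tokens(tokens, vocab, mlm_probability=0.15, seed=None):
    """Return (masked_tokens, labels) following the BERT-style 80/10/10 rule."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() >= mlm_probability:
            # Not selected: keep the token and exclude it from the loss.
            masked.append(token)
            labels.append(IGNORE_INDEX)
            continue
        labels.append(token)  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked.append(MASK_TOKEN)         # 80%: replace with [MASK]
        elif roll < 0.9:
            masked.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            masked.append(token)              # 10%: leave unchanged
    return masked, labels


# Hypothetical example sentence; the sentence itself doubles as a toy vocabulary.
tokens = "O Tribunal de Justiça julgou o recurso".split()
masked, labels = mask_tokens(tokens, vocab=tokens, mlm_probability=0.15, seed=0)
print(masked)
print(labels)
```

Only the selected positions carry a label, so the loss is computed on the masked tokens alone.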

A base version of this model is also available.

Widget & APP

You can test this model in the widget on this page.

Blog post

This language model serves as the starting point for a NER model in the Portuguese judicial domain. You can check the fine-tuned NER model at pierreguillou/ner-bert-large-cased-pt-lenerbr.

All information and links are in this blog post: NLP | Modelos e Web App para Reconhecimento de Entidade Nomeada (NER) no domínio jurídico brasileiro (29/12/2021)

Using the model for inference in production

# install pytorch: check https://pytorch.org/
# !pip install transformers 
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("pierreguillou/bert-large-cased-pt-lenerbr")
model = AutoModelForMaskedLM.from_pretrained("pierreguillou/bert-large-cased-pt-lenerbr")
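
Once the model loads, a quick way to query it is the `fill-mask` pipeline from transformers. The sketch below is one possible wrapper, not the author's own code: `load_fill_mask`, `top_predictions`, and the example sentence are hypothetical names and inputs, and the pipeline is loaded lazily because it downloads the weights on first use.

```python
from typing import Callable, Dict, List

MODEL_ID = "pierreguillou/bert-large-cased-pt-lenerbr"


def load_fill_mask():
    """Build a fill-mask pipeline; downloads the weights on first call."""
    from transformers import pipeline  # imported lazily: heavy dependency
    return pipeline("fill-mask", model=MODEL_ID)


def top_predictions(fill_mask: Callable, text: str, k: int = 5) -> List[Dict]:
    """Return the k most likely fillers for the single [MASK] in `text`."""
    results = fill_mask(text, top_k=k)
    return [{"token": r["token_str"], "score": round(r["score"], 4)} for r in results]


# Usage (requires network access plus the transformers and torch packages):
# fill_mask = load_fill_mask()
# for prediction in top_predictions(fill_mask, "O réu foi condenado pelo [MASK]."):
#     print(prediction)
```

Keeping `top_predictions` independent of the loader makes it easy to swap in another fill-mask model, such as the base version, without changing the calling code.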

Training procedure

Notebook

The fine-tuning notebook (Finetuning_language_model_BERtimbau_LeNER_Br.ipynb) is available on GitHub.

Training results

Num examples = 3227
Num Epochs = 5
Instantaneous batch size per device = 2
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 4
Total optimization steps = 2015
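
The logged values are consistent with one another, as the short calculation below shows. It assumes a single device (which matches a total batch size of 8 = 2 x 4) and assumes that batches which do not fill a complete accumulation window are not counted as an optimization step. That rounding is chosen here because it reproduces the logged total.

```python
import math

num_examples = 3227
num_epochs = 5
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_devices = 1  # assumption: a single Colab GPU

# Examples consumed per optimizer update.
total_train_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices

# Forward/backward passes per epoch, then optimizer updates per epoch.
batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_optimization_steps = updates_per_epoch * num_epochs

print(total_train_batch_size)    # 8
print(batches_per_epoch)         # 1614
print(updates_per_epoch)         # 403
print(total_optimization_steps)  # 2015
```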

Step  Training Loss  Validation Loss
100   1.616700       1.366015
200   1.452000       1.312473
300   1.431100       1.253055
400   1.407500       1.264705
500   1.301900       1.243277
600   1.317800       1.233684
700   1.319100       1.211826
800   1.303800       1.190818
900   1.262800       1.171898
1000  1.235900       1.146275
1100  1.221900       1.149027
1200  1.226200       1.127950
1300  1.201700       1.172729
1400  1.198200       1.145363
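
As a reading aid, the snippet below picks the step with the lowest validation loss from the table above and converts that loss to a perplexity-style figure. It assumes the logged loss is the mean cross-entropy in nats over masked tokens, so the exponentiated value is only a rough indicator and not a true language-model perplexity.

```python
import math

# (step, training loss, validation loss), copied from the table above
log = [
    (100, 1.616700, 1.366015),
    (200, 1.452000, 1.312473),
    (300, 1.431100, 1.253055),
    (400, 1.407500, 1.264705),
    (500, 1.301900, 1.243277),
    (600, 1.317800, 1.233684),
    (700, 1.319100, 1.211826),
    (800, 1.303800, 1.190818),
    (900, 1.262800, 1.171898),
    (1000, 1.235900, 1.146275),
    (1100, 1.221900, 1.149027),
    (1200, 1.226200, 1.127950),
    (1300, 1.201700, 1.172729),
    (1400, 1.198200, 1.145363),
]

# Row with the smallest validation loss.
best_step, _, best_val_loss = min(log, key=lambda row: row[2])
perplexity = math.exp(best_val_loss)

print(best_step, round(best_val_loss, 3), round(perplexity, 2))  # 1200 1.128 3.09
```

The lowest validation loss, reached at step 1200, rounds to 1.128.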

Evaluation results

  • Loss on pierreguillou/lener_br_finetuning_language_model (self-reported): 1.128