(BERT base) Language modeling in the legal domain in Portuguese
legal-bert-base-cased-ptbr is a language model for the Portuguese legal domain, based on BERTimbau base and trained with a masked language modeling (MLM) objective.
The model is intended to assist NLP research in the legal field, computer law and legal technology applications. Several legal texts in Portuguese were used (more information below).
A large version of the model will be available soon.
Pre-training corpora
The pre-training corpora of legal-bert-base-cased-ptbr include:
- 61309 - Documentos juridicos diversos | (Miscellaneous legal documents)
- 751 - Petições (Recurso Extraordinário JEC) | (Petitions)
- 682 - Sentenças | (Judgments)
- 498 - Acordãos 2º Instancia | (Second-instance appellate decisions)
- 469 - Agravos Recurso extraordinário | (Interlocutory appeals in extraordinary appeal proceedings)
- 411 - Despacho de Admissibilidade | (Admissibility orders)
The data was provided by the Brazilian Supreme Federal Tribunal under the terms of use: LREC 2020.
The results of this project do not in any way represent the position of the Brazilian Supreme Federal Tribunal; they are the sole and exclusive responsibility of the model's author.
Load Pretrained Model
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the base encoder
tokenizer = AutoTokenizer.from_pretrained("dominguesm/legal-bert-base-cased-ptbr")
model = AutoModel.from_pretrained("dominguesm/legal-bert-base-cased-ptbr")

# Alternatively, load a ready-made fill-mask pipeline
from transformers import pipeline
pipe = pipeline("fill-mask", model="dominguesm/legal-bert-base-cased-ptbr")
Use legal-bert-base-cased-ptbr variants as Language Models
Text | Masked token | Predictions |
---|---|---|
De ordem, a Secretaria Judiciária do Supremo Tribunal Federal INTIMA a parte abaixo identificada, ou quem as suas vezes fizer, do inteiro teor do(a) despacho/decisão presente nos autos (art. 270 do Código de Processo [MASK] e art 5º da Lei 11.419/2006). | Civil | ('Civil', 0.9999), ('civil', 0.0001), ('Penal', 0.0000), ('eletrônico', 0.0000), ('2015', 0.0000) |
2. INTIMAÇÃO da Autarquia: 2.2 Para que apresente em Juízo, com a contestação, cópia do processo administrativo referente ao benefício [MASK] em discussão na lide | previdenciário | ('ora', 0.9424), ('administrativo', 0.0202), ('doença', 0.0117), ('acidente', 0.0037), ('posto', 0.0036) |
Certifico que, nesta data, os presentes autos foram remetidos ao [MASK] para processar e julgar recurso (Agravo de Instrumento). | STF | ('Tribunal', 0.4278), ('Supremo', 0.1657), ('origem', 0.1538), ('arquivo', 0.1415), ('sistema', 0.0216) |
TEMA: 810. Validade da correção monetária e dos juros moratórios [MASK] sobre as condenações impostas à Fazenda Pública, conforme previstos no art. 1º-F da Lei 9.494/1997, com a redação dada pela Lei 11.960/2009. | incidentes | ('incidentes', 0.9979), ('incidente', 0.0021), ('aplicados', 0.0000), (',', 0.0000), ('aplicada', 0.0000) |
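Rows like the ones in the table above can be produced with the fill-mask pipeline. The following is a minimal sketch, not the author's script: `load_fill_mask` and `top_predictions` are hypothetical helper names introduced here, and the helper assumes the input text contains exactly one `[MASK]` token, so that the pipeline returns a flat list of candidate dictionaries.

```python
from typing import Callable, List, Tuple

MODEL_ID = "dominguesm/legal-bert-base-cased-ptbr"


def load_fill_mask():
    """Build the fill-mask pipeline (downloads the model on first use)."""
    # Imported lazily so top_predictions() can be used with any compatible callable.
    from transformers import pipeline
    return pipeline("fill-mask", model=MODEL_ID)


def top_predictions(fill_mask: Callable, text: str, k: int = 5) -> List[Tuple[str, float]]:
    """Return (token, score) pairs for the single [MASK] in `text`, best first.

    Assumption: `text` has exactly one [MASK]; with several masks the
    pipeline returns a nested list and this helper would need adapting.
    """
    results = fill_mask(text, top_k=k)
    return [(r["token_str"], round(r["score"], 4)) for r in results]
```

A call such as `top_predictions(load_fill_mask(), "... Código de Processo [MASK] ...")` would then yield a list in the same shape as the Predictions column; the exact scores may differ slightly across library versions and hardware.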
Training results
Num examples = 353435
Num Epochs = 3
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 32
Gradient Accumulation steps = 1
Total optimization steps = 33135
TRAIN RESULTS
"epoch": 3.0
"train_loss": 0.6107781137512769
"train_runtime": 10192.1545
"train_samples": 353435
"train_samples_per_second": 104.031
"train_steps_per_second": 3.251
EVAL RESULTS
"epoch": 3.0
"eval_loss": 0.47251805663108826
"eval_runtime": 126.3026
"eval_samples": 17878
"eval_samples_per_second": 141.549
"eval_steps_per_second": 4.426
"perplexity": 1.604028145934512
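Two of the logged figures can be cross-checked from the others. The sketch below assumes that the reported perplexity is the exponential of the evaluation loss and that the last, partially filled batch of each epoch still counts as an optimization step; both assumptions are consistent with the numbers in the log but are not stated in it.

```python
import math

# Values copied from the training log above.
num_examples = 353435
num_epochs = 3
total_train_batch_size = 32
eval_loss = 0.47251805663108826

# Assumption: the final partial batch of each epoch is kept, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / total_train_batch_size)
total_steps = steps_per_epoch * num_epochs

# Perplexity as the exponential of the mean cross-entropy loss (in nats).
perplexity = math.exp(eval_loss)

print(total_steps)           # 33135
print(round(perplexity, 4))  # 1.604
```

Both results agree with the logged `Total optimization steps` and `perplexity` values.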
Citation
@misc{domingues2022legal-bert-base-cased-ptbr,
author = {Domingues, Maicon},
title = {Language Model in the legal domain in Portuguese},
year = {2022},
howpublished = {\url{https://huggingface.co/dominguesm/legal-bert-base-cased-ptbr/}}
}