metadata

license: mit
datasets:
  - LemeExploreNau/VeraCruz
language:
  - pt
metrics:
  - accuracy
tags:
  - Portuguese
  - Brazilian
  - Language Classification

PeroVazPT-BR Classifier

Model Description

The PeroVazPT-BR Classifier classifies text as either European Portuguese (PT) or Brazilian Portuguese (BR).

This model is a fine-tuned version of prajjwal1/bert-tiny on the VeraCruz Dataset, a collection of text samples in both variants of Portuguese. It was trained on a total of 500,000 examples, evenly split between European Portuguese and Brazilian Portuguese, ensuring a balanced representation of both language variants.
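As a rough illustration of how a fine-tuned classifier like this one might be used, here is a minimal sketch built on the Transformers `pipeline` API. The Hub id in `MODEL_ID` is a placeholder and not the model's confirmed repository name. The helper names `load_classifier`, `classify`, and `demo` are hypothetical. The label strings the model returns (for example `LABEL_0`/`LABEL_1` versus `PT`/`BR`) depend on its configuration and are not stated in this card.

```python
from typing import Callable, Dict, List

# Placeholder Hub id (assumption): replace with this model's actual repository name.
MODEL_ID = "your-namespace/PeroVazPT-BR-Classifier"


def load_classifier(model_id: str = MODEL_ID):
    """Build a text-classification pipeline (downloads the model on first use)."""
    # Imported lazily so that classify() below has no hard dependency.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def classify(texts: List[str], classifier: Callable) -> List[Dict]:
    """Run the classifier and pair each input text with its top prediction."""
    predictions = classifier(texts)
    return [
        {"text": text, "label": pred["label"], "score": pred["score"]}
        for text, pred in zip(texts, predictions)
    ]


def demo() -> None:
    """Classify one European and one Brazilian Portuguese sentence."""
    clf = load_classifier()
    samples = [
        "O autocarro chegou atrasado à paragem.",  # European Portuguese wording
        "O ônibus chegou atrasado no ponto.",      # Brazilian Portuguese wording
    ]
    for row in classify(samples, clf):
        print(row)
```

Calling `demo()` requires network access to fetch the model. `classify` accepts any callable that returns a list of `{"label", "score"}` dictionaries, so it can be exercised with a stub.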

It achieves the following results on an evaluation set of 50,000 examples:

Loss: 0.1791
Accuracy: 0.9461
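To put the accuracy figure in concrete terms, the short calculation below converts it into approximate counts on the 50,000-example evaluation set. The reported accuracy is rounded to four decimals, so the counts are approximate.

```python
EVAL_SIZE = 50_000   # evaluation examples, as reported above
ACCURACY = 0.9461    # reported accuracy (rounded)

correct = round(EVAL_SIZE * ACCURACY)  # examples classified correctly
errors = EVAL_SIZE - correct           # examples misclassified
error_rate = errors / EVAL_SIZE

print(f"correct: {correct}, errors: {errors}, error rate: {error_rate:.4f}")
# prints "correct: 47305, errors: 2695, error rate: 0.0539"
```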

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 256
eval_batch_size: 256
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
steps: 2500
mixed_precision_training: Native AMP
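For readers who want to reproduce a similar setup, the values above can be mapped onto `transformers.TrainingArguments` keyword names roughly as follows. This is a sketch, not the author's training script. It assumes a single device with no gradient accumulation, so the per-device batch size equals the listed batch size. It also assumes that "Native AMP" corresponds to `fp16=True`. The output directory name is hypothetical.

```python
# Listed hyperparameters expressed as TrainingArguments keyword arguments.
TRAINING_KWARGS = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 256,  # assumption: one device, no gradient accumulation
    "per_device_eval_batch_size": 256,
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "max_steps": 2500,
    "fp16": True,  # assumption: "Native AMP" means fp16 mixed precision (needs a CUDA GPU)
}


def build_training_args(output_dir: str = "perovaz-checkpoints"):
    """Create TrainingArguments from the kwargs above (output_dir is hypothetical)."""
    # Imported lazily so the dictionary can be inspected without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```

The resulting object would then be passed to a `Trainer` together with the model, tokenizer, and datasets, none of which are shown here.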

Training results

Training Loss	Epoch	Step	Validation Loss	Accuracy
0.4772	0.06	500	0.2501	0.9080
0.3412	0.13	1000	0.2275	0.9135
0.3122	0.19	1500	0.2578	0.9014
0.2975	0.25	2000	0.1992	0.9396
0.2877	0.31	2500	0.1791	0.9461

Framework versions

Transformers 4.40.0.dev0
PyTorch 2.2.1
Datasets 2.18.0
Tokenizers 0.15.2