Cantonese-Written Chinese Translation Model
This model is a fine-tuned version of fnlp/bart-base-chinese, trained on the Cantonese-Written Chinese Dataset Gen2. It achieves the following results on the evaluation set:
- Loss: 1.5413
- Bleu: 40.7808
- Chrf: 42.5628
- Gen Len: 13.2556
Model description
The model is based on the Chinese BART model (fnlp/bart-base-chinese) and was trained on 1M examples from a Cantonese-Written Chinese parallel corpus.
Intended uses & limitations
Its intended use is to translate Cantonese sentences to Written Chinese accurately.
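A minimal inference sketch, assuming the checkpoint can be loaded through the Transformers Auto classes. The `model_id` argument is a placeholder for wherever the checkpoint is stored (this card does not name it), and `load` and `translate` are hypothetical helper names. Only `input_ids` and `attention_mask` are passed to `generate`, because the BERT-style tokenizer used by fnlp/bart-base-chinese also returns `token_type_ids`, which BART's `generate` may reject. Removing spaces from the decoded text is an assumption that suits pure Chinese output; it would also strip spaces from any embedded Latin text.

```python
def load(model_id):
    """Load model and tokenizer; model_id is a placeholder path or hub ID."""
    # Imported lazily so translate() can be reused without importing Transformers.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    model.eval()  # inference mode: disables dropout
    return model, tokenizer


def translate(sentences, model, tokenizer, max_new_tokens=64):
    """Translate a list of Cantonese sentences into Written Chinese."""
    inputs = tokenizer(
        sentences, return_tensors="pt", padding=True, truncation=True
    )
    # Pass only the tensors BART expects; token_type_ids is left out on purpose.
    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
    )
    texts = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    # BERT-style decoding separates characters with spaces; drop them (assumption).
    return [text.replace(" ", "") for text in texts]
```

Typical use would be `model, tokenizer = load("path/to/checkpoint")` followed by `translate(["你食咗飯未呀?"], model, tokenizer)`, where the path is again a placeholder.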
Training and evaluation data
The training and evaluation data come from the Cantonese-Written Chinese Dataset Gen2.
Training procedure
The training was performed using Seq2SeqTrainer.
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 10
- mixed_precision_training: Native AMP
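As a rough sketch, the values above could be expressed as keyword arguments for `Seq2SeqTrainingArguments`. The mapping makes some assumptions that the card does not confirm: training on a single device (so `train_batch_size` equals the per-device batch size), `fp16=True` for Native AMP, and `predict_with_generate=True` because BLEU, chrF and generation length are reported. The `linear_lr` helper is hypothetical and illustrates a linear schedule under the assumption of no warmup, since none is listed.

```python
# Hyperparameters from the list above, keyed by Seq2SeqTrainingArguments names.
training_kwargs = {
    "learning_rate": 2e-05,
    "per_device_train_batch_size": 8,  # assumes a single training device
    "per_device_eval_batch_size": 8,   # assumes a single evaluation device
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-08,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 10,
    "fp16": True,                      # assumed mapping for "Native AMP"
    "predict_with_generate": True,     # assumption: needed for BLEU/chrF at eval
}


def linear_lr(step, total_steps, base_lr=2e-05):
    """Learning rate under linear decay to zero, assuming no warmup steps."""
    remaining = max(0.0, (total_steps - step) / total_steps)
    return base_lr * remaining
```

With Transformers installed, the dictionary could be passed as `Seq2SeqTrainingArguments(output_dir="out", **training_kwargs)`, where `"out"` is a placeholder directory.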
Training results
Training Loss | Epoch | Step | Validation Loss | Bleu | Chrf | Gen Len |
---|---|---|---|---|---|---|
0.2275 | 0.05 | 5000 | 1.5256 | 40.6521 | 42.475 | 13.2277 |
0.1752 | 0.1 | 10000 | 1.5413 | 40.7808 | 42.5628 | 13.2556 |
0.1533 | 0.15 | 15000 | 1.5938 | 40.7698 | 42.5348 | 13.2678 |
0.1442 | 0.2 | 20000 | 1.6487 | 40.6062 | 42.353 | 13.2602 |
0.1317 | 0.24 | 25000 | 1.7148 | 40.569 | 42.2753 | 13.2798 |
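The table can be read programmatically to see which checkpoint the headline metrics correspond to. The sketch below copies the rows by hand (the field names are illustrative) and compares checkpoints by BLEU, chrF and validation loss. It also checks the trend in both losses: training loss keeps falling while validation loss keeps rising, which may indicate overfitting early in training.

```python
FIELDS = (
    "training_loss", "epoch", "step",
    "validation_loss", "bleu", "chrf", "gen_len",
)

# Rows copied from the training results table.
rows = [
    (0.2275, 0.05, 5000, 1.5256, 40.6521, 42.475, 13.2277),
    (0.1752, 0.10, 10000, 1.5413, 40.7808, 42.5628, 13.2556),
    (0.1533, 0.15, 15000, 1.5938, 40.7698, 42.5348, 13.2678),
    (0.1442, 0.20, 20000, 1.6487, 40.6062, 42.353, 13.2602),
    (0.1317, 0.24, 25000, 1.7148, 40.569, 42.2753, 13.2798),
]
records = [dict(zip(FIELDS, row)) for row in rows]

# Select checkpoints by each criterion.
best_by_bleu = max(records, key=lambda r: r["bleu"])
best_by_chrf = max(records, key=lambda r: r["chrf"])
best_by_loss = min(records, key=lambda r: r["validation_loss"])

# Check whether each loss moves monotonically across the logged steps.
train_losses = [r["training_loss"] for r in records]
val_losses = [r["validation_loss"] for r in records]
train_loss_falling = all(a > b for a, b in zip(train_losses, train_losses[1:]))
val_loss_rising = all(a < b for a, b in zip(val_losses, val_losses[1:]))

print("best BLEU at step", best_by_bleu["step"])
print("best chrF at step", best_by_chrf["step"])
print("lowest validation loss at step", best_by_loss["step"])
```

The step with the best BLEU and chrF matches the results reported at the top of this card, while the lowest validation loss occurs one checkpoint earlier.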
Framework versions
- Transformers 4.28.1
- Pytorch 2.3.1+cu121
- Datasets 2.19.1
- Tokenizers 0.13.3