Cantonese-Written Chinese Translation Model

This model is a fine-tuned version of fnlp/bart-base-chinese, trained on the Cantonese-Written Chinese Dataset Gen2. It achieves the following results on the evaluation set:

  • Loss: 1.5413
  • Bleu: 40.7808
  • Chrf: 42.5628
  • Gen Len: 13.2556

Model description

The model is based on the BART Chinese model (fnlp/bart-base-chinese) and was trained on 1M examples of Cantonese-Written Chinese parallel corpus data.

Intended uses & limitations

The model is intended for translating Cantonese sentences into Written Chinese.
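A minimal inference sketch follows, to illustrate the intended use. It assumes the checkpoint is published as raptorkwok/cantonese-chinese-translation-gen1 (the repository name shown on the model page), that it ships standard Transformers weights, and that it uses a BERT-style vocabulary like its base model fnlp/bart-base-chinese. The helper name `translate` and the generation settings (`max_new_tokens`, `num_beams`) are illustrative choices, not values taken from this card.

```python
MODEL_ID = "raptorkwok/cantonese-chinese-translation-gen1"  # assumed repository name


def translate(sentences, model_id=MODEL_ID, max_new_tokens=64, num_beams=4):
    """Translate a list of Cantonese sentences into Written Chinese."""
    # Imported lazily so that defining this helper does not load the libraries.
    import torch
    from transformers import BartForConditionalGeneration, BertTokenizer

    # Assumption: like fnlp/bart-base-chinese, the checkpoint uses a
    # BERT-style vocabulary, so BertTokenizer is loaded explicitly.
    tokenizer = BertTokenizer.from_pretrained(model_id)
    model = BartForConditionalGeneration.from_pretrained(model_id)
    model.eval()

    enc = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        # Pass only input_ids and attention_mask: BertTokenizer also returns
        # token_type_ids, which BART's generate() does not accept.
        output_ids = model.generate(
            input_ids=enc["input_ids"],
            attention_mask=enc["attention_mask"],
            max_new_tokens=max_new_tokens,
            num_beams=num_beams,
        )

    decoded = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    # BertTokenizer joins tokens with spaces when decoding; strip them for
    # Chinese text (this also removes any spaces that were intended).
    return [text.replace(" ", "") for text in decoded]
```

Calling `translate(["我哋聽日去邊度食飯?"])` would download the checkpoint on first use and return a list with one Written Chinese string per input sentence.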

Training and evaluation data

Training and evaluation data come from the Cantonese-Written Chinese Dataset Gen2.

Training procedure

Training was performed with the Seq2SeqTrainer class from the Transformers library.

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 10
  • mixed_precision_training: Native AMP
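The hyperparameters listed above map onto `Seq2SeqTrainingArguments` roughly as sketched here. This is a reconstruction under assumptions, not the author's training script: it assumes a single device (so the reported batch sizes are per-device values), and `predict_with_generate=True` and the `output_dir` name are additions made for illustration. The helper name `build_training_args` is hypothetical.

```python
# Values copied from the hyperparameter list; key names follow
# transformers.Seq2SeqTrainingArguments.
HYPERPARAMS = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 8,  # assumes training on a single device
    "per_device_eval_batch_size": 8,
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 10,
    "fp16": True,  # "Native AMP" mixed precision; requires a CUDA device
}


def build_training_args(output_dir="bart-cantonese-written-chinese", **overrides):
    """Build Seq2SeqTrainingArguments from the hyperparameters above."""
    # Imported lazily so the dictionary can be inspected without Transformers.
    from transformers import Seq2SeqTrainingArguments

    # Assumption: generation-based evaluation is enabled so that BLEU and
    # chrF can be computed on decoded text; the card does not state this.
    kwargs = {**HYPERPARAMS, "predict_with_generate": True, **overrides}
    return Seq2SeqTrainingArguments(output_dir=output_dir, **kwargs)
```

On a machine without a GPU, `build_training_args(fp16=False)` avoids the mixed-precision requirement.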

Training results

Training Loss   Epoch   Step    Validation Loss   Bleu      Chrf      Gen Len
0.2275          0.05    5000    1.5256            40.6521   42.475    13.2277
0.1752          0.1     10000   1.5413            40.7808   42.5628   13.2556
0.1533          0.15    15000   1.5938            40.7698   42.5348   13.2678
0.1442          0.2     20000   1.6487            40.6062   42.353    13.2602
0.1317          0.24    25000   1.7148            40.569    42.2753   13.2798
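The evaluation results reported at the top of this card coincide with the step-10000 row, which has the highest BLEU and chrF in the table. The short sketch below reproduces that selection from the logged values; the variable names are illustrative, and whether the author selected the checkpoint this way is an assumption.

```python
# Rows copied from the training results table.
FIELDS = ("training_loss", "epoch", "step", "validation_loss", "bleu", "chrf", "gen_len")
ROWS = [
    (0.2275, 0.05, 5000, 1.5256, 40.6521, 42.475, 13.2277),
    (0.1752, 0.1, 10000, 1.5413, 40.7808, 42.5628, 13.2556),
    (0.1533, 0.15, 15000, 1.5938, 40.7698, 42.5348, 13.2678),
    (0.1442, 0.2, 20000, 1.6487, 40.6062, 42.353, 13.2602),
    (0.1317, 0.24, 25000, 1.7148, 40.569, 42.2753, 13.2798),
]
records = [dict(zip(FIELDS, row)) for row in ROWS]

# Checkpoint with the highest BLEU versus the one with the lowest validation loss.
best_bleu = max(records, key=lambda r: r["bleu"])
lowest_loss = min(records, key=lambda r: r["validation_loss"])

print(best_bleu["step"], best_bleu["bleu"])                # 10000 40.7808
print(lowest_loss["step"], lowest_loss["validation_loss"])  # 5000 1.5256
```

The two criteria pick different checkpoints: validation loss rises after step 5000 while BLEU and chrF peak at step 10000.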

Framework versions

  • Transformers 4.28.1
  • Pytorch 2.3.1+cu121
  • Datasets 2.19.1
  • Tokenizers 0.13.3

Dataset used to train raptorkwok/cantonese-chinese-translation-gen1