electra-hongkongese-base-hk-ws

This model is a fine-tuned version of toastynews/electra-hongkongese-base-discriminator on HKCanCor and CityU for word segmentation.

Model description

Performs word segmentation on text from Hong Kong. There are two versions: hk, trained only on text from Hong Kong, and hkt, trained on text from Hong Kong and Taiwan. Each version is available in base and small model sizes.

Intended uses & limitations

Trained to handle both Hongkongese/Cantonese and Standard Chinese from Hong Kong. Text from other places, and English text, is not segmented as well. The easiest way to use the model is with the CKIP Transformers library.
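
A minimal sketch of loading the model through CKIP Transformers follows. It assumes the ckip-transformers package is installed and that the model is published under the Hub ID toastynews/electra-hongkongese-base-hk-ws (inferred from the model name and the base model's namespace). The helper name segment is hypothetical.

```python
from typing import List

# Assumed Hub ID, inferred from the model name and the base model's namespace.
MODEL_NAME = "toastynews/electra-hongkongese-base-hk-ws"


def segment(texts: List[str]) -> List[List[str]]:
    """Segment each input string into a list of words.

    Hypothetical helper. The import is done lazily so this file can be
    loaded without ckip-transformers installed; the first call downloads
    the model weights from the Hugging Face Hub.
    """
    from ckip_transformers.nlp import CkipWordSegmenter

    # model_name points the CKIP driver at a custom checkpoint instead of
    # one of its built-in models.
    ws_driver = CkipWordSegmenter(model_name=MODEL_NAME)
    return ws_driver(texts)
```

Calling segment(["我哋今日去咗旺角食飯。"]) should return one list of words per input sentence. For repeated calls it is cheaper to build the CkipWordSegmenter once and reuse it.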

Training and evaluation data

HKCanCor and CityU were converted to a BI-encoded word segmentation dataset in Hugging Face format using code from finetune-ckip-transformers.
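
To illustrate what BI encoding means here, the sketch below labels each character B (begins a word) or I (inside a word) and then decodes the labels back into words. It is an illustration only, not the conversion code from finetune-ckip-transformers; the function names are hypothetical.

```python
from typing import List, Tuple


def words_to_bi(words: List[str]) -> Tuple[List[str], List[str]]:
    """Turn a list of segmented words into per-character B/I labels."""
    chars: List[str] = []
    labels: List[str] = []
    for word in words:
        if not word:
            continue  # skip empty strings
        chars.extend(word)
        # First character of a word is B, the rest are I.
        labels.extend(["B"] + ["I"] * (len(word) - 1))
    return chars, labels


def bi_to_words(chars: List[str], labels: List[str]) -> List[str]:
    """Rebuild words from characters and their B/I labels."""
    words: List[str] = []
    for ch, label in zip(chars, labels):
        if label == "B" or not words:
            words.append(ch)  # start a new word
        else:
            words[-1] += ch  # extend the current word
    return words


chars, labels = words_to_bi(["我哋", "今日", "去", "旺角"])
print(list(zip(chars, labels)))
print(bi_to_words(chars, labels))
```

With this encoding, word segmentation becomes a two-class token classification task over characters, which is the form a token-classification model is fine-tuned on.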

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 3.0
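
As a rough illustration of the linear schedule listed above, the sketch below computes the learning rate at a given optimizer step. It assumes no warmup (none is listed), a single device, and no gradient accumulation; the dataset size in the example is hypothetical, and the function names are not from any library.

```python
import math


def total_training_steps(num_examples: int, batch_size: int = 8, num_epochs: int = 3) -> int:
    """Optimizer steps for the whole run, assuming one device and no gradient accumulation."""
    return math.ceil(num_examples / batch_size) * num_epochs


def linear_lr(step: int, total_steps: int, base_lr: float = 5e-5, warmup_steps: int = 0) -> float:
    """Learning rate under linear decay to zero, with optional linear warmup."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical dataset size, for illustration only.
steps = total_training_steps(num_examples=8000)
for s in (0, steps // 2, steps):
    print(s, linear_lr(s, steps))
```

The rate starts at 5e-05 and falls in a straight line to zero at the final step, so the last epoch trains with much smaller updates than the first.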

Training results

dataset      token_f   token_p   token_r
ud yue_hk    0.9462    0.9487    0.9437
ud zh_hk     0.9330    0.9402    0.9260
hkcancor     0.9895    0.9880    0.9909
cityu        0.9806    0.9793    0.9818
as           0.9225    0.9183    0.9267

The model was trained on hkcancor, so the hkcancor scores are reported for reference only.
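
The three columns can be cross-checked against each other. The sketch below assumes token_f is the usual F1 score, that is, the harmonic mean of token_p (precision) and token_r (recall); the card does not state this explicitly. Because the published values are rounded to four decimals, small differences in the last digit are expected.

```python
# (token_f, token_p, token_r) as reported in the results table.
RESULTS = {
    "ud yue_hk": (0.9462, 0.9487, 0.9437),
    "ud zh_hk": (0.9330, 0.9402, 0.9260),
    "hkcancor": (0.9895, 0.9880, 0.9909),
    "cityu": (0.9806, 0.9793, 0.9818),
    "as": (0.9225, 0.9183, 0.9267),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for name, (reported_f, p, r) in RESULTS.items():
    recomputed = f1(p, r)
    print(f"{name}: reported={reported_f:.4f} recomputed={recomputed:.4f}")
```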

Framework versions

  • Transformers 4.27.0.dev0
  • PyTorch 1.10.0
  • Datasets 2.10.0
  • Tokenizers 0.13.2

Model size: 108M params (Safetensors; tensor types I64, F32)