language: zh
license: cc-by-sa-4.0
tags:
- word segmentation
datasets:
- ctb6
- as
- cityu
- msra
- pku
- sxu
- cnc
pipeline_tag: token-classification
Multi-criteria BERT base Chinese with Lattice for Word Segmentation
This is a variant of the pre-trained BERT model. It is based on bert-base-chinese, which was pre-trained on Chinese text, and is fine-tuned for word segmentation. This version of the model processes input text at the character level and incorporates word-level information through a lattice structure.
The scripts for the pre-training are available at tchayintr/latte-ptm-ws.
The LATTE scripts are available at tchayintr/latte-ws.
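As a rough illustration of what a character-plus-word lattice over an input sentence can look like, the sketch below enumerates every substring that matches a word list and records it as a span over the character sequence. It is not taken from the LATTE scripts: the function name `build_lattice`, the toy `lexicon`, and the `max_len` cutoff are all assumptions made for this example, and the actual model may construct and encode its lattice differently.

```python
from typing import List, Set, Tuple


def build_lattice(chars: str, lexicon: Set[str], max_len: int = 4) -> List[Tuple[int, int, str]]:
    """Return (start, end, word) spans for lexicon words found in `chars`.

    `end` is exclusive. Only multi-character candidates are listed, because
    the single characters already form the base sequence of the lattice.
    """
    spans = []
    n = len(chars)
    for start in range(n):
        # Try every candidate of length 2..max_len starting at this character.
        for end in range(start + 2, min(n, start + max_len) + 1):
            word = chars[start:end]
            if word in lexicon:
                spans.append((start, end, word))
    return spans


# Hypothetical toy lexicon, for illustration only.
lexicon = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥"}
sentence = "南京市长江大桥"

lattice = build_lattice(sentence, lexicon)
for start, end, word in lattice:
    print(start, end, word)
```

The overlapping spans (for example 南京市 and 市长 both covering 市) are the ambiguity that a segmenter has to resolve; the lattice only exposes the candidates alongside the characters.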
Model architecture
The model architecture is described in this paper.
Training Data
The model is trained on multiple Chinese word segmentation datasets, including ctb6, sighan2005 (as, cityu, msra, pku), sighan2008 (sxu), and cnc. The datasets can be accessed from here.
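Word-segmented corpora such as these are commonly turned into per-character labels so that segmentation can be treated as token classification. The sketch below uses the widely used B/M/E/S scheme (begin, middle, end, single) as an assumption; this card does not state which tag set the model actually uses, and the helper names `words_to_bmes` and `bmes_to_words` are hypothetical.

```python
from typing import List


def words_to_bmes(words: List[str]) -> List[str]:
    """Convert a segmented sentence (list of words) to one tag per character."""
    tags = []
    for word in words:
        if not word:
            continue  # skip empty tokens defensively
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def bmes_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and their predicted tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(current)
            current = ""
    if current:  # keep a trailing unfinished word rather than dropping it
        words.append(current)
    return words


gold = ["南京市", "长江大桥"]
tags = words_to_bmes(gold)
print(tags)  # ['B', 'M', 'E', 'B', 'M', 'M', 'E']
print(bmes_to_words("".join(gold), tags))  # ['南京市', '长江大桥']
```

Because the datasets follow different segmentation criteria, the same character sequence can receive different gold tags depending on which corpus it comes from.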
Licenses
The pre-trained model is distributed under the terms of the Creative Commons Attribution-ShareAlike 4.0 license (cc-by-sa-4.0).
Acknowledgments
This model was trained on GPU servers provided by the Okumura-Funakoshi NLP Group.