---
language: zh
license: cc-by-sa-4.0
tags:
- word segmentation
datasets:
- ctb6
- as
- cityu
- msra
- pku
- sxu
- cnc
pipeline_tag: token-classification
---
# Multi-criteria BERT base Chinese with Lattice for Word Segmentation
This is a variant of the pre-trained [BERT](https://github.com/google-research/bert) model.
The model was pre-trained on Chinese text and fine-tuned for word segmentation, starting from [bert-base-chinese](https://huggingface.co/bert-base-chinese).
This version of the model processes input text at the character level, with word-level information incorporated through a lattice structure.
The pre-training scripts are available at [tchayintr/latte-ptm-ws](https://github.com/tchayintr/latte-ptm-ws).
The LATTE scripts are available at [tchayintr/latte-ws](https://github.com/tchayintr/latte-ws).
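
To make the character-plus-word lattice input more concrete, here is a minimal sketch of how candidate word spans might be collected over a character sequence. It is an illustration only, not the LATTE implementation: the function `build_lattice`, the `max_len` limit, and the toy lexicon are all hypothetical, and the actual model may build and encode its lattice differently.

```python
def build_lattice(sentence, lexicon, max_len=4):
    """Collect (start, end, text) spans: every character, plus every
    substring of up to max_len characters found in the lexicon.

    Hypothetical helper for illustration; not part of latte-ws.
    """
    spans = []
    n = len(sentence)
    for start in range(n):
        for end in range(start + 1, min(n, start + max_len) + 1):
            piece = sentence[start:end]
            # Single characters are always kept; longer pieces need a lexicon hit.
            if end - start == 1 or piece in lexicon:
                spans.append((start, end, piece))
    return spans


# Toy lexicon (assumption): real systems would use a much larger word list.
lexicon = {"北京", "大学", "北京大学", "大学生"}
lattice = build_lattice("北京大学生", lexicon)

for start, end, piece in lattice:
    print(start, end, piece)
```

The overlapping spans (for example `北京大学` and `大学生`) are what make this a lattice rather than a single segmentation: the model sees competing word candidates alongside the characters and has to decide between them.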
## Model architecture
The model architecture is described in this [paper](https://www.jstage.jst.go.jp/article/jnlp/30/2/30_456/_article/-char/ja).
## Training Data
The model is trained on multiple Chinese word segmentation datasets: ctb6, sighan2005 (as, cityu, msra, pku), sighan2008 (sxu), and cnc.
The datasets are available in the [hankcs/multi-criteria-cws](https://github.com/hankcs/multi-criteria-cws/tree/master/data) repository.
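
Since the model card lists the task as token classification, segmented text has to be mapped to per-character labels at some point. The sketch below shows one common way to do that, assuming whitespace-separated words as input and a B/M/E/S tag scheme. Both are assumptions for illustration: this card does not state the label set the model actually uses, and the helper names `words_to_bmes` and `bmes_to_words` are hypothetical.

```python
def words_to_bmes(words):
    """Map a list of words to (characters, tags) using B/M/E/S labels.

    B = begin, M = middle, E = end of a multi-character word; S = single.
    The tag scheme is an assumption, not confirmed by the model card.
    """
    chars, tags = [], []
    for word in words:
        if not word:
            continue
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
        chars.extend(word)
    return chars, tags


def bmes_to_words(chars, tags):
    """Invert words_to_bmes: close a word at every E or S tag."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends mid-word
        words.append(current)
    return words


line = "北京大学 的 学生"  # toy example, not taken from the datasets
chars, tags = words_to_bmes(line.split())
print(list(zip(chars, tags)))
print(bmes_to_words(chars, tags))
```

Because each corpus follows its own segmentation criterion, the same sentence can yield different tag sequences depending on the dataset it comes from, which is the motivation for multi-criteria training.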
## Licenses
The pre-trained model is distributed under the terms of the [Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license.
## Acknowledgments
This model was trained on GPU servers provided by the [Okumura-Funakoshi NLP Group](https://lr-www.pi.titech.ac.jp).