metadata

language: zh
license: cc-by-sa-4.0
tags:
  - word segmentation
datasets:
  - ctb6
  - as
  - cityu
  - msra
  - pku
  - sxu
  - cnc
pipeline_tag: token-classification

Multi-criteria BERT base Chinese with Lattice for Word Segmentation

This is a variant of the pre-trained BERT model. It was pre-trained on Chinese text and fine-tuned for word segmentation, starting from bert-base-chinese. This version processes input text at the character level and incorporates word-level information through a lattice structure.
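To make the idea of a character-and-word lattice concrete, the following is a minimal sketch and not the LATTE implementation. It assumes the lattice is built by matching substrings of the input against a word lexicon; the function name `build_lattice`, the `max_word_len` parameter, and the toy lexicon are all hypothetical and chosen only for illustration.

```python
from typing import List, Set, Tuple


def build_lattice(
    text: str, lexicon: Set[str], max_word_len: int = 4
) -> List[Tuple[int, int, str]]:
    """Return lattice nodes as (start, end, token) spans, with end exclusive.

    Every character becomes a node, and every substring of two or more
    characters found in the lexicon becomes an additional word node.
    """
    spans = []
    for start in range(len(text)):
        # Character-level node: always present.
        spans.append((start, start + 1, text[start]))
        # Word-level nodes: multi-character substrings found in the lexicon.
        last_end = min(len(text), start + max_word_len)
        for end in range(start + 2, last_end + 1):
            candidate = text[start:end]
            if candidate in lexicon:
                spans.append((start, end, candidate))
    return spans


# Hypothetical toy lexicon, for illustration only.
toy_lexicon = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}
lattice = build_lattice("南京市长江大桥", toy_lexicon)

for start, end, token in lattice:
    print(start, end, token)
```

In this sketch, overlapping candidates such as 南京市 and 市长 both appear in the lattice; choosing between them is left to the model, which is the reason for exposing word-level candidates alongside characters.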

The pre-training scripts are available at tchayintr/latte-ptm-ws.

The LATTE scripts are available at tchayintr/latte-ws.

Model architecture

The model architecture is described in this paper.

Training Data

The model is trained on multiple Chinese word segmentation datasets: ctb6, sighan2005 (as, cityu, msra, pku), sighan2008 (sxu), and cnc. The datasets can be accessed from here.

Licenses

The pre-trained model is distributed under the terms of the Creative Commons Attribution-ShareAlike 4.0 license (CC BY-SA 4.0).

Acknowledgments

This model was trained on GPU servers provided by the Okumura-Funakoshi NLP Group.