Pseudo-Native-BART-CGEC
This is a Chinese grammatical error correction (CGEC) model based on Chinese BART-large. It was trained on pseudo native-speaker CGEC training data (about 100M) generated by heuristic rules, together with human-annotated training data for the media domain. More details can be found in our GitHub repository and the paper.
Usage
pip install transformers
from transformers import BertTokenizer, BartForConditionalGeneration

# This checkpoint is loaded with BertTokenizer rather than BartTokenizer.
tokenizer = BertTokenizer.from_pretrained("HillZhang/pseudo_native_bart_CGEC_media")
model = BartForConditionalGeneration.from_pretrained("HillZhang/pseudo_native_bart_CGEC_media")

# Example inputs; each sentence contains an error for the model to correct.
sentences = ["北京是中国的都。", "他说:”我最爱的运动是打蓝球“", "我每天大约喝5次水左右。", "今天,我非常开开心。"]
encoded_input = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
# BertTokenizer may add token_type_ids, which BART's generate() does not use.
if "token_type_ids" in encoded_input:
    del encoded_input["token_type_ids"]
output = model.generate(**encoded_input)
print(tokenizer.batch_decode(output, skip_special_tokens=True))
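Strings decoded with a BERT-style tokenizer are usually returned with a space between tokens, so the corrected sentences may need light post-processing before they are compared with the inputs. The following is a minimal sketch of that step, not part of the released code: `strip_token_spaces` and `list_edits` are hypothetical helper names, and the `decoded` string is a made-up illustration rather than real model output. Removing every space is only safe for text that contains no meaningful spaces, such as pure Chinese sentences.

```python
import difflib


def strip_token_spaces(decoded: str) -> str:
    """Remove the spaces a BERT-style tokenizer inserts between tokens."""
    # Assumption: the text has no meaningful spaces (e.g. no English phrases).
    return decoded.replace(" ", "")


def list_edits(source: str, corrected: str):
    """Return (operation, source_span, corrected_span) for each changed span."""
    matcher = difflib.SequenceMatcher(a=source, b=corrected, autojunk=False)
    return [
        (tag, source[i1:i2], corrected[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


source = "北京是中国的都。"
# Hypothetical decoded string for illustration; not actual model output.
decoded = "北 京 是 中 国 的 首 都 。"

corrected = strip_token_spaces(decoded)
print(corrected)
print(list_edits(source, corrected))
```

Listing the character-level edits this way makes it easier to review what a correction changed, for example when spot-checking a batch of outputs by hand.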
Citation
@inproceedings{zhang-etal-2023-nasgec,
title = "{Na}{SGEC}: a Multi-Domain Chinese Grammatical Error Correction Dataset from Native Speaker Texts",
author = "Zhang, Yue and
Zhang, Bo and
Jiang, Haochen and
Li, Zhenghua and
Li, Chen and
Huang, Fei and
Zhang, Min",
booktitle = "Findings of ACL",
year = "2023"
}