---
language:
- vi
---
viBERT base model (cased)
viBERT is a model pretrained on Vietnamese text with a masked language modeling (MLM) objective.
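As an MLM-pretrained model, viBERT can predict masked tokens out of the box. Below is a minimal sketch using the Hugging Face `transformers` fill-mask pipeline; the repo id `FPTAI/vibert-base-cased` is an assumption, as this card does not state where the checkpoint is hosted.

```python
# Minimal sketch of masked-token prediction with viBERT.
# NOTE: the repo id "FPTAI/vibert-base-cased" is an assumption, not stated in this card.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="FPTAI/vibert-base-cased")

# Predict the masked token in a Vietnamese sentence
# ("Hà Nội là [MASK] của Việt Nam." = "Hanoi is the [MASK] of Vietnam.")
for prediction in fill_mask("Hà Nội là [MASK] của Việt Nam."):
    print(prediction["token_str"], prediction["score"])
```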
Model Details
Model Description
viBERT is based on mBERT. As such, it retains mBERT's architecture of 12 layers, 768 hidden units, and 12 attention heads, and it likewise uses a WordPiece tokenizer. To specialize the model for Vietnamese, the authors collected a corpus from Vietnamese online newspapers amounting to approximately 10 GB of text. They reduced the original mBERT vocabulary to the tokens that occur in this Vietnamese corpus, yielding a vocabulary of 38,168 tokens. The model was then further pretrained on the Vietnamese corpus.
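The architecture and vocabulary figures above can be checked by loading the model and inspecting its configuration. The sketch below again assumes the repo id `FPTAI/vibert-base-cased`; the commented values are those claimed in the description.

```python
# Load the tokenizer and encoder, then inspect the configuration.
# NOTE: the repo id "FPTAI/vibert-base-cased" is an assumption, not stated in this card.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FPTAI/vibert-base-cased")
model = AutoModel.from_pretrained("FPTAI/vibert-base-cased")

print(model.config.num_hidden_layers)    # 12 layers
print(model.config.hidden_size)          # 768 hidden units
print(model.config.num_attention_heads)  # 12 attention heads
print(model.config.vocab_size)           # 38168 (reduced mBERT vocabulary)
```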
- Model type: BERT
- Language(s) (NLP): Vietnamese
- Finetuned from model: https://huggingface.co/google-bert/bert-base-multilingual-cased
Model Sources
- Repository: https://github.com/fpt-corp/viBERT
- Paper: Improving Sequence Tagging for Vietnamese Text using Transformer-based Neural Models (https://aclanthology.org/2020.paclic-1.2)
Citation
BibTeX:
@inproceedings{bui-etal-2020-improving,
    title = "Improving Sequence Tagging for {V}ietnamese Text using Transformer-based Neural Models",
    author = "Bui, The Viet and
      Tran, Thi Oanh and
      Le-Hong, Phuong",
    editor = "Nguyen, Minh Le and
      Luong, Mai Chi and
      Song, Sanghoun",
    booktitle = "Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation",
    month = oct,
    year = "2020",
    address = "Hanoi, Vietnam",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.paclic-1.2",
    pages = "13--20",
}
APA:
Bui, T. V., Tran, T. O., & Le-Hong, P. (2020, October). Improving sequence tagging for Vietnamese text using transformer-based neural models. In Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation (pp. 13-20). Association for Computational Linguistics.