---
language:
- vi
---

# viBERT base model (cased)

viBERT is a pretrained language model for Vietnamese, trained with a masked language modeling (MLM) objective.

## Model Details

### Model Description

viBERT is based on [mBERT](https://huggingface.co/google-bert/bert-base-multilingual-cased). As such, it retains mBERT's architecture of 12 layers, 768 hidden units, and 12 attention heads, and it likewise uses a WordPiece tokenizer. To specialize the model for Vietnamese, the authors collected a corpus of approximately 10 GB of text from Vietnamese online newspapers. They reduced the original mBERT vocabulary to the tokens that occur in this corpus, resulting in a vocabulary size of 38,168, and then continued pre-training the model on the Vietnamese data.
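
The paper describes this vocabulary reduction only at a high level. The following is a minimal sketch of one way to reproduce the filtering step with the Hugging Face `transformers` tokenizer API; the corpus is a one-line placeholder, and the helper `used_token_ids` is an illustrative name, not the authors' script.

```python
from transformers import AutoTokenizer

# Start from the original mBERT tokenizer whose vocabulary is to be reduced.
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-multilingual-cased")

def used_token_ids(corpus_lines):
    """Collect the ids of every WordPiece that occurs in the corpus."""
    used = set(tokenizer.all_special_ids)  # always keep [CLS], [SEP], [MASK], ...
    for line in corpus_lines:
        used.update(tokenizer(line, add_special_tokens=False)["input_ids"])
    return used

# Placeholder for the ~10 GB newspaper corpus described above.
corpus = ["Hà Nội là thủ đô của Việt Nam."]

kept = used_token_ids(corpus)
vocab = tokenizer.get_vocab()  # maps token -> id
# Keep the surviving entries in their original id order.
reduced = [tok for tok, idx in sorted(vocab.items(), key=lambda kv: kv[1]) if idx in kept]
print(f"Reduced vocabulary size: {len(reduced)}")  # the paper reports 38,168 on the full corpus
```

Shrinking the vocabulary this way also requires slicing the token-embedding matrix down to the kept ids before pre-training is continued, so that embedding rows and token ids stay aligned.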

- **Model type:** BERT
- **Language(s) (NLP):** Vietnamese
- **Finetuned from model:** https://huggingface.co/google-bert/bert-base-multilingual-cased
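
Because viBERT keeps the standard BERT MLM head, it can be tried out directly with the `transformers` fill-mask pipeline. The snippet below is a minimal sketch; the model id `FPTAI/vibert-base-cased` is an assumption, so substitute the identifier of the repository this card is hosted in.

```python
from transformers import pipeline

# Load viBERT for masked-token prediction.
# NOTE: the model id below is an assumption; replace it with the actual
# repository id of this model card if it differs.
fill_mask = pipeline("fill-mask", model="FPTAI/vibert-base-cased")

# viBERT uses the standard BERT [MASK] token.
for prediction in fill_mask("Hà Nội là thủ đô của [MASK] Nam."):
    print(prediction["token_str"], round(prediction["score"], 4))
```

For downstream sequence-tagging tasks like those in the paper, the same checkpoint can instead be loaded with `AutoModelForTokenClassification` and fine-tuned.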

### Model Sources

- **Repository:** https://github.com/fpt-corp/viBERT
- **Paper:** [Improving Sequence Tagging for Vietnamese Text using Transformer-based Neural Models](https://aclanthology.org/2020.paclic-1.2/)

## Citation

**BibTeX:**

```tex
@inproceedings{bui-etal-2020-improving,
    title = "Improving Sequence Tagging for {V}ietnamese Text using Transformer-based Neural Models",
    author = "Bui, The Viet and
      Tran, Thi Oanh and
      Le-Hong, Phuong",
    editor = "Nguyen, Minh Le and
      Luong, Mai Chi and
      Song, Sanghoun",
    booktitle = "Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation",
    month = oct,
    year = "2020",
    address = "Hanoi, Vietnam",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.paclic-1.2",
    pages = "13--20",
}
```

**APA:**

Bui, T. V., Tran, T. O., & Le-Hong, P. (2020, October). Improving sequence tagging for Vietnamese text using transformer-based neural models. In *Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation* (pp. 13–20).

## Model Card Authors

[@phucdev](https://github.com/phucdev) |