---
language:
- vi
---
# viBERT base model (cased)
<!-- Provide a quick summary of what the model is/does. -->
viBERT is a pretrained model for Vietnamese using a masked language modeling (MLM) objective.
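As a toy illustration of the MLM objective (this is not the authors' code), the standard BERT masking recipe selects 15% of input tokens as prediction targets; of those, 80% are replaced with `[MASK]`, 10% with a random vocabulary token, and 10% are left unchanged:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Sketch of the standard BERT MLM masking recipe."""
    rng = rng or random.Random(0)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                masked.append("[MASK]")       # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)            # 10%: keep the original token
        else:
            labels.append(None)  # not a prediction target
            masked.append(tok)
    return masked, labels
```

The function names and signature are illustrative; real implementations operate on token ids and use an ignore-index label rather than `None`.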
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
viBERT is based on [mBERT](https://huggingface.co/google-bert/bert-base-multilingual-cased).
It retains mBERT's architecture of 12 layers, 768 hidden units, and 12 attention heads, and likewise uses a WordPiece tokenizer.
To specialize the model to Vietnamese, the authors collected approximately 10 GB of text from Vietnamese online newspapers.
They reduced the original mBERT vocabulary to the tokens that occur in this Vietnamese corpus, yielding a vocabulary of 38,168 tokens.
The model was then further pre-trained on the Vietnamese data.
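The vocabulary-reduction step can be sketched as follows (an illustrative sketch, not the authors' script): keep only the entries of the original multilingual vocabulary that actually occur when tokenizing the Vietnamese corpus, always retaining the special tokens, then reassign token ids.

```python
SPECIAL_TOKENS = ("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]")

def reduce_vocab(original_vocab, corpus_token_counts, special_tokens=SPECIAL_TOKENS):
    """Keep only tokens seen in the target-language corpus, plus special tokens.

    original_vocab: list of tokens in the original (multilingual) vocabulary.
    corpus_token_counts: dict mapping token -> occurrence count in the corpus.
    Returns a new token -> id mapping for the reduced vocabulary.
    """
    kept = [t for t in original_vocab
            if t in special_tokens or corpus_token_counts.get(t, 0) > 0]
    return {tok: i for i, tok in enumerate(kept)}
```

In practice the corresponding rows of the embedding matrix are also sliced out and re-indexed, so the reduced model keeps the pretrained embeddings for every retained token.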
- **Model type:** BERT
- **Language(s) (NLP):** Vietnamese
- **Finetuned from model:** https://huggingface.co/google-bert/bert-base-multilingual-cased
### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** https://github.com/fpt-corp/viBERT
- **Paper:** [Improving Sequence Tagging for Vietnamese Text using Transformer-based Neural Models](https://aclanthology.org/2020.paclic-1.2/)
## Citation
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
```tex
@inproceedings{bui-etal-2020-improving,
title = "Improving Sequence Tagging for {V}ietnamese Text using Transformer-based Neural Models",
author = "Bui, The Viet and
Tran, Thi Oanh and
Le-Hong, Phuong",
editor = "Nguyen, Minh Le and
Luong, Mai Chi and
Song, Sanghoun",
booktitle = "Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation",
month = oct,
year = "2020",
address = "Hanoi, Vietnam",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.paclic-1.2",
pages = "13--20",
}
```
**APA:**
Bui, T. V., Tran, T. O., & Le-Hong, P. (2020, October). Improving sequence tagging for Vietnamese text using transformer-based neural models. In *Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation* (pp. 13-20).
## Model Card Authors
[@phucdev](https://github.com/phucdev)