---
language:
- vi
---

# viBERT base model (cased)

viBERT is a pretrained model for Vietnamese using a masked language modeling (MLM) objective.

## Model Details

### Model Description

viBERT is based on [mBERT](https://huggingface.co/google-bert/bert-base-multilingual-cased). As such, it retains the mBERT architecture of 12 layers, 768 hidden units, and 12 attention heads, and it also uses a WordPiece tokenizer. To specialize the model to Vietnamese, the authors collected approximately 10 GB of text from Vietnamese online newspapers. They reduced the original mBERT vocabulary to the tokens that occur in this Vietnamese corpus, yielding a vocabulary size of 38,168, and then continued pretraining the model on the Vietnamese data.

- **Model type:** BERT
- **Language(s) (NLP):** Vietnamese
- **Finetuned from model:** https://huggingface.co/google-bert/bert-base-multilingual-cased

### Model Sources

- **Repository:** https://github.com/fpt-corp/viBERT
- **Paper:** [Improving Sequence Tagging for Vietnamese Text using Transformer-based Neural Models](https://aclanthology.org/2020.paclic-1.2/)

## Citation

**BibTeX:**

```tex
@inproceedings{bui-etal-2020-improving,
    title = "Improving Sequence Tagging for {V}ietnamese Text using Transformer-based Neural Models",
    author = "Bui, The Viet and Tran, Thi Oanh and Le-Hong, Phuong",
    editor = "Nguyen, Minh Le and Luong, Mai Chi and Song, Sanghoun",
    booktitle = "Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation",
    month = oct,
    year = "2020",
    address = "Hanoi, Vietnam",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.paclic-1.2",
    pages = "13--20",
}
```

**APA:**

Bui, T. V., Tran, T. O., & Le-Hong, P. (2020, October). Improving sequence tagging for Vietnamese text using transformer-based neural models. In *Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation* (pp. 13-20).

## Model Card Authors

[@phucdev](https://github.com/phucdev)
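
## How to Get Started with the Model

Because viBERT was pretrained with an MLM objective, it can be loaded for masked-token prediction via the `transformers` library. The sketch below is a minimal example, assuming the weights are published on the Hugging Face Hub under the ID `FPTAI/vibert-base-cased` (substitute the actual model ID if it differs); the example sentence is illustrative only.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Assumed Hub ID; replace with the actual model ID if it differs.
model_id = "FPTAI/vibert-base-cased"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# "Hà Nội là thủ đô của [MASK]." = "Hanoi is the capital of [MASK]."
text = f"Hà Nội là thủ đô của {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Take the highest-scoring vocabulary token at the masked position.
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```

For downstream sequence-tagging tasks such as those in the paper, the same checkpoint can instead be loaded with `AutoModelForTokenClassification` and fine-tuned on labeled data.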