---
language:
- vi
---
viBERT base model (cased)
viBERT is a model pretrained on Vietnamese text with a masked language modeling (MLM) objective.
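As an MLM-pretrained model, viBERT can predict masked tokens out of the box. Below is a minimal sketch using the Hugging Face `transformers` fill-mask pipeline; the repo id `FPTAI/vibert-base-cased` is an assumption, as this card does not state where the checkpoint is hosted.

```python
# Minimal sketch of masked-token prediction with viBERT.
# NOTE: the repo id "FPTAI/vibert-base-cased" is an assumption, not stated in this card.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="FPTAI/vibert-base-cased")

# Predict the masked token in a Vietnamese sentence
# ("Hà Nội là [MASK] của Việt Nam." = "Hanoi is the [MASK] of Vietnam.")
for prediction in fill_mask("Hà Nội là [MASK] của Việt Nam."):
    print(prediction["token_str"], prediction["score"])
```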
Model Details
Model Description
viBERT is based on mBERT. As such, it retains mBERT's architecture of 12 layers, 768 hidden units, and 12 attention heads, and it likewise uses a WordPiece tokenizer. To specialize the model for Vietnamese, the authors collected a corpus from Vietnamese online newspapers amounting to approximately 10 GB of text. They reduced the original mBERT vocabulary to the tokens that occur in this Vietnamese corpus, yielding a vocabulary of 38,168 tokens. The model was then further pretrained on the Vietnamese corpus.
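The architecture and vocabulary figures above can be checked by loading the model and inspecting its configuration. The sketch below again assumes the repo id `FPTAI/vibert-base-cased`; the commented values are those claimed in the description.

```python
# Load the tokenizer and encoder, then inspect the configuration.
# NOTE: the repo id "FPTAI/vibert-base-cased" is an assumption, not stated in this card.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FPTAI/vibert-base-cased")
model = AutoModel.from_pretrained("FPTAI/vibert-base-cased")

print(model.config.num_hidden_layers)    # 12 layers
print(model.config.hidden_size)          # 768 hidden units
print(model.config.num_attention_heads)  # 12 attention heads
print(model.config.vocab_size)           # 38168 (reduced mBERT vocabulary)
```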
- Model type: BERT
- Language(s) (NLP): Vietnamese
- Finetuned from model: https://huggingface.co/google-bert/bert-base-multilingual-cased
Model Sources
- Repository: https://github.com/fpt-corp/viBERT
- Paper: Improving Sequence Tagging for Vietnamese Text using Transformer-based Neural Models (https://aclanthology.org/2020.paclic-1.2)
Citation
BibTeX:
@inproceedings{bui-etal-2020-improving,
    title = "Improving Sequence Tagging for {V}ietnamese Text using Transformer-based Neural Models",
    author = "Bui, The Viet and
      Tran, Thi Oanh and
      Le-Hong, Phuong",
    editor = "Nguyen, Minh Le and
      Luong, Mai Chi and
      Song, Sanghoun",
    booktitle = "Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation",
    month = oct,
    year = "2020",
    address = "Hanoi, Vietnam",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.paclic-1.2",
    pages = "13--20",
}
APA:
Bui, T. V., Tran, T. O., & Le-Hong, P. (2020, October). Improving sequence tagging for Vietnamese text using transformer-based neural models. In Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation (pp. 13-20). Association for Computational Linguistics.