---
language:
  - vi
---

# viBERT base model (cased)

viBERT is a pretrained model for Vietnamese using a masked language modeling (MLM) objective.
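
As a quick check of the MLM head, the model can be queried through the 🤗 Transformers fill-mask pipeline. This is a minimal sketch; the model identifier `FPTAI/vibert-base-cased` is an assumption and should be replaced with the actual repository path if it differs.

```python
from transformers import pipeline

# NOTE: the model ID below is an assumption; point it at the actual repository.
fill_mask = pipeline("fill-mask", model="FPTAI/vibert-base-cased")

# "Hà Nội là thủ đô của [MASK]." ~ "Hanoi is the capital of [MASK]."
for prediction in fill_mask("Hà Nội là thủ đô của [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 4))
```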

## Model Details

### Model Description

viBERT is based on mBERT and retains its architecture: 12 layers, 768 hidden units, and 12 attention heads, with a WordPiece tokenizer. To specialize the model for Vietnamese, the authors collected approximately 10 GB of text from Vietnamese online newspapers. They pruned the original mBERT vocabulary to the tokens that occur in this Vietnamese corpus, yielding a vocabulary of 38,168 entries, and then continued pretraining the model on the Vietnamese data.
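
The architecture and vocabulary figures above can be verified directly from the published checkpoint's config and tokenizer. A minimal sketch, again assuming the `FPTAI/vibert-base-cased` model ID:

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "FPTAI/vibert-base-cased"  # assumed repository path

config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# mBERT-style architecture: 12 layers, 768 hidden units, 12 attention heads
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)

# Vocabulary pruned to tokens seen in the Vietnamese corpus (38,168 entries)
print(tokenizer.vocab_size)
```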

## Model Sources

- **Paper:** [Improving Sequence Tagging for Vietnamese Text using Transformer-based Neural Models](https://aclanthology.org/2020.paclic-1.2)

## Citation

**BibTeX:**

```bibtex
@inproceedings{bui-etal-2020-improving,
    title = "Improving Sequence Tagging for {V}ietnamese Text using Transformer-based Neural Models",
    author = "Bui, The Viet  and
      Tran, Thi Oanh  and
      Le-Hong, Phuong",
    editor = "Nguyen, Minh Le  and
      Luong, Mai Chi  and
      Song, Sanghoun",
    booktitle = "Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation",
    month = oct,
    year = "2020",
    address = "Hanoi, Vietnam",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.paclic-1.2",
    pages = "13--20",
}
```

**APA:**

Bui, T. V., Tran, T. O., & Le-Hong, P. (2020, October). Improving sequence tagging for Vietnamese text using transformer-based neural models. In Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation (pp. 13–20).

## Model Card Authors

@phucdev