---
language:
- vi
---

# viBERT base model (cased)

<!-- Provide a quick summary of what the model is/does. -->

viBERT is a pretrained model for Vietnamese using a masked language modeling (MLM) objective.

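This card does not state the Hub id the checkpoint is published under. The snippet below is a minimal usage sketch that assumes an id of `FPTAI/vibert-base-cased`; substitute the id of this repository if it differs. Because viBERT was trained with MLM, it can be queried directly through the `fill-mask` pipeline:

```python
# Minimal MLM usage sketch. The model id below is an assumption, not taken
# from this card -- substitute the actual Hub id of this checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="FPTAI/vibert-base-cased")

# viBERT keeps BERT's special tokens, so the mask token is "[MASK]".
for prediction in fill_mask("Hà Nội là thủ đô của [MASK] Nam."):
    print(prediction["token_str"], round(prediction["score"], 4))
```

Each prediction is a candidate WordPiece for the masked position together with its score.
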
## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

viBERT is based on [mBERT](https://huggingface.co/google-bert/bert-base-multilingual-cased).
As such, it retains the architecture of 12 layers, 768 hidden units, and 12 attention heads, and it likewise uses a WordPiece tokenizer.
To specialize the model for Vietnamese, the authors collected a dataset from Vietnamese online newspapers, amounting to approximately 10 GB of text.
They reduced the original mBERT vocabulary to the tokens that occur in this Vietnamese pre-training dataset, resulting in a vocabulary size of 38,168.
The model was then further pre-trained on the Vietnamese data.

- **Model type:** BERT
- **Language(s) (NLP):** Vietnamese
- **Finetuned from model:** https://huggingface.co/google-bert/bert-base-multilingual-cased

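As a quick sanity check (again assuming the hypothetical `FPTAI/vibert-base-cased` id used above), the figures quoted in this description can be read back from the published config and tokenizer:

```python
# Sanity-check sketch: read the architecture and vocabulary size described
# above from the released files. The model id is an assumption (see above).
from transformers import AutoConfig, AutoTokenizer

model_id = "FPTAI/vibert-base-cased"
config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

print(config.num_hidden_layers)    # expected: 12
print(config.hidden_size)          # expected: 768
print(config.num_attention_heads)  # expected: 12
print(len(tokenizer))              # expected: 38168 (reduced WordPiece vocabulary)
```
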
### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/fpt-corp/viBERT
- **Paper:** [Improving Sequence Tagging for Vietnamese Text using Transformer-based Neural Models](https://aclanthology.org/2020.paclic-1.2/) (see the token-classification sketch below)

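The paper linked above applies viBERT to Vietnamese sequence tagging. The snippet below is only an illustrative sketch of that kind of downstream use: the model id is the same assumption as above, the tag set is a toy example, and the freshly initialised classification head would need fine-tuning before its outputs are meaningful.

```python
# Illustrative token-classification (sequence tagging) sketch. The model id
# and the tag set are placeholders; the classification head added here is
# randomly initialised, so the printed tags are meaningless until the model
# is fine-tuned on a labelled Vietnamese tagging dataset.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_id = "FPTAI/vibert-base-cased"                 # assumed id, see above
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]   # toy NER-style tag set

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=len(labels))

inputs = tokenizer("Ông Nam sống ở Hà Nội.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                  # (1, sequence_length, num_labels)

predicted = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, label_id in zip(tokens, predicted):
    print(token, labels[label_id])
```
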

## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

```tex
@inproceedings{bui-etal-2020-improving,
    title = "Improving Sequence Tagging for {V}ietnamese Text using Transformer-based Neural Models",
    author = "Bui, The Viet  and
      Tran, Thi Oanh  and
      Le-Hong, Phuong",
    editor = "Nguyen, Minh Le  and
      Luong, Mai Chi  and
      Song, Sanghoun",
    booktitle = "Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation",
    month = oct,
    year = "2020",
    address = "Hanoi, Vietnam",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.paclic-1.2",
    pages = "13--20",
}
```

**APA:**

Bui, T. V., Tran, T. O., & Le-Hong, P. (2020, October). Improving sequence tagging for Vietnamese text using transformer-based neural models. In *Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation* (pp. 13-20).


## Model Card Authors

[@phucdev](https://github.com/phucdev)