ilos-vigil committed
Commit 3f0d9f8 • Parent(s): a397d92

Upload checkpoint 8 model and tensorboard training logs

Files changed:
- README.md +123 -14
- config.json +1 -0
- pytorch_model.bin +1 -1
- runs/joined_logs/events.out.tfevents.1671528643.pop-os.46984.0 +3 -0

README.md
CHANGED
|
2 |
language: id
|
3 |
license: mit
|
4 |
datasets:
|
5 |
+
- oscar
|
6 |
+
- wikipedia
|
7 |
+
- id_newspapers_2018
|
8 |
widget:
|
9 |
+
- text: "Saya [MASK] makan nasi goreng."
|
10 |
+
- text: "Kucing itu sedang bermain dengan [MASK]."
|
11 |
---
|
12 |
|
13 |
# Indonesian small BigBird model
|
14 |
|
15 |
+
## Source Code
|
16 |
+
|
17 |
+
Source code to create this model is available at [https://github.com/ilos-vigil/bigbird-small-indonesian](https://github.com/ilos-vigil/bigbird-small-indonesian).
|
18 |
|
19 |
## Model Description
|
20 |
|
21 |
+
This **cased** model has been pretrained with Masked LM objective. It has ~30M parameters and was pretrained with 8 epoch/51474 steps with 2.078 eval loss (7.988 perplexity). Architecture of this model is shown in the configuration snippet below. The tokenizer was trained with whole dataset with 30K vocabulary size.

```py
from transformers import BigBirdConfig

config = BigBirdConfig(
    vocab_size = 30_000,
    hidden_size = 512,
    # ... remaining hyperparameters not shown in this excerpt
)
```
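The ~30M figure can be checked against the published checkpoint with a couple of lines; a minimal sketch:

```py
from transformers import BigBirdForMaskedLM

# Download the checkpoint and count its parameters.
model = BigBirdForMaskedLM.from_pretrained('ilos-vigil/bigbird-small-indonesian')
print(f'{sum(p.numel() for p in model.parameters()):,} parameters')  # roughly 30M
```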
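The tokenizer training step itself is not shown in this card. As an illustration only, a 30K-vocabulary SentencePiece model (the tokenizer family BigBird models typically use) could be trained roughly as below, where `corpus.txt` is a hypothetical dump of the combined datasets; the actual recipe lives in the linked GitHub repository:

```py
import sentencepiece as spm

# Hypothetical sketch: corpus.txt stands in for the combined
# OSCAR / Wikipedia / id_newspapers_2018 text.
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='bigbird_id',
    vocab_size=30_000,
)
```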
## How to use
> Inference with Transformers pipeline (one MASK token)

```py
>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='ilos-vigil/bigbird-small-indonesian')
>>> pipe('Saya sedang bermain [MASK] teman saya.')
[{'score': 0.7199566960334778,
  'token': 14,
  'token_str': 'dengan',
  'sequence': 'Saya sedang bermain dengan teman saya.'},
 {'score': 0.12370546162128448,
  'token': 17,
  'token_str': 'untuk',
  'sequence': 'Saya sedang bermain untuk teman saya.'},
 {'score': 0.0385284349322319,
  'token': 331,
  'token_str': 'bersama',
  'sequence': 'Saya sedang bermain bersama teman saya.'},
 {'score': 0.012146958149969578,
  'token': 28,
  'token_str': 'oleh',
  'sequence': 'Saya sedang bermain oleh teman saya.'},
 {'score': 0.009499032981693745,
  'token': 25,
  'token_str': 'sebagai',
  'sequence': 'Saya sedang bermain sebagai teman saya.'}]
```
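Depending on your `transformers` version, the fill-mask pipeline also accepts a `top_k` argument to control how many candidates are returned; an illustrative call, not from the original card:

```py
>>> pipe('Saya sedang bermain [MASK] teman saya.', top_k=2)
```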

> Inference with PyTorch (one or multiple MASK tokens)

```py
import torch
from transformers import BigBirdTokenizerFast, BigBirdForMaskedLM
from pprint import pprint

tokenizer = BigBirdTokenizerFast.from_pretrained('ilos-vigil/bigbird-small-indonesian')
model = BigBirdForMaskedLM.from_pretrained('ilos-vigil/bigbird-small-indonesian')
topk = 5
text = 'Saya [MASK] bermain [MASK] teman saya.'

# Run the model once, then rank the vocabulary at every position.
tokenized_text = tokenizer(text, return_tensors='pt')
raw_output = model(**tokenized_text)
tokenized_output = torch.topk(raw_output.logits, topk, dim=2).indices
score_output = torch.softmax(raw_output.logits, dim=2)

# Collect the top-k candidates for each [MASK] position.
result = []
for position_idx in range(tokenized_text['input_ids'][0].shape[0]):
    if tokenized_text['input_ids'][0][position_idx] == tokenizer.mask_token_id:
        outputs = []
        for token_idx in tokenized_output[0, position_idx]:
            output = {}
            output['score'] = score_output[0, position_idx, token_idx].item()
            output['token'] = token_idx.item()
            output['token_str'] = tokenizer.decode(output['token'])
            outputs.append(output)
        result.append(outputs)

pprint(result)
```

```py
[[{'score': 0.22353802621364594, 'token': 36, 'token_str': 'dapat'},
  {'score': 0.13962049782276154, 'token': 24, 'token_str': 'tidak'},
  {'score': 0.13610956072807312, 'token': 32, 'token_str': 'juga'},
  {'score': 0.0725034773349762, 'token': 584, 'token_str': 'bermain'},
  {'score': 0.033740025013685226, 'token': 38, 'token_str': 'akan'}],
 [{'score': 0.7111291885375977, 'token': 14, 'token_str': 'dengan'},
  {'score': 0.10754624754190445, 'token': 17, 'token_str': 'untuk'},
  {'score': 0.022657711058855057, 'token': 331, 'token_str': 'bersama'},
  {'score': 0.020862115547060966, 'token': 25, 'token_str': 'sebagai'},
  {'score': 0.013086902908980846, 'token': 11, 'token_str': 'di'}]]
```

## Limitations and bias

Due to its low parameter count and case-sensitive tokenizer/model, this model is expected to have low performance on certain fine-tuned tasks. Just like any language model, this model reflects biases from its training datasets, which come from various sources. Here's an example of how the model can produce biased predictions:

```py
>>> pipe('Memasak dirumah adalah kewajiban seorang [MASK].')
[{'score': 0.16381049156188965,
  'sequence': 'Memasak dirumah adalah kewajiban seorang budak.',
  'token': 4910,
  'token_str': 'budak'},
 {'score': 0.1334381103515625,
  'sequence': 'Memasak dirumah adalah kewajiban seorang wanita.',
  'token': 649,
  'token_str': 'wanita'},
 {'score': 0.11588197946548462,
  'sequence': 'Memasak dirumah adalah kewajiban seorang lelaki.',
  'token': 6368,
  'token_str': 'lelaki'},
 {'score': 0.061377108097076416,
  'sequence': 'Memasak dirumah adalah kewajiban seorang diri.',
  'token': 258,
  'token_str': 'diri'},
 {'score': 0.04679233580827713,
  'sequence': 'Memasak dirumah adalah kewajiban seorang gadis.',
  'token': 6845,
  'token_str': 'gadis'}]
```

## Training and evaluation data

This model was pretrained with the [Indonesian Wikipedia](https://huggingface.co/datasets/wikipedia), [OSCAR](https://huggingface.co/datasets/oscar) and [id_newspapers_2018](https://huggingface.co/datasets/id_newspapers_2018) datasets.

## Training Procedure

The model was pretrained on a single RTX 3060 for 8 epochs/51474 steps with an accumulated batch size of 128. The sequence length was limited to 4096 tokens. The optimizer used was AdamW with LR 1e-4, weight decay 0.01, learning rate warmup for the first 6% of steps (~3090 steps), and linear decay of the learning rate afterwards. Due to an early configuration mistake, the first 2 epochs used LR 1e-3 instead. Additional information can be found in the Tensorboard training logs.
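For reference, those hyperparameters map roughly onto a `transformers` `TrainingArguments` object as sketched below. This is a reconstruction from the paragraph above, not the author's actual script; in particular the 8 × 16 batch/accumulation split is an assumption, since only the effective batch size of 128 is stated.

```py
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='bigbird-small-indonesian',
    num_train_epochs=8,
    per_device_train_batch_size=8,   # assumption: 8 * 16 = 128 effective batch size
    gradient_accumulation_steps=16,
    learning_rate=1e-4,              # first 2 epochs actually ran at 1e-3 by mistake
    weight_decay=0.01,
    warmup_ratio=0.06,               # ~3090 warmup steps
    lr_scheduler_type='linear',
)
```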
## Evaluation

The model achieved the following results during training evaluation.

| Epoch | Steps | Eval. loss | Eval. perplexity |
| ----- | ----- | ---------- | ---------------- |
| 1     | 6249  | 2.466      | 11.775           |
| 2     | 12858 | 2.265      | 9.631            |
| 3     | 19329 | 2.127      | 8.390            |
| 4     | 25758 | 2.116      | 8.298            |
| 5     | 32187 | 2.097      | 8.141            |
| 6     | 38616 | 2.087      | 8.061            |
| 7     | 45045 | 2.081      | 8.012            |
| 8     | 51474 | 2.078      | 7.988            |
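The perplexity column is simply the exponential of the eval loss; a quick check for the final checkpoint:

```py
import math

print(math.exp(2.078))  # ≈ 7.988, the epoch-8 perplexity above
```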
config.json
CHANGED
@@ -1,4 +1,5 @@
 {
+  "_name_or_path": "/mnt/encrypted_database/sum_nlp/checkpoint-model-bigbird-small-indonesian/checkpoint-12900-only-model",
   "architectures": [
     "BigBirdForMaskedLM"
   ],
pytorch_model.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:7bc9c9edd2ba57a1c7daf77bdd003806a0857b1515a023f137b483e9fcfc0837
 size 122558078
runs/joined_logs/events.out.tfevents.1671528643.pop-os.46984.0
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:201771a39450395ab8c900f433586bd0a487438d9c0dfdd8db0c28a45c3b2c07
+size 316301