---
language:
- id
- ms
license: apache-2.0
tags:
- g2p
- fill-mask
inference: false
---

# ID G2P BERT

ID G2P BERT is a phoneme de-masking model based on the [BERT](https://arxiv.org/abs/1810.04805) architecture. It was trained from scratch on a modified [Malay/Indonesian lexicon](https://huggingface.co/datasets/bookbot/id_word2phoneme). At inference time, the ambiguous grapheme `e` is masked out, and the model predicts whether each masked position should be the phoneme `e` or `ə` (see How to Use below).

This model was trained using the [Keras](https://keras.io/) framework. All training was done on Google Colaboratory. We adapted the [BERT Masked Language Modeling training script](https://keras.io/examples/nlp/masked_language_modeling) provided in the official Keras code examples.

## Model

| Model         | #params | Arch. | Training/Validation data |
| ------------- | ------- | ----- | ------------------------ |
| `id-g2p-bert` | 200K    | BERT  | Malay/Indonesian Lexicon |

![](./model.png)

## Training Procedure

<details>
<summary>Model Config</summary>

    vocab_size: 32
    max_len: 32
    embed_dim: 128
    num_attention_head: 2
    feed_forward_dim: 128
    num_layers: 2

</details>
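
For reference, a model with this config can be assembled from standard Keras layers roughly as follows. This is a minimal sketch modeled on the Keras MLM example linked above, not the original training script; `build_encoder` is a name we introduce here.

```py
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 32   # vocab_size
MAX_LEN = 32      # max_len
EMBED_DIM = 128   # embed_dim
NUM_HEADS = 2     # num_attention_head
FF_DIM = 128      # feed_forward_dim
NUM_LAYERS = 2    # num_layers

def build_encoder():
    inputs = layers.Input(shape=(MAX_LEN,), dtype=tf.int64)
    # token embeddings plus learned position embeddings
    x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)
    positions = tf.range(start=0, limit=MAX_LEN, delta=1)
    x = x + layers.Embedding(MAX_LEN, EMBED_DIM)(positions)
    for _ in range(NUM_LAYERS):
        # self-attention block with residual connection and layer norm
        attn = layers.MultiHeadAttention(NUM_HEADS, EMBED_DIM // NUM_HEADS)(x, x)
        x = layers.LayerNormalization(epsilon=1e-6)(x + attn)
        # position-wise feed-forward block
        ffn = layers.Dense(EMBED_DIM)(layers.Dense(FF_DIM, activation="relu")(x))
        x = layers.LayerNormalization(epsilon=1e-6)(x + ffn)
    # per-position probabilities over the 32-symbol phoneme vocabulary
    outputs = layers.Dense(VOCAB_SIZE, activation="softmax")(x)
    return keras.Model(inputs, outputs)
```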

<details>
<summary>Training Setting</summary>

    batch_size: 32
    optimizer: "adam"
    learning_rate: 0.001
    epochs: 100

</details>
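
Compiling and fitting under these settings would look roughly like the sketch below, assuming the `build_encoder` model above and a hypothetical `train_ds` dataset of masked/target phoneme sequences. The actual script instead wraps the encoder in the `MaskedLanguageModel` subclass from the Keras MLM example.

```py
model = build_encoder()
model.compile(
    # adam with learning_rate 0.001, as in the settings above
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss=keras.losses.SparseCategoricalCrossentropy(),
)
# train_ds is assumed to yield batches of 32 (masked_ids, target_ids) pairs
model.fit(train_ds, epochs=100)
```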

## How to Use

<details>
<summary>Tokenizers</summary>

    id2token = {
        0: '',
        1: '[UNK]',
        2: 'a',
        3: 'n',
        4: 'ə',
        5: 'i',
        6: 'r',
        7: 'k',
        8: 'm',
        9: 't',
        10: 'u',
        11: 'g',
        12: 's',
        13: 'b',
        14: 'p',
        15: 'l',
        16: 'd',
        17: 'o',
        18: 'e',
        19: 'h',
        20: 'c',
        21: 'y',
        22: 'j',
        23: 'w',
        24: 'f',
        25: 'v',
        26: '-',
        27: 'z',
        28: "'",
        29: 'q',
        30: '[mask]'
    }

    token2id = {
        '': 0,
        "'": 28,
        '-': 26,
        '[UNK]': 1,
        '[mask]': 30,
        'a': 2,
        'b': 13,
        'c': 20,
        'd': 16,
        'e': 18,
        'f': 24,
        'g': 11,
        'h': 19,
        'i': 5,
        'j': 22,
        'k': 7,
        'l': 15,
        'm': 8,
        'n': 3,
        'o': 17,
        'p': 14,
        'q': 29,
        'r': 6,
        's': 12,
        't': 9,
        'u': 10,
        'v': 25,
        'w': 23,
        'y': 21,
        'z': 27,
        'ə': 4
    }

</details>
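
As a quick sanity check, a word can be round-tripped through these tables (a minimal sketch; `kəmbang` is just an illustrative phoneme string):

```py
# Encode a word into IDs and decode it back using the tables above.
word = "kəmbang"
ids = [token2id.get(c, token2id["[UNK]"]) for c in word]
assert "".join(id2token[i] for i in ids) == word
```

The full inference example below uses these same tables.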
```py
import keras
import tensorflow as tf
import numpy as np

# `MaskedLanguageModel` is the custom `keras.Model` subclass from the Keras
# Masked Language Modeling example script this model was trained with.
mlm_model = keras.models.load_model(
    "bert_mlm.h5", custom_objects={"MaskedLanguageModel": MaskedLanguageModel}
)

MAX_LEN = 32
mask_token_id = token2id["[mask]"]

def inference(sequence):
    # mask every ambiguous grapheme `e`, then tokenize and pad to MAX_LEN
    sequence = " ".join([c if c != "e" else "[mask]" for c in sequence])
    tokens = [token2id[c] for c in sequence.split()]
    pad = [token2id[""] for _ in range(MAX_LEN - len(tokens))]

    tokens = tokens + pad
    input_ids = tf.convert_to_tensor(np.array([tokens]))
    prediction = mlm_model.predict(input_ids)

    # find the indices of the masked tokens
    masked_index = np.where(input_ids == mask_token_id)
    masked_index = masked_index[1]

    # keep only the predictions at the masked positions
    mask_prediction = prediction[0][masked_index]
    predicted_ids = np.argmax(mask_prediction, axis=1)

    # replace each mask with its predicted token
    for i, idx in enumerate(masked_index):
        tokens[idx] = predicted_ids[i]

    return "".join([id2token[t] for t in tokens if t != 0])

inference("mengembangkannya")
```
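
In this example, every `e` in `mengembangkannya` is replaced with `[mask]` before tokenization, and the model fills each masked slot with its predicted phoneme, in practice choosing between `e` and `ə`. Words containing no `e` pass through unchanged.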

## Authors

ID G2P BERT was trained and evaluated by [Ananto Joyoadikusumo](https://anantoj.github.io/), [Steven Limcorn](https://stevenlimcorn.github.io/), and [Wilson Wongso](https://w11wo.github.io/). All computation and development were done on Google Colaboratory.

## Framework versions

- Keras 2.8.0
- TensorFlow 2.8.0