---
language:
- id
- ms
license: apache-2.0
tags:
- g2p
- fill-mask
inference: false
---

# ID G2P BERT

ID G2P BERT is a phoneme de-masking model for Indonesian and Malay, based on the [BERT](https://arxiv.org/abs/1810.04805) architecture. It was trained from scratch on a modified [Malay/Indonesian lexicon](https://huggingface.co/datasets/bookbot/id_word2phoneme). Given a word whose ambiguous graphemes are masked, the model predicts the phoneme behind each mask; at inference time every orthographic "e" is masked, and the model decides whether it stands for /e/ or /ə/.

This model was trained using the [Keras](https://keras.io/) framework, adapting the [BERT masked language modeling training script](https://keras.io/examples/nlp/masked_language_modeling) from the official Keras code examples. All training was done on Google Colaboratory.

## Model

| Model         | #params | Arch. | Training/Validation data |
| ------------- | ------- | ----- | ------------------------ |
| `id-g2p-bert` | 200K    | BERT  | Malay/Indonesian Lexicon |

![Model plot](./model.png)

## Training Procedure

<details>
<summary>Model Config</summary>

    vocab_size: 32
    max_len: 32
    embed_dim: 128
    num_attention_head: 2
    feed_forward_dim: 128
    num_layers: 2

</details>

<details>
<summary>Training Setting</summary>

    batch_size: 32
    optimizer: "adam"
    learning_rate: 0.001
    epochs: 100

</details>
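
The original training script is not reproduced here, but the configuration above maps closely onto the model in the Keras masked language modeling example. The following is a minimal sketch of such a model under those assumptions; the layer layout and names are illustrative, not the exact training code:

```py
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Values from "Model Config" above.
VOCAB_SIZE, MAX_LEN, EMBED_DIM = 32, 32, 128
NUM_HEADS, FF_DIM, NUM_LAYERS = 2, 128, 2

def build_masked_lm() -> keras.Model:
    inputs = layers.Input((MAX_LEN,), dtype=tf.int64)

    # Token embeddings plus learned position embeddings,
    # as in the Keras masked language modeling example.
    x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)
    x = x + layers.Embedding(MAX_LEN, EMBED_DIM)(tf.range(MAX_LEN))

    # Stack of post-norm Transformer encoder blocks.
    for _ in range(NUM_LAYERS):
        attn = layers.MultiHeadAttention(NUM_HEADS, EMBED_DIM // NUM_HEADS)(x, x)
        x = layers.LayerNormalization(epsilon=1e-6)(x + attn)
        ffn = layers.Dense(FF_DIM, activation="relu")(x)
        ffn = layers.Dense(EMBED_DIM)(ffn)
        x = layers.LayerNormalization(epsilon=1e-6)(x + ffn)

    # Per-position distribution over the 32-token phoneme vocabulary.
    outputs = layers.Dense(VOCAB_SIZE, activation="softmax")(x)
    return keras.Model(inputs, outputs)

model = build_masked_lm()
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss="sparse_categorical_crossentropy",
)
```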

## How to Use

<details>
<summary>Tokenizers</summary>

    id2token = {
        0: '',
        1: '[UNK]',
        2: 'a',
        3: 'n',
        4: 'ə',
        5: 'i',
        6: 'r',
        7: 'k',
        8: 'm',
        9: 't',
        10: 'u',
        11: 'g',
        12: 's',
        13: 'b',
        14: 'p',
        15: 'l',
        16: 'd',
        17: 'o',
        18: 'e',
        19: 'h',
        20: 'c',
        21: 'y',
        22: 'j',
        23: 'w',
        24: 'f',
        25: 'v',
        26: '-',
        27: 'z',
        28: "'",
        29: 'q',
        30: '[mask]'
    }

    token2id = {
        '': 0,
        "'": 28,
        '-': 26,
        '[UNK]': 1,
        '[mask]': 30,
        'a': 2,
        'b': 13,
        'c': 20,
        'd': 16,
        'e': 18,
        'f': 24,
        'g': 11,
        'h': 19,
        'i': 5,
        'j': 22,
        'k': 7,
        'l': 15,
        'm': 8,
        'n': 3,
        'o': 17,
        'p': 14,
        'q': 29,
        'r': 6,
        's': 12,
        't': 9,
        'u': 10,
        'v': 25,
        'w': 23,
        'y': 21,
        'z': 27,
        'ə': 4
    }

</details>
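
Since `token2id` is simply the inverse mapping of `id2token`, it can also be derived programmatically instead of being maintained by hand:

```py
# invert id2token to obtain token2id
token2id = {token: idx for idx, token in id2token.items()}
```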

The snippet below loads the exported model and de-masks a single word. It assumes the `id2token`/`token2id` dictionaries above are in scope, and that `MaskedLanguageModel` (the custom model class from the adapted Keras training script) has been defined or imported.

```py
import keras
import tensorflow as tf
import numpy as np

# `MaskedLanguageModel` is the custom class from the adapted
# Keras masked language modeling training script.
mlm_model = keras.models.load_model(
    "bert_mlm.h5", custom_objects={"MaskedLanguageModel": MaskedLanguageModel}
)

MAX_LEN = 32
mask_token_id = token2id["[mask]"]

def inference(sequence):
    # mask every orthographic "e", then pad with empty tokens up to
    # MAX_LEN (assumes the word has at most MAX_LEN characters)
    sequence = " ".join([c if c != "e" else "[mask]" for c in sequence])
    tokens = [token2id[c] for c in sequence.split()]
    pad = [token2id[""] for _ in range(MAX_LEN - len(tokens))]

    tokens = tokens + pad
    input_ids = tf.convert_to_tensor(np.array([tokens]))
    prediction = mlm_model.predict(input_ids)

    # find the indices of the masked tokens
    masked_index = np.where(input_ids == mask_token_id)
    masked_index = masked_index[1]

    # keep only the predictions at the masked indices
    mask_prediction = prediction[0][masked_index]
    predicted_ids = np.argmax(mask_prediction, axis=1)

    # replace each mask with its predicted token
    for i, idx in enumerate(masked_index):
        tokens[idx] = predicted_ids[i]

    return "".join([id2token[t] for t in tokens if t != 0])

inference("mengembangkannya")
```
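
To make the de-masking concrete, the preprocessing step inside `inference` rewrites the word as follows before prediction; the model then fills each `[mask]`, typically with either `e` or `ə`:

```py
" ".join(c if c != "e" else "[mask]" for c in "mengembangkannya")
# -> 'm [mask] n g [mask] m b a n g k a n n y a'
```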

## Authors

ID G2P BERT was trained and evaluated by [Ananto Joyoadikusumo](https://anantoj.github.io/), [Steven Limcorn](https://stevenlimcorn.github.io/), and [Wilson Wongso](https://w11wo.github.io/). All computation and development were done on Google Colaboratory.

## Framework versions

- Keras 2.8.0
- TensorFlow 2.8.0