metadata
language:
  - id
  - ms
license: apache-2.0
tags:
  - g2p
  - fill-mask
inference: false

ID G2P BERT

ID G2P BERT is a phoneme de-masking model based on the BERT architecture. This model was trained from scratch on a modified Malay/Indonesian lexicon.

This model was trained using the Keras framework. All training was done on Google Colaboratory. We adapted the BERT masked language modeling training script from the official Keras code examples.

Model

| Model       | #params | Arch. | Training/Validation data |
| ----------- | ------- | ----- | ------------------------ |
| id-g2p-bert | 200K    | BERT  | Malay/Indonesian Lexicon |

Training Procedure

Model Config

vocab_size: 32
max_len: 32
embed_dim: 128
num_attention_head: 2
feed_forward_dim: 128
num_layers: 2

Training Setting

batch_size: 32
optimizer: "adam"
learning_rate: 0.001
epochs: 100
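
For readers who want to reproduce the setup in code, the values above could be collected in a small configuration object. This is only a minimal sketch: the class name `G2PBertConfig` is hypothetical and not part of the released code, and only the values themselves come from the lists above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class G2PBertConfig:
    """Hypothetical container; values are copied from the lists above."""

    # Model config
    vocab_size: int = 32
    max_len: int = 32
    embed_dim: int = 128
    num_attention_head: int = 2
    feed_forward_dim: int = 128
    num_layers: int = 2
    # Training setting
    batch_size: int = 32
    optimizer: str = "adam"
    learning_rate: float = 0.001
    epochs: int = 100


config = G2PBertConfig()
print(asdict(config))
```

Freezing the dataclass keeps the hyperparameters from being changed by accident once training has started.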

How to Use

Tokenizers
id2token = {
    0: '',
    1: '[UNK]',
    2: 'a',
    3: 'n',
    4: 'ə',
    5: 'i',
    6: 'r',
    7: 'k',
    8: 'm',
    9: 't',
    10: 'u',
    11: 'g',
    12: 's',
    13: 'b',
    14: 'p',
    15: 'l',
    16: 'd',
    17: 'o',
    18: 'e',
    19: 'h',
    20: 'c',
    21: 'y',
    22: 'j',
    23: 'w',
    24: 'f',
    25: 'v',
    26: '-',
    27: 'z',
    28: "'",
    29: 'q',
    30: '[mask]'
}

token2id = {
    '': 0,
    "'": 28,
    '-': 26,
    '[UNK]': 1,
    '[mask]': 30,
    'a': 2,
    'b': 13,
    'c': 20,
    'd': 16,
    'e': 18,
    'f': 24,
    'g': 11,
    'h': 19,
    'i': 5,
    'j': 22,
    'k': 7,
    'l': 15,
    'm': 8,
    'n': 3,
    'o': 17,
    'p': 14,
    'q': 29,
    'r': 6,
    's': 12,
    't': 9,
    'u': 10,
    'v': 25,
    'w': 23,
    'y': 21,
    'z': 27,
    'ə': 4
}
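
Because `token2id` is simply the inverse of `id2token`, both tables can be derived from one ordered symbol list, which keeps them from drifting apart. The sketch below does that and adds an `encode` helper that masks every `e` and pads with id 0 up to `MAX_LEN`. The names `SYMBOLS` and `encode` are hypothetical, and the fallback to `[UNK]` for characters outside the vocabulary is an assumption rather than documented behaviour.

```python
# Symbols in id order, as listed in the tables above.
SYMBOLS = [
    "", "[UNK]", "a", "n", "ə", "i", "r", "k", "m", "t", "u",
    "g", "s", "b", "p", "l", "d", "o", "e", "h", "c", "y",
    "j", "w", "f", "v", "-", "z", "'", "q", "[mask]",
]

id2token = dict(enumerate(SYMBOLS))
token2id = {token: idx for idx, token in id2token.items()}

MAX_LEN = 32


def encode(word, max_len=MAX_LEN):
    """Hypothetical helper: mask every 'e', map characters to ids, pad with 0."""
    if len(word) > max_len:
        raise ValueError(f"word is longer than {max_len} characters")
    ids = [
        token2id["[mask]"] if c == "e" else token2id.get(c, token2id["[UNK]"])
        for c in word
    ]
    return ids + [token2id[""]] * (max_len - len(ids))


print(encode("besar")[:6])  # [13, 30, 12, 2, 6, 0]
```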
import keras
import numpy as np
import tensorflow as tf

# MaskedLanguageModel is the custom model class defined in the training script
# (as in the Keras MLM code example); it must be defined or imported before
# the checkpoint can be loaded.
mlm_model = keras.models.load_model(
    "bert_mlm.h5", custom_objects={"MaskedLanguageModel": MaskedLanguageModel}
)

MAX_LEN = 32
mask_token_id = token2id["[mask]"]

def inference(sequence):
    # mask every "e" so the model predicts the phoneme at that position
    sequence = " ".join([c if c != "e" else "[mask]" for c in sequence])
    tokens = [token2id[c] for c in sequence.split()]
    pad = [token2id[""] for _ in range(MAX_LEN - len(tokens))]

    tokens = tokens + pad
    input_ids = np.array([tokens])
    prediction = mlm_model.predict(tf.convert_to_tensor(input_ids))

    # find the positions of the masked tokens
    masked_index = np.where(input_ids == mask_token_id)[1]

    # keep the predictions at the masked positions only
    mask_prediction = prediction[0][masked_index]
    predicted_ids = np.argmax(mask_prediction, axis=1)

    # replace each mask with its predicted token
    for i, idx in enumerate(masked_index):
        tokens[idx] = int(predicted_ids[i])

    # drop padding (id 0) and join the remaining tokens
    return "".join([id2token[t] for t in tokens if t != 0])

inference("mengembangkannya")
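
The mask-filling step inside `inference` can be illustrated without loading the model. The sketch below is pure Python: `fill_masks` is a hypothetical helper, and the score rows are hand-made placeholders rather than real model output. It picks the highest-scoring id at each masked position and writes it back into the sequence, which is what the function above does with `np.argmax`.

```python
MASK_ID = 30     # id of "[mask]" in the vocabulary above
VOCAB_SIZE = 31  # ids 0..30

# Only the ids needed for this example, copied from the tables above.
id2token = {0: "", 2: "a", 4: "ə", 6: "r", 12: "s", 13: "b", 18: "e"}


def fill_masks(token_ids, scores):
    """Replace each MASK_ID with the argmax of that position's score row."""
    filled = list(token_ids)
    for pos, tok in enumerate(token_ids):
        if tok == MASK_ID:
            row = scores[pos]
            filled[pos] = max(range(len(row)), key=row.__getitem__)
    return filled


# "b [mask] s a r" followed by one padding id
token_ids = [13, MASK_ID, 12, 2, 6, 0]

# Placeholder scores (NOT model output): all zeros, except that the masked
# position favours id 4 ("ə").
scores = [[0.0] * VOCAB_SIZE for _ in token_ids]
scores[1][4] = 1.0

filled = fill_masks(token_ids, scores)
print("".join(id2token[t] for t in filled if t != 0))  # bəsar
```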

Authors

ID G2P BERT was trained and evaluated by Ananto Joyoadikusumo, Steven Limcorn, and Wilson Wongso. All computation and development were done on Google Colaboratory.

Framework versions

  • Keras 2.8.0
  • TensorFlow 2.8.0