Better Tokenizer #5
opened by Ehsanjahanbakhsh
The code used to create the tokenizer adds 100 extra sentinel tokens by default. On top of that, T5Tokenizer's default padding token is "<pad>", which doesn't exist in the SentencePiece model, so it gets added as yet another extra token; the resulting vocabulary mismatch makes batch inference impossible.
from transformers import T5Tokenizer
tokenizer = T5Tokenizer('256k_vocab/spm.model', legacy=False)
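For reference, a minimal sketch (assuming the sentencepiece Python package and the 256k SentencePiece model from the snippet above are available locally) that surfaces the size mismatch:

import sentencepiece as spm
from transformers import T5Tokenizer

# Vocabulary size of the raw SentencePiece model.
sp = spm.SentencePieceProcessor(model_file='256k_vocab/spm.model')

# Default T5Tokenizer settings append 100 <extra_id_*> sentinels plus '<pad>'.
tokenizer = T5Tokenizer('256k_vocab/spm.model', legacy=False)

# The tokenizer reports more ids than the SentencePiece model actually provides.
print(sp.get_piece_size(), len(tokenizer))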
Using the code below instead makes it work fine:
tokenizer = T5Tokenizer('vocabulary_256k_vocab_spm.model', extra_ids=0, pad_token='<s>', legacy=False)
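As a quick sanity check, a sketch (assuming '<s>' is a piece that already exists in the 256k SentencePiece vocabulary, as the fix implies, and that PyTorch is installed for return_tensors='pt') showing that padded batches now work:

from transformers import T5Tokenizer

# Reuse the existing '<s>' piece as the pad token and add no sentinel ids,
# so the tokenizer's vocabulary matches the SentencePiece model exactly.
tokenizer = T5Tokenizer('vocabulary_256k_vocab_spm.model', extra_ids=0, pad_token='<s>', legacy=False)

# Padding a batch now stays within the SentencePiece vocabulary.
batch = tokenizer(["a short input", "a somewhat longer input sentence"],
                  padding=True, return_tensors="pt")
print(batch["input_ids"].shape)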
jbochi changed discussion status to closed