Better Tokenizer #5
opened by Ehsanjahanbakhsh
The code used to create the tokenizer adds 100 extra sentinel tokens by default. On top of that, T5Tokenizer's default padding token is "<pad>", which doesn't exist in the SentencePiece model, so it gets added as yet another extra token; the resulting vocabulary mismatch makes batch inference impossible.
from transformers import T5Tokenizer
tokenizer = T5Tokenizer('256k_vocab/spm.model', legacy=False)
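For reference, a minimal sketch (assuming the sentencepiece Python package and the 256k SentencePiece model from the snippet above are available locally) that surfaces the size mismatch:

import sentencepiece as spm
from transformers import T5Tokenizer

# Vocabulary size of the raw SentencePiece model.
sp = spm.SentencePieceProcessor(model_file='256k_vocab/spm.model')

# Default T5Tokenizer settings append 100 <extra_id_*> sentinels plus '<pad>'.
tokenizer = T5Tokenizer('256k_vocab/spm.model', legacy=False)

# The tokenizer reports more ids than the SentencePiece model actually provides.
print(sp.get_piece_size(), len(tokenizer))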
Using the code below instead makes it work fine:
tokenizer = T5Tokenizer('vocabulary_256k_vocab_spm.model', extra_ids=0, pad_token='<s>', legacy=False)
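As a quick sanity check, a sketch (assuming '<s>' is a piece that already exists in the 256k SentencePiece vocabulary, as the fix implies, and that PyTorch is installed for return_tensors='pt') showing that padded batches now work:

from transformers import T5Tokenizer

# Reuse the existing '<s>' piece as the pad token and add no sentinel ids,
# so the tokenizer's vocabulary matches the SentencePiece model exactly.
tokenizer = T5Tokenizer('vocabulary_256k_vocab_spm.model', extra_ids=0, pad_token='<s>', legacy=False)

# Padding a batch now stays within the SentencePiece vocabulary.
batch = tokenizer(["a short input", "a somewhat longer input sentence"],
                  padding=True, return_tensors="pt")
print(batch["input_ids"].shape)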
jbochi changed discussion status to closed