Issue with the tokenizer for French texts?

#11
by krogoldAI - opened

I observed some strange behavior of the tokenizer when dealing with French texts. In particular, unlike previous models, it seems to consistently remove the spaces before "!" and "?", e.g.

tokenizer.decode(tokenizer.encode("Ah ? Eh bien !"))

becomes

"Ah? Eh bien!" 

(i.e. it defaults to English punctuation rules, which differ from the French ones). I understand this might seem unimportant to some, but it does matter for my use case.

I can work around this by adding two spaces instead of one, but that does not feel like an elegant solution. Is there something I'm missing or doing wrong?

For context, I am using the transformers library (with the aim of fine-tuning the model).
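
For completeness, here is a self-contained repro. The model id below is an assumption on my part; substitute whichever checkpoint this discussion is attached to:

from transformers import AutoTokenizer

# Assumed checkpoint; adjust to the model you are actually using.
tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-Nemo-Instruct-2407')
tokenizer.decode(tokenizer.encode("Ah ? Eh bien !"))
# <s>Ah? Eh bien!
# (the spaces before "?" and "!" are lost on the round trip)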

Can confirm this fixes it :)

from transformers import AutoTokenizer

# Load the patched tokenizer from the open PR revision
tokenizer = AutoTokenizer.from_pretrained('Xenova/Mistral-Nemo-Instruct-Tokenizer', revision='refs/pr/2')
tokenizer.decode(tokenizer.encode("Ah ? Eh bien !"))
# <s>Ah ? Eh bien !
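
If you prefer to keep the original tokenizer, my understanding is that the stripped spaces come from the decode-time cleanup step in transformers: the clean_up_tokenization_spaces option removes the space before punctuation like "?" and "!". Assuming that is indeed the cause here, disabling it per call should give the same result (the model id below is again an assumption, use your own):

from transformers import AutoTokenizer

# Assumed original checkpoint; replace with the model you are using.
tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-Nemo-Instruct-2407')
ids = tokenizer.encode("Ah ? Eh bien !")
# Skip the post-decode cleanup that deletes spaces before "?" and "!"
tokenizer.decode(ids, clean_up_tokenization_spaces=False)
# <s>Ah ? Eh bien !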

Great, many thanks!
Will the tokenizer in the official repo change accordingly? In the meantime, I'll use yours :)

Just got merged! :) You can now access it normally.
