Issue with the tokenizer for French texts?

#11
by krogoldAI - opened

I observed some strange behavior of the tokenizer when dealing with French texts. In particular, unlike previous models, it seems to consistently remove the spaces before "!" and "?", e.g.

tokenizer.decode(tokenizer.encode("Ah ? Eh bien !"))

becomes

"Ah? Eh bien!" 

(i.e. it defaults to English punctuation rules, which differ from the French ones). I understand this might seem unimportant to some, but it does matter for my use case.

I can work around this by adding two spaces instead of one, but that does not feel like an elegant solution. Is there something I'm missing or doing wrong?

For context, I am using the transformers library (with the aim of fine-tuning the model).
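
For completeness, here is a self-contained repro. The model id below is an assumption on my part; substitute whichever checkpoint this discussion is attached to:

from transformers import AutoTokenizer

# Assumed checkpoint; adjust to the model you are actually using.
tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-Nemo-Instruct-2407')
tokenizer.decode(tokenizer.encode("Ah ? Eh bien !"))
# <s>Ah? Eh bien!
# (the spaces before "?" and "!" are lost on the round trip)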

Can confirm this fixes it :)

from transformers import AutoTokenizer

# Load the patched tokenizer from the open PR revision
tokenizer = AutoTokenizer.from_pretrained('Xenova/Mistral-Nemo-Instruct-Tokenizer', revision='refs/pr/2')
tokenizer.decode(tokenizer.encode("Ah ? Eh bien !"))
# <s>Ah ? Eh bien !
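
If you prefer to keep the original tokenizer, my understanding is that the stripped spaces come from the decode-time cleanup step in transformers: the clean_up_tokenization_spaces option removes the space before punctuation like "?" and "!". Assuming that is indeed the cause here, disabling it per call should give the same result (the model id below is again an assumption, use your own):

from transformers import AutoTokenizer

# Assumed original checkpoint; replace with the model you are using.
tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-Nemo-Instruct-2407')
ids = tokenizer.encode("Ah ? Eh bien !")
# Skip the post-decode cleanup that deletes spaces before "?" and "!"
tokenizer.decode(ids, clean_up_tokenization_spaces=False)
# <s>Ah ? Eh bien !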

Great, many thanks!
Will the tokenizer in the official repo change accordingly? In the meantime, I'll use yours :)

Just got merged! :) You can now access it normally.
