Issue with tokenizer #1
by marinone94 - opened
Hi,
I was trying out this model and it seems there is an issue with the tokenizer: it replaces Swedish characters with their unaccented English counterparts (ä --> a, å --> a, ö --> o). This looks odd to me, since the vocab file contains words that include those Swedish characters.
Examples:
Input: försändelse från utlandet
Decoded: [CLS] forsandelse fran utlandet [SEP]
Input: Örebro är en fin stad
Decoded: [CLS] orebro ar en fin stad [SEP]
To reproduce:
from transformers import AutoTokenizer
examples = ["försändelse från utlandet", "Örebro är en fin stad"]
tokenizer = AutoTokenizer.from_pretrained("af-ai-center/bert-base-swedish-uncased")
enc = tokenizer(examples)
dec = tokenizer.batch_decode(enc["input_ids"])
for input_example, decoded_example in zip(examples, dec):
    print("Input:  ", input_example)
    print("Decoded:", decoded_example)
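My assumption about the cause: BERT-style tokenizers with lowercasing enabled typically strip accents by NFD-normalizing the text and dropping combining marks, which would produce exactly the substitutions above. A minimal sketch of that behavior in plain Python (not this model's actual code):

```python
import unicodedata

def strip_accents(text: str) -> str:
    # Decompose accented characters (NFD), then drop combining marks
    # (Unicode category "Mn") -- mirroring what BERT's basic tokenizer
    # does when accent stripping is enabled.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents("försändelse från utlandet"))        # forsandelse fran utlandet
print(strip_accents("Örebro är en fin stad".lower()))    # orebro ar en fin stad
```

If that is indeed the cause, passing `strip_accents=False` to `AutoTokenizer.from_pretrained` might preserve the Swedish characters, though I haven't verified this against this particular checkpoint.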
Env:
- `transformers` version: 4.18.0
- Platform: macOS-10.16-x86_64-i386-64bit
- Python version: 3.8.6
- Huggingface_hub version: 0.5.1
- PyTorch version (GPU?): 1.9.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no