no tokenizer.json
There is indeed no tokenizer.json in this repo, and there is none in the repo for the base model I tuned this off of either. While I haven't personally used this unquantized version for inference, I know for a fact it is (or at the very least, was) possible to quantize it.
Which stack are you using that insists such a file is required?
I don't have any plans to quantize or finetune the model yet, at least. But the platform where I want to host the model requires the tokenizer.json in order to run it, which is why I need the file.
Thank you!
Could you add it, by any chance?
Since it's missing from the base model as well, it will take some effort to track down the proper file. I'll try to get this done during the coming week, though.
I've been through this recently when finetuning WizardLM.
The tokenizer.json is missing because the base model uses the slow tokenizer, which ships as three separate files (typically tokenizer.model, tokenizer_config.json, and special_tokens_map.json) instead of a single tokenizer.json.
You can build the fast tokenizer for your inference engine like this:
```python
from transformers import AutoTokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer
import os

# Load the slow (SentencePiece-based) tokenizer
BASE_MODEL = "rAIfle/SorcererLM-8x22b-bf16"
slow_tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, use_fast=False)

# Convert to a fast tokenizer (returns a raw tokenizers.Tokenizer object,
# not a transformers wrapper)
fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)

# Create the output directories if they don't exist
os.makedirs("fast_tokenizer", exist_ok=True)
os.makedirs("slow_tokenizer", exist_ok=True)

# Save the fast tokenizer as tokenizer.json
fast_tokenizer.save("fast_tokenizer/tokenizer.json")

# You can also save the other necessary files from the slow tokenizer
slow_tokenizer.save_pretrained("slow_tokenizer")
```
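If you want to sanity-check the generated file the way a fast-tokenizer inference stack would load it, something like this should work (a minimal sketch; the special-token names are my assumption based on the usual Mistral/Llama SentencePiece convention, not values read from the repo):

```python
from transformers import PreTrainedTokenizerFast

# Wrap the generated tokenizer.json in a transformers fast tokenizer.
# The special tokens below are assumed, not taken from the repo config.
wrapped = PreTrainedTokenizerFast(
    tokenizer_file="fast_tokenizer/tokenizer.json",
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
)
print(wrapped.encode("The quick brown fox jumps over the lazy dog."))
```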
However, I don't recommend uploading it into the repo, since a lot of tools expect WizardLM2 MoE-based models to use the slow tokenizer, which behaves differently
(padding and special tokens are handled differently, an exl2 quant would fail if you put the tokenizer.json in the repo, etc.)
Test:
```python
test_text = "The quick brown fox jumps over the lazy dog."

# Test the slow tokenizer (a transformers tokenizer: encode() returns ids)
slow_tokens = slow_tokenizer.encode(test_text)
print("Slow tokenizer:")
print("- IDs:", slow_tokens[:10])
print("- Tokens:", slow_tokenizer.convert_ids_to_tokens(slow_tokens[:10]))
print("- Decoded:", slow_tokenizer.decode(slow_tokens))

# Test the fast tokenizer (a raw tokenizers.Tokenizer: encode() returns an
# Encoding object; decode_batch() returns decoded text, not token strings)
fast_encoding = fast_tokenizer.encode(test_text)
print("\nFast tokenizer:")
print("- IDs:", fast_encoding.ids[:10])
print("- Tokens:", fast_tokenizer.decode_batch([fast_encoding.ids[:10]]))
print("- Decoded:", fast_tokenizer.decode(fast_encoding.ids))
```
Output:
```
Slow tokenizer:
- IDs: [1, 415, 2936, 9060, 285, 1142, 461, 10575, 754, 272]
- Tokens: ['<s>', '▁The', '▁quick', '▁brown', '▁f', 'ox', '▁j', 'umps', '▁over', '▁the']
- Decoded: <s>The quick brown fox jumps over the lazy dog.

Fast tokenizer:
- IDs: [415, 2936, 9060, 285, 1142, 461, 10575, 754, 272, 17898]
- Tokens: ['The quick brown fox jumps over the lazy']
- Decoded: The quick brown fox jumps over the lazy dog.
```
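Note the difference above: the slow tokenizer prepends the BOS token (`<s>`, id 1), while the converted fast tokenizer here evidently has no post-processor doing that (its IDs lack the leading 1). If your platform needs matching behaviour, a sketch like this can attach one (assuming the standard `<s>` token with id 1, as in the Mistral/Llama vocab):

```python
from tokenizers.processors import TemplateProcessing

# Assumption: BOS is "<s>" with id 1. Attach a post-processor so the fast
# tokenizer also prepends BOS on encode, matching the slow tokenizer.
fast_tokenizer.post_processor = TemplateProcessing(
    single="<s> $A",
    special_tokens=[("<s>", 1)],
)
print(fast_tokenizer.encode(test_text).ids[:3])  # now starts with 1
```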
@Eruuu Here's the fast_tokenizer version I created using the Python code above:
https://huggingface.co/gghfez/SorcererLM-8x22b-fast_tokenizer/blob/main/tokenizer.json
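If it helps, you can also pull just that file programmatically (a sketch using huggingface_hub; the repo id is the one linked above):

```python
from huggingface_hub import hf_hub_download

# Download only the standalone tokenizer.json from the companion repo.
path = hf_hub_download(
    repo_id="gghfez/SorcererLM-8x22b-fast_tokenizer",
    filename="tokenizer.json",
)
print(path)  # local cache path to tokenizer.json
```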
rAIfle - no problem, thanks for the model! I've wanted something like this for a while but don't have the compute to train it.
Eruuu - no worries, happy to help.