Generation does not terminate on the eos type used in prompting
Hello again,
I failed to reproduce the TowerInstruct generation example shown on the model page. While the example output terminates after generating a single sentence, I could not find a simple way to get the model to do the same. I suspect the cause is a mismatch between the model's generation_config (which specifies eos_token_id=2) and the token the model actually uses as an end-of-sequence marker ("<|im_end|>", token_id=32005). Since they don't match, generation does not stop when it reaches <|im_end|>, so it keeps generating until it hits the max length.
Overriding the default generation config would probably solve this issue (I can't test because I'm waiting for a free GPU), but this seems like a slightly clunky fix. Any idea what we should do about it?
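To illustrate the failure mode described above, here is a toy sketch (plain Python, no model or transformers involved; the token stream is made up) of why the mismatch matters: a greedy decode loop only stops early when the id it compares against is one the model actually emits. The ids 2 and 32005 are the ones from the report above.

```python
def decode(token_stream, eos_token_id, max_new_tokens):
    """Minimal stand-in for a generation loop: collect tokens until
    the configured eos id appears or the length budget runs out."""
    out = []
    for tok in token_stream:
        if len(out) >= max_new_tokens:
            break
        out.append(tok)
        if tok == eos_token_id:
            break
    return out

# Hypothetical stream: the model emits "<|im_end|>" (id 32005) after the
# first sentence, then keeps producing tokens because nothing stopped it.
stream = [100, 101, 32005] + [200] * 50

# generation_config says eos_token_id=2, which never appears in the
# stream: the loop runs all the way to max_new_tokens.
print(len(decode(stream, eos_token_id=2, max_new_tokens=20)))      # 20

# With the real end-of-sequence id, generation stops after the sentence.
print(len(decode(stream, eos_token_id=32005, max_new_tokens=20)))  # 3
```

On the transformers side, the "clunky fix" mentioned above would amount to passing eos_token_id explicitly to model.generate, or editing generation_config.json itself.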
Hi,
Thank you for noticing!
I have fixed the generation config and tested it. It should work as expected.
Can you check on your side if it works now (you may need to redownload it)?
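For anyone checking locally: after redownloading, the eos field in the model's generation_config.json should now point at the <|im_end|> id rather than 2. A sketch of what the corrected fragment would look like (field name as in transformers' GenerationConfig; the rest of the file omitted):

```json
{
  "eos_token_id": 32005
}
```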
Hey Duarte and bpop, thanks for the comments. I am traveling, but I will share my findings with you in three days. I will download the updated generation config and test it.
Trying it as is, I got "torch.cuda.OutOfMemoryError: CUDA out of memory", so I decided to quantize it in my Docker container:
tgi-towerinstruct-gpu:
  image: ghcr.io/huggingface/text-generation-inference:1.4
  command: --model-id Unbabel/TowerInstruct-7B-v0.1 --quantize eetq --num-shard 1 --max-batch-prefill-tokens 512 --max-input-length 512
  volumes:
    - ./models:/data
  ports:
    - 8102:80
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [ gpu ]
but the quantization process returns:
2024-01-28T00:11:40.004314Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-01-28T00:11:49.839464Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-01-28T00:11:49.925051Z INFO shard-manager: text_generation_launcher: Shard ready in 811.549244253s rank=0
2024-01-28T00:11:50.021987Z INFO text_generation_launcher: Starting Webserver
2024-01-28T00:11:50.053023Z INFO text_generation_router: router/src/main.rs:181: Using the Hugging Face API
2024-01-28T00:11:50.053123Z INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-01-28T00:11:50.237200Z WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.14.1/src/tokenizer/serialization.rs:159: Warning: Token '' was expected to have ID '32000' but was given ID 'None'
2024-01-28T00:11:50.237383Z WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.14.1/src/tokenizer/serialization.rs:159: Warning: Token '' was expected to have ID '32001' but was given ID 'None'
2024-01-28T00:11:50.237391Z WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.14.1/src/tokenizer/serialization.rs:159: Warning: Token '' was expected to have ID '32002' but was given ID 'None'
2024-01-28T00:11:50.237398Z WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.14.1/src/tokenizer/serialization.rs:159: Warning: Token '