Is `--speculate` needed for Medusa in TGI?
Hi @Narsil, thanks for releasing this! The TGI 2.0 release notes say to test this model with:
```shell
model=text-generation-inference/commandrplus-medusa
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --speculate 3 --num-shard 4
```
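(Once the container is up, I'm testing it with a standard TGI generate request along these lines; the prompt and `max_new_tokens` value are arbitrary:)

```shell
# Query the server launched above (port 8080 comes from the -p mapping)
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":50}}' \
    -H 'Content-Type: application/json'
```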
But the HF documentation about Medusa says:

> In order to use medusa models in TGI, simply point to a medusa enabled model, and everything will load automatically.
And it specifically says that `--speculate` is for enabling n-gram speculation:

> In order to enable n-gram speculation simply use `--speculate 2` in your flags.
So I'm wondering if the meaning of the `--speculate` flag has changed? I'm guessing the docs just need to be updated.
(Also, as an aside, it'd be great if the model card had the exact commands to reproduce this medusa conversion of the original model.)
No, the docs are correct.

Without `--speculate`, pointing at the medusa model is enough: everything is loaded from the model's configuration and works (the logs show what's happening).

With `--speculate`, TGI respects that flag. If the model is a medusa model, it limits the medusa heads to whatever you specified; if it's a regular model, it uses what it can, which is n-gram speculation (usually much worse on average than medusa, but sometimes still beneficial depending on the model and the prompts).
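Concretely, a sketch reusing the image and flags from the command above (the regular model ID is a placeholder):

```shell
# Medusa model, no --speculate: the number of medusa heads is read from the model's config
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.0 \
    --model-id text-generation-inference/commandrplus-medusa --num-shard 4

# Medusa model with --speculate: caps the medusa heads at the given value
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.0 \
    --model-id text-generation-inference/commandrplus-medusa --speculate 2 --num-shard 4

# Regular (non-medusa) model with --speculate: falls back to n-gram speculation
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.0 \
    --model-id <any-regular-model> --speculate 2
```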
Thanks!