TGI on Gaudi
Text Generation Inference (TGI) on the Intel® Gaudi® AI Accelerator is supported via the Intel® Gaudi® TGI repository. You can start a TGI service on a Gaudi system simply by pulling a TGI Gaudi Docker image and launching a local service instance.
For example, a TGI service serving the Llama 2 7B model on Gaudi can be started with:
docker run \
    -p 8080:80 \
    -v $PWD/data:/data \
    --runtime=habana \
    -e HABANA_VISIBLE_DEVICES=all \
    -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
    --cap-add=sys_nice \
    --ipc=host \
    ghcr.io/huggingface/tgi-gaudi:2.0.1 \
    --model-id meta-llama/Llama-2-7b-hf \
    --max-input-tokens 1024 \
    --max-total-tokens 2048
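Note that meta-llama/Llama-2-7b-hf is a gated model, so the container needs a Hugging Face access token to download the weights. A minimal sketch of the extra flag, assuming your token is exported as HF_TOKEN in your shell (older TGI releases read HUGGING_FACE_HUB_TOKEN instead):

# export your Hugging Face access token, then add one more
# environment variable to the docker run command above:
export HF_TOKEN=<your-access-token>
    -e HF_TOKEN=$HF_TOKEN \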
You can then send a simple request:
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":32}}' \
    -H 'Content-Type: application/json'
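For token-by-token output, TGI also exposes a streaming route on the same server: /generate_stream returns the generation as server-sent events. A minimal sketch against the instance started above:

curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":32}}' \
    -H 'Content-Type: application/json'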
To run a static benchmark test, please refer to TGI’s benchmark tool. More examples of running service instances on single- or multi-HPU systems are available in the Intel® Gaudi® TGI repository.
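As a minimal sketch of invoking the benchmark tool: the text-generation-benchmark binary ships inside the TGI image and can be run against the already-running server. The container name tgi-gaudi below is an assumption (pass --name tgi-gaudi to docker run, or substitute the container ID from docker ps):

# open a shell inside the running TGI container (name assumed to be tgi-gaudi)
docker exec -it tgi-gaudi bash
# run the static benchmark using the served model's tokenizer
text-generation-benchmark --tokenizer-name meta-llama/Llama-2-7b-hf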