
TGI on Gaudi

Text Generation Inference (TGI) is supported on the Intel® Gaudi® AI Accelerator via the Intel® Gaudi® TGI repository (tgi-gaudi). You can start a TGI service on a Gaudi system by pulling the TGI Gaudi Docker image and launching a local service instance.

For example, a TGI service for the Llama 2 7B model can be started on Gaudi with:

docker run \
  -p 8080:80 \
  -v $PWD/data:/data \
  --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  --cap-add=sys_nice \
  --ipc=host \
  ghcr.io/huggingface/tgi-gaudi:2.0.1 \
  --model-id meta-llama/Llama-2-7b-hf \
  --max-input-tokens 1024 \
  --max-total-tokens 2048
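
To shard a model across several HPUs, the standard TGI launcher flags (--sharded, --num-shard) apply. The following is a minimal sketch for an 8-card run; the PT_HPU_ENABLE_LAZY_COLLECTIVES setting is an assumption here, as multi-card Gaudi runs typically require lazy collectives:

# Sketch of a sharded 8-HPU launch; verify flag support for your image version
docker run \
  -p 8080:80 \
  -v $PWD/data:/data \
  --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  --cap-add=sys_nice \
  --ipc=host \
  ghcr.io/huggingface/tgi-gaudi:2.0.1 \
  --model-id meta-llama/Llama-2-7b-hf \
  --sharded true \
  --num-shard 8 \
  --max-input-tokens 1024 \
  --max-total-tokens 2048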

You can then send a simple request:

curl 127.0.0.1:8080/generate \
  -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":32}}' \
  -H 'Content-Type: application/json'
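
TGI also supports token-by-token streaming over server-sent events. Assuming the standard TGI routes are unchanged in the Gaudi image, the same payload sent to the /generate_stream route returns tokens incrementally:

curl 127.0.0.1:8080/generate_stream \
  -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":32}}' \
  -H 'Content-Type: application/json'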

To run a static benchmark test, refer to TGI's benchmark tool. More examples of running service instances on single- or multi-HPU systems are available in the tgi-gaudi repository.
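
As a rough sketch, TGI's benchmark tool (text-generation-benchmark) is typically run from inside the serving container against the already-launched service; the container name below is hypothetical, and the binary's presence should be verified in the Gaudi image:

# Assumes the service container was started with --name tgi-gaudi-container
docker exec -it tgi-gaudi-container \
  text-generation-benchmark \
  --tokenizer-name meta-llama/Llama-2-7b-hf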
