Deploy in SageMaker
I got this error message when deploying the model in SageMaker using HuggingFaceModel() with the image 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.1-tgi1.4.0-gpu-py310-cu121-ubuntu20.04:
ValueError: Unsupported model type gemma
I saw the message below and would like to know the plan/ETA:
"google/gemma-7b is not yet available for Amazon SageMaker deployments.
We are working on adding support."
Yes, we are working on releasing the new TGI version 1.4.2, which will enable support.
@philschmid please keep us posted if you make any headway here!
@suryabhupa you should be able to deploy Gemma now, check the latest code snippet.
@philschmid could you please share your deploy configs?
I used the configs below:
TGI version: 1.4.2
INSTANCE_TYPE: ml.g5.2xlarge
"HF_MODEL_ID": "google/gemma-7b-it", # model_id from hf.co/models
"SM_NUM_GPUS": json.dumps(1), # Number of GPU used per replica
"HUGGING_FACE_HUB_TOKEN": os.getenv("HUGGING_FACE_HUB_TOKEN", ""), # Hugging Face token to access private models
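For context, a minimal sketch of how those settings fit into a full deployment. This assumes the sagemaker SDK, an execution role, and AWS credentials; the actual AWS calls are commented out so the config itself is reproducible:

```python
import json
import os

# TGI container configuration for google/gemma-7b-it (values from above;
# TGI version 1.4.2, instance type ml.g5.2xlarge).
config = {
    "HF_MODEL_ID": "google/gemma-7b-it",  # model_id from hf.co/models
    "SM_NUM_GPUS": json.dumps(1),         # number of GPUs used per replica
    "HUGGING_FACE_HUB_TOKEN": os.getenv("HUGGING_FACE_HUB_TOKEN", ""),  # gated-model access
}

# Deployment itself (requires an AWS account and a SageMaker execution role):
# from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
# llm_image = get_huggingface_llm_image_uri("huggingface", version="1.4.2")
# llm_model = HuggingFaceModel(role=role, image_uri=llm_image, env=config)
# llm = llm_model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")
```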
Error:
text_generation_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 4096 prefill tokens. You need to decrease --max-batch-prefill-tokens
Error: Warmup(Generation("Not enough memory to handle 4096 prefill tokens. You need to decrease --max-batch-prefill-tokens"))
2024-03-04T11:39:32.407327Z ERROR text_generation_launcher: Webserver Crashed
We updated the script to use a bigger instance. Alternatively, you can decrease the TGI configuration values.
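If staying on ml.g5.2xlarge, the token budgets can be lowered so warmup fits in GPU memory. A sketch of what that might look like; the TGI launcher reads these as environment variables (MAX_BATCH_PREFILL_TOKENS is the env-var form of --max-batch-prefill-tokens), and the specific values below are illustrative, not tuned:

```python
import json

# Hedged example: shrink TGI's token limits below the 4096 prefill tokens
# that failed during warmup in the error above.
config = {
    "HF_MODEL_ID": "google/gemma-7b-it",
    "SM_NUM_GPUS": json.dumps(1),
    "MAX_INPUT_LENGTH": json.dumps(2048),          # max prompt tokens per request
    "MAX_TOTAL_TOKENS": json.dumps(4096),          # prompt + generated tokens
    "MAX_BATCH_PREFILL_TOKENS": json.dumps(2048),  # below the failing 4096 default
}
```

Pass this dict as the `env` argument to HuggingFaceModel, as in the snippet shared earlier in the thread.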