Deploying Llama 3.1 to an NVIDIA T4 instance (SageMaker endpoints)

#80
by mleiter - opened

When I try to deploy meta-llama/Meta-Llama-3.1-8B-Instruct to a g4dn.xlarge (NVIDIA T4) SageMaker endpoint with quantization enabled, I get:

RuntimeError: FlashAttention only supports Ampere GPUs or newer.

I am NOT able to use any newer GPU due to the region I am deploying to. I see that models like the unsloth quantized variants SHOULD work, and with those I get past the flash attention error, but I have been unable to use that one for a different reason.

How can I get the FlashAttention error to go away?

import json

from sagemaker.huggingface import HuggingFaceModel

# role, llm_image (the TGI container URI) and sess are defined earlier in the notebook
number_of_gpu = 1  # g4dn.xlarge has a single T4 GPU

config = {
    "HF_MODEL_ID": "meta-llama/Meta-Llama-3.1-8B-Instruct",  # model_id from hf.co/models
    "SM_NUM_GPUS": json.dumps(number_of_gpu),  # Number of GPUs used per replica
    "MAX_INPUT_LENGTH": "4096",  # Max length of input text
    "MAX_TOTAL_TOKENS": "8192",  # Max length of the generation (including input text)
    "MAX_BATCH_TOTAL_TOKENS": "8192",  # Limits the number of tokens that can be processed in parallel during generation
    "MESSAGES_API_ENABLED": "true",  # Enable the Messages API
    "HF_MODEL_QUANTIZE": "bitsandbytes",  # [possible values: awq, eetq, exl2, gptq, marlin, bitsandbytes, bitsandbytes-nf4, bitsandbytes-fp4, fp8]
    "HUGGING_FACE_HUB_TOKEN": "test",  # replace with your Hugging Face Hub token
}

# check if token is set
assert (
    config["HUGGING_FACE_HUB_TOKEN"] != "test"
), "Please set your Hugging Face Hub token"

# create HuggingFaceModel with the image uri
# (no framework versions are passed: the TGI image_uri already pins them, and the
# SDK rejects a model that sets both tensorflow_version and pytorch_version)
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env=config,
    sagemaker_session=sess,
)
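
For completeness, deploying the model would look roughly like the sketch below. The instance type matches the g4dn.xlarge mentioned above; the health-check timeout value is an assumption to give TGI time to download and quantize the weights, not something from the original post.

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",  # single NVIDIA T4
    container_startup_health_check_timeout=900,  # assumed value: allow time to download/quantize weights
)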

+1, following this issue in case someone eventually finds a workaround

@sumanthnall if you have access to an AWS rep, ask for access to more EC2 instance types than are in GA in your region.

Alternatively, you can use a custom SageMaker image that serves the model with vLLM (instead of TGI) if you want to customize the packages.
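
A rough sketch of that approach, assuming a vLLM serving image has already been built and pushed to ECR (the image URI, the env variable names, and the container's behavior are all assumptions here, not something verified in this thread):

from sagemaker.model import Model

# Sketch: deploy a custom vLLM serving container instead of the TGI image.
# role and sess come from the notebook above; the ECR URI is a placeholder.
vllm_model = Model(
    image_uri="<account-id>.dkr.ecr.<region>.amazonaws.com/vllm-serving:latest",
    role=role,
    env={
        "MODEL_ID": "meta-llama/Meta-Llama-3.1-8B-Instruct",  # assumed env var read by the custom container
        "HUGGING_FACE_HUB_TOKEN": "<your token>",
    },
    sagemaker_session=sess,
)
vllm_model.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")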

Try adding "CUDA_GRAPHS": "0" to the config to disable CUDA graphs.

And also "USE_FLASH_ATTENTION": "false".
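
Putting both suggestions together, the environment config would look roughly like this (untested sketch; all values are passed to the container as strings, and whether the non-flash attention path works for Llama 3.1 on a T4 still depends on the TGI version baked into the image):

config = {
    "HF_MODEL_ID": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "HF_MODEL_QUANTIZE": "bitsandbytes",
    "SM_NUM_GPUS": "1",
    "MAX_INPUT_LENGTH": "4096",
    "MAX_TOTAL_TOKENS": "8192",
    "MAX_BATCH_TOTAL_TOKENS": "8192",
    "MESSAGES_API_ENABLED": "true",
    "CUDA_GRAPHS": "0",              # disable CUDA graphs, as suggested above
    "USE_FLASH_ATTENTION": "false",  # disable the flash attention path, as suggested above
    "HUGGING_FACE_HUB_TOKEN": "<your token>",
}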
