---
library_name: transformers
---
## How to run it
### Set up the environment
```
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
pip install fbgemm-gpu==0.8.0rc4
# Download the enablement fork from https://huggingface.co/sllhf/transformers_enablement_fork/tree/main
# and unzip it, then:
cd transformers
# Add the changes from this PR: https://github.com/huggingface/transformers/pull/32047
git fetch origin pull/32047/head:new-quant-method
git merge new-quant-method
pip install -e .
cd ..
# Install accelerate from source
git clone https://github.com/huggingface/accelerate.git
cd accelerate
pip install -e .
cd ..
# Install vLLM with this PR: https://github.com/vllm-project/vllm/pull/6559
git clone https://github.com/vllm-project/vllm.git
cd vllm
git fetch origin pull/6559/head:fbgemm-checkpoints
git checkout fbgemm-checkpoints
pip install -e .
```
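After the installs, a quick import check (a minimal sketch; the printed versions depend on the nightly wheels and PR branches you pulled) confirms everything resolves to the source builds:
```
import torch
import transformers
import accelerate
import vllm

# Versions will reflect the nightly / PR source builds installed above
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
print("vllm:", vllm.__version__)
print("CUDA available:", torch.cuda.is_available())
```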
### Load the quantized model with Transformers
```
from transformers import FbgemmFp8Config, AutoModelForCausalLM, AutoTokenizer

model_name = "sllhf/Meta-Llama-3.1-405B-Instruct-FP8"

# FbgemmFp8Config selects the FBGEMM FP8 backend; the checkpoint is already quantized in this format
quantization_config = FbgemmFp8Config()
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=quantization_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

# Make sure to set your own generation params (temperature, top_p, etc.)
output = quantized_model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
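Since this is an instruct-tuned checkpoint, prompts generally behave better when passed through the chat template. A minimal sketch continuing from the code above (the sampling values are illustrative placeholders, not recommendations from this card):
```
messages = [{"role": "user", "content": "What are we having for dinner?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

# do_sample=True is required for temperature / top_p to take effect
output = quantized_model.generate(
    input_ids, max_new_tokens=64, do_sample=True, temperature=0.6, top_p=0.9
)
# Decode only the newly generated tokens
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```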
### How to run it with vLLM
```
from vllm import LLM

# FP8 (FBGEMM) checkpoint support comes from the vLLM PR installed above
model = LLM("sllhf/Meta-Llama-3.1-405B-Instruct-FP8")
```
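To actually generate text, pass prompts and sampling parameters to `generate`. A minimal sketch; `tensor_parallel_size=8` is an assumption for an 8-GPU node (a 405B model has to be sharded across GPUs), so adjust it to your hardware:
```
from vllm import LLM, SamplingParams

# tensor_parallel_size=8 is an assumed value for an 8-GPU node; adjust to your setup
model = LLM("sllhf/Meta-Llama-3.1-405B-Instruct-FP8", tensor_parallel_size=8)

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=64)
outputs = model.generate(["What are we having for dinner?"], sampling_params)
print(outputs[0].outputs[0].text)
```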