---
library_name: transformers
---
## How to run it
### Set up the environment
```
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
pip install fbgemm-gpu==0.8.0rc4
# Download the enablement fork from https://huggingface.co/sllhf/transformers_enablement_fork/tree/main
# and unzip it, then:
cd transformers
# Add the changes from this PR: https://github.com/huggingface/transformers/pull/32047
git fetch origin pull/32047/head:new-quant-method
git merge new-quant-method
pip install -e .
cd ..
# Install accelerate from source
git clone https://github.com/huggingface/accelerate.git
cd accelerate
pip install -e .
cd ..
# Install vLLM with this PR: https://github.com/vllm-project/vllm/pull/6559
git clone https://github.com/vllm-project/vllm.git
cd vllm
git fetch origin pull/6559/head:fbgemm-checkpoints
git checkout fbgemm-checkpoints
pip install -e .
```
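After the installs, a quick import check (a minimal sketch; the printed versions depend on the nightly wheels and PR branches you pulled) confirms everything resolves to the source builds:
```
import torch
import transformers
import accelerate
import vllm

# Versions will reflect the nightly / PR source builds installed above
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
print("vllm:", vllm.__version__)
print("CUDA available:", torch.cuda.is_available())
```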
### Load the quantized model with Transformers
```
from transformers import FbgemmFp8Config, AutoModelForCausalLM, AutoTokenizer

model_name = "sllhf/Meta-Llama-3.1-405B-Instruct-FP8"

# FbgemmFp8Config selects the FBGEMM FP8 backend; the checkpoint is already quantized in this format
quantization_config = FbgemmFp8Config()
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=quantization_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

# Make sure to set your own generation params (temperature, top_p, etc.)
output = quantized_model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
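Since this is an instruct-tuned checkpoint, prompts generally behave better when passed through the chat template. A minimal sketch continuing from the code above (the sampling values are illustrative placeholders, not recommendations from this card):
```
messages = [{"role": "user", "content": "What are we having for dinner?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

# do_sample=True is required for temperature / top_p to take effect
output = quantized_model.generate(
    input_ids, max_new_tokens=64, do_sample=True, temperature=0.6, top_p=0.9
)
# Decode only the newly generated tokens
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```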
### How to run it with vLLM
```
from vllm import LLM

# FP8 (FBGEMM) checkpoint support comes from the vLLM PR installed above
model = LLM("sllhf/Meta-Llama-3.1-405B-Instruct-FP8")
```
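To actually generate text, pass prompts and sampling parameters to `generate`. A minimal sketch; `tensor_parallel_size=8` is an assumption for an 8-GPU node (a 405B model has to be sharded across GPUs), so adjust it to your hardware:
```
from vllm import LLM, SamplingParams

# tensor_parallel_size=8 is an assumed value for an 8-GPU node; adjust to your setup
model = LLM("sllhf/Meta-Llama-3.1-405B-Instruct-FP8", tensor_parallel_size=8)

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=64)
outputs = model.generate(["What are we having for dinner?"], sampling_params)
print(outputs[0].outputs[0].text)
```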