---
library_name: transformers
---

## How to run it

### Set up the environment

```
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
pip install fbgemm-gpu==0.8.0rc4

# Download the enablement fork from https://huggingface.co/sllhf/transformers_enablement_fork/tree/main
# and unzip it, then:
cd transformers

# Add the changes from this PR: https://github.com/huggingface/transformers/pull/32047
git fetch origin pull/32047/head:new-quant-method
git merge new-quant-method
pip install -e .

# Install accelerate from source
git clone https://github.com/huggingface/accelerate.git
cd accelerate
pip install -e .

# Next, install the vLLM PR https://github.com/vllm-project/vllm/pull/6559
git clone https://github.com/vllm-project/vllm.git
cd vllm
git fetch origin pull/6559/head:fbgemm-checkpoints
git checkout fbgemm-checkpoints
pip install -e .
```
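Once everything is installed, a quick import check along these lines (a minimal sketch, not part of the setup itself) helps confirm that the nightly PyTorch build, `fbgemm-gpu`, and the patched `transformers` branch are all visible to Python:

```
import torch
import fbgemm_gpu  # verifies the FBGEMM extension loads
from transformers import FbgemmFp8Config  # only available once the PR branch is merged in

print(torch.__version__)          # expect a nightly cu121 build
print(torch.cuda.is_available())  # the FP8 kernels need a CUDA GPU
print(FbgemmFp8Config())          # confirms the quantization config class is importable
```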
|
|
|
### Load back the HF model

```
from transformers import FbgemmFp8Config, AutoModelForCausalLM, AutoTokenizer

model_name = "sllhf/Meta-Llama-3.1-405B-Instruct-FP8"

quantization_config = FbgemmFp8Config()
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=quantization_config, device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

# Make sure to set your own generation parameters (temperature, top_p, etc.).
output = quantized_model.generate(**input_ids, max_new_tokens=10)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```
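For context, a checkpoint in this format can be produced with the same `FbgemmFp8Config`. The sketch below (the source model id and output directory are placeholders, not part of this repo's workflow) quantizes a bf16 checkpoint to FP8 on load and saves it:

```
from transformers import AutoModelForCausalLM, AutoTokenizer, FbgemmFp8Config

# Placeholder names -- substitute your own source checkpoint and output path.
source_model = "meta-llama/Meta-Llama-3.1-405B-Instruct"
output_dir = "./llama-3.1-405b-instruct-fp8"

# Quantize the bf16 weights to FP8 on load, then save the quantized checkpoint.
quantization_config = FbgemmFp8Config()
model = AutoModelForCausalLM.from_pretrained(
    source_model, device_map="auto", quantization_config=quantization_config
)
model.save_pretrained(output_dir)

tokenizer = AutoTokenizer.from_pretrained(source_model)
tokenizer.save_pretrained(output_dir)
```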
|
|
|
### How to run it with vLLM

```
|
from vllm import LLM

model = LLM("sllhf/Meta-Llama-3.1-405B-Instruct-FP8")
```
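From there, generation uses the standard vLLM API. The snippet below is only an illustrative sketch: the prompt and sampling values are made up, and a model of this size will typically need `tensor_parallel_size` set to span several GPUs.

```
from vllm import LLM, SamplingParams

# Illustrative settings only -- tune the prompt and sampling parameters for your use case.
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=64)

llm = LLM("sllhf/Meta-Llama-3.1-405B-Instruct-FP8", tensor_parallel_size=8)  # 8 GPUs assumed
outputs = llm.generate(["What are we having for dinner?"], sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```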