---
library_name: transformers
---

## How to run it

### Set up the environment

```
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
pip install fbgemm-gpu==0.8.0rc4

# Download the enablement fork from
# https://huggingface.co/sllhf/transformers_enablement_fork/tree/main
# and unzip the file
cd transformers

# Add the changes from this PR: https://github.com/huggingface/transformers/pull/32047
git fetch origin pull/32047/head:new-quant-method
git merge new-quant-method
pip install -e .

# Install accelerate from source
git clone https://github.com/huggingface/accelerate.git
cd accelerate
pip install -e .

# Next, install vLLM from this PR: https://github.com/vllm-project/vllm/pull/6559
git clone https://github.com/vllm-project/vllm.git
cd vllm
git fetch origin pull/6559/head:fbgemm-checkpoints
git checkout fbgemm-checkpoints
pip install -e .
```

### Load back the HF model

```
from transformers import FbgemmFp8Config, AutoModelForCausalLM, AutoTokenizer

model_name = "sllhf/Meta-Llama-3.1-405B-Instruct-FP8"

# The FP8 quantization config is stored in the checkpoint and applied
# automatically when loading back; FbgemmFp8Config() is what you would pass
# as quantization_config= when quantizing a full-precision model.
quantization_config = FbgemmFp8Config()

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto")

tokenizer = AutoTokenizer.from_pretrained(model_name)

input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

# Make sure to set your own generation params (temperature, top_p, etc.)
output = quantized_model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

### How to run it with vLLM

```
from vllm import LLM

model = LLM("sllhf/Meta-Llama-3.1-405B-Instruct-FP8")
```
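A minimal sketch of generating text with the vLLM engine created above, using vLLM's standard `SamplingParams` and `generate` interface. The `tensor_parallel_size` value is an assumption; set it to match the number of GPUs needed to hold the 405B FP8 checkpoint on your hardware.

```
from vllm import LLM, SamplingParams

# tensor_parallel_size=8 is only an example; adjust for your GPU topology.
llm = LLM("sllhf/Meta-Llama-3.1-405B-Instruct-FP8", tensor_parallel_size=8)

# Set your own sampling parameters (temperature, top_p, max_tokens, ...).
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=64)

outputs = llm.generate(["What are we having for dinner?"], sampling_params)
for out in outputs:
    print(out.outputs[0].text)
```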