FP8 quantized model now available! (needs only about half the VRAM of the original model)

#33
by mysticbeing - opened

Runs on a single 80GB H100 or A100: https://huggingface.co/mysticbeing/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC
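
For anyone wondering where the "half the VRAM" figure comes from, here is a rough back-of-the-envelope sketch. It assumes a nominal 70 billion parameters, that the original checkpoint stores weights in a 16-bit format (2 bytes each), and that every weight is stored as FP8 (1 byte each). It counts weights only, so KV cache, activations, and any layers kept in higher precision are ignored; treat the numbers as estimates, not measurements.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: nominal "70B" parameter count, not the exact figure.
NUM_PARAMS = 70e9

# Assumption: original weights are 16-bit (2 bytes per parameter).
original_gb = weight_memory_gb(NUM_PARAMS, 2)

# FP8 stores each weight in 1 byte.
fp8_gb = weight_memory_gb(NUM_PARAMS, 1)

print(f"16-bit weights: ~{original_gb:.0f} GB")
print(f"FP8 weights:    ~{fp8_gb:.0f} GB")
print(f"Left on one 80 GB GPU: ~{80 - fp8_gb:.0f} GB")
```

Under these assumptions the FP8 weights just fit on one 80GB card, with the remainder left over for the KV cache and activations, so the usable context length will depend on your serving settings.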

Weight-and-activation quantization to FP8 is virtually lossless: text generated by FP8 models is nearly indistinguishable from that of their unquantized counterparts, and the differences only show up under very close examination.
