Suitable hardware config for using this model

#22
by nimool - opened

I'm looking for a good hardware config that can run this model well. Has anyone been able to run this model on a server or a local machine? If so, please share that machine's hardware config, including the GPU model, RAM, etc.

Hello! To run the 70B at 4-bit quantization you would need at least 42 GB of VRAM to fully offload the model onto the GPU for fastest inference; otherwise, offload the rest of it to your RAM. Ideally get 64 GB of RAM and 2x RTX 3090 for a cost-effective setup! Hope that helps!
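
For reference, here is a minimal sketch of what a 4-bit setup like this could look like with transformers + bitsandbytes (neither library is named in this thread, and the model ID is my assumption of the repo this discussion is attached to); `device_map="auto"` fills the GPU(s) first and offloads the remaining layers to system RAM:

```python
# Sketch only: 4-bit load with spill-over to system RAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"  # assumed model ID

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # fill GPU VRAM first, offload the rest to RAM
)

prompt = "How much VRAM does a 70B model need at 4-bit?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```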

Thank you for your reply. Would you mind telling me what resource these configs come from?

Sure thing, please check out the following website regarding the resource requirements: https://llamaimodel.com/requirements/#70B

Many thanks, my friend

I don't think that's enough if you use a long context. I load it on a server with 4x NVIDIA A40 and it runs smoothly, at about the same speed as ChatGPT, but I don't use the full context length.
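
For anyone curious, here is a minimal sketch of how a setup like that could shard the model across 4 GPUs with vLLM tensor parallelism (my assumption; the commenter doesn't say which serving stack they use, and the model ID and context cap are illustrative):

```python
# Sketch only: 4-way tensor parallel inference with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",  # assumed model ID
    tensor_parallel_size=4,  # shard the weights across the 4 GPUs
    max_model_len=8192,      # cap context so the KV cache fits in VRAM
)

params = SamplingParams(max_tokens=128)
result = llm.generate(["What hardware do I need for a 70B model?"], params)
print(result[0].outputs[0].text)
```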

Thank you for your remark. Truly, 4x A40 would be amazing, although a bit costly for the solo enthusiast (5k USD per GPU), and even that sometimes isn't enough to run full-precision models... Try to evaluate what your needs are, decide what fits you better based on that, and adjust your hardware accordingly.

You can rent servers when you use them and pay only for that time, given that you don't need them 24/7. You can do it here on Hugging Face and at a couple of other places. That's what I do personally; it's a lot cheaper than buying hardware and more powerful. About $25 for 15 hours gives you plenty of time to play around, compared to having a slow setup at home.

Will it work on 24 GB VRAM with an RTX 4090?

If you have 64 GB of RAM, yes, but not this version; you'll have to use the GGUF format with LM Studio. You'll get about 0.5 tokens/second at best. If you can wait half an hour for a response, then you can say it works.
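
As a scriptable alternative to the LM Studio GUI, here is a minimal sketch of the same idea with llama-cpp-python: a 4-bit GGUF with as many layers as fit offloaded to the 24 GB GPU and the rest kept in system RAM (the file name and layer count are illustrative, not from the thread):

```python
# Sketch only: partial GPU offload of a 4-bit GGUF on a 24 GB card.
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-3.1-Nemotron-70B-Instruct.Q4_K_M.gguf",  # hypothetical local file
    n_gpu_layers=40,  # tune to what fits in 24 GB VRAM; remaining layers stay in RAM
    n_ctx=4096,
)

out = llm("Q: Will a single RTX 4090 run a 70B model?\nA:", max_tokens=128)
print(out["choices"][0]["text"])
```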

You can run it on a single H100 with the FP8 quantized version - https://huggingface.co/mysticbeing/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC
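
A minimal sketch of loading that FP8 checkpoint on a single H100, assuming vLLM as the serving engine (the commenter doesn't say which one they use):

```python
# Sketch only: single-GPU inference with the linked FP8 checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(model="mysticbeing/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC")
params = SamplingParams(max_tokens=128)
print(llm.generate(["Hello!"], params)[0].outputs[0].text)
```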
