Suitable hardware config for using this model

#22
by nimool - opened

I'm looking for a good hardware config that can run this model well. Has anyone been able to run this model on a server or a local machine? If so, please share that machine's hardware config, including the GPU model, RAM, etc.

Hello! To run the 70B at 4-bit quantization you would need at least 42 GB of VRAM to fully offload the model onto the GPU for fastest inference; otherwise, offload the rest of it to your RAM. Ideally get 64 GB of RAM and 2x RTX 3090 for a cost-effective setup! Hope that helps!
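
For reference, here is a minimal sketch of what a 4-bit setup like this could look like with transformers + bitsandbytes (neither library is named in this thread, and the model ID is my assumption of the repo this discussion is attached to); `device_map="auto"` fills the GPU(s) first and offloads the remaining layers to system RAM:

```python
# Sketch only: 4-bit load with spill-over to system RAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"  # assumed model ID

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # fill GPU VRAM first, offload the rest to RAM
)

prompt = "How much VRAM does a 70B model need at 4-bit?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```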

Thank you for your reply. Would you mind telling me what resource these configs come from?

Sure thing, please check out the following website regarding the resource requirements: https://llamaimodel.com/requirements/#70B

Many thanks, my friend

I don't think that's enough if you use a long context. I load it on a server with 4x NVIDIA A40 and it runs smoothly, at about the same speed as ChatGPT, but I don't use the full context length.
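
For anyone curious, here is a minimal sketch of how a setup like that could shard the model across 4 GPUs with vLLM tensor parallelism (my assumption; the commenter doesn't say which serving stack they use, and the model ID and context cap are illustrative):

```python
# Sketch only: 4-way tensor parallel inference with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",  # assumed model ID
    tensor_parallel_size=4,  # shard the weights across the 4 GPUs
    max_model_len=8192,      # cap context so the KV cache fits in VRAM
)

params = SamplingParams(max_tokens=128)
result = llm.generate(["What hardware do I need for a 70B model?"], params)
print(result[0].outputs[0].text)
```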

Thank you for your remark. Truly, 4x A40 would be amazing, although a bit costly for the solo enthusiast (5k USD per GPU), and even that sometimes isn't enough to run full-precision models... Try to evaluate what your needs are, decide what fits you better based on that, and adjust your hardware accordingly.

You can rent servers when you use them and pay only for that time, given that you don't need them 24/7. You can do it here on Hugging Face and at a couple of other places. That's what I do personally; it's a lot cheaper than buying hardware and more powerful. About $25 for 15 hours gives you plenty of time to play around, compared to having a slow setup at home.

Will it work on 24 GB VRAM with an RTX 4090?

If you have 64 GB of RAM, yes, but not this version; you'll have to use the GGUF format with LM Studio. You'll get about 0.5 tokens/second at best. If you can wait half an hour for a response, then you can say it works.
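
As a scriptable alternative to the LM Studio GUI, here is a minimal sketch of the same idea with llama-cpp-python: a 4-bit GGUF with as many layers as fit offloaded to the 24 GB GPU and the rest kept in system RAM (the file name and layer count are illustrative, not from the thread):

```python
# Sketch only: partial GPU offload of a 4-bit GGUF on a 24 GB card.
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-3.1-Nemotron-70B-Instruct.Q4_K_M.gguf",  # hypothetical local file
    n_gpu_layers=40,  # tune to what fits in 24 GB VRAM; remaining layers stay in RAM
    n_ctx=4096,
)

out = llm("Q: Will a single RTX 4090 run a 70B model?\nA:", max_tokens=128)
print(out["choices"][0]["text"])
```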

You can run it on a single H100 with the FP8 quantized version - https://huggingface.co/mysticbeing/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC
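
A minimal sketch of loading that FP8 checkpoint on a single H100, assuming vLLM as the serving engine (the commenter doesn't say which one they use):

```python
# Sketch only: single-GPU inference with the linked FP8 checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(model="mysticbeing/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC")
params = SamplingParams(max_tokens=128)
print(llm.generate(["Hello!"], params)[0].outputs[0].text)
```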
