
136GB VRAM enough to run Q8?

Opened by AIGUYCONTENT

I have 120GB of VRAM currently. I have a 4080 (16GB) collecting dust in my closet. The Q8 quant for this LLM is 130GB in size.

If I were to connect the 4080 to the motherboard, that brings me to 136GB. However, I know I need some VRAM overhead (above and beyond the 130GB) for misc. stuff.

Do you think 136GB is enough to run this...at like 4k context? I do not do role playing—I only use AI for work (copywriting and content marketing).

Edit: I also have a 3080 with 10GB of VRAM. Wondering if it's even worth it to go through the hassle of hooking those two cards up? And I know that the entire system would only run at the speed of the 3080 (or 4080 if I only hook that one up).

Yes, 136 GiB (146 GB) of VRAM will be enough to run this in Q8 with 4k context, assuming the layers align nicely to fill all your cards.

I would stick with 120GB of VRAM and go with Q6 with imatrix (will be uploaded soon under https://huggingface.co/mradermacher/magnum-v2-123b-i1-GGUF) or EXL2 at 6.5 bpw, as there is no real benefit to using Q8 and it only makes inference slower. If you really want to go with Q8, you could also just use naively quantized Transformers or GPTQ quants, as they will likely run faster than Q8 GGUF, but I recommend comparing their performance on your specific setup.
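If you want to compare the speed of those options on your own hardware, llama.cpp ships a llama-bench tool; here is a minimal sketch (the file names are placeholders for whichever quants you end up testing):

```bash
# Benchmark prompt processing (-p) and token generation (-n) speed for two quants,
# with all layers offloaded to GPU (-ngl 99). Point -m at your actual GGUF files.
./llama-bench -m magnum-v2-123b.Q6_K.gguf -ngl 99 -p 512 -n 128
./llama-bench -m magnum-v2-123b.Q8_0.gguf -ngl 99 -p 512 -n 128
```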

Edit: I also have a 3080 with 10GB of VRAM. Wondering if it's even worth it to go through the hassle of hooking those two cards up? And I know that the entire system would only run at the speed of the 3080 (or 4080 if I only hook that one up).

No, it will not, if you just spread the layers across all your GPUs as llama.cpp does; only the layers on your slower GPUs will take a bit longer to process. By default, it round-robins across all your GPUs for every token it generates, leaving each GPU idle while layers on the other GPUs are being processed. The performance impact of mixing in a few slightly slower GPUs for some layers will be minimal, especially compared to other factors like GGUF vs. EXL2 vs. GPTQ vs. Transformers, or 6.5 bpw vs. 8 bpw.
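For reference, a minimal llama.cpp invocation that splits the layers this way could look like the sketch below. The --tensor-split ratios are purely illustrative (they assume, just for the example, five 24 GB cards plus the 16 GB 4080 and the 10 GB 3080); adjust them to your actual cards.

```bash
# Offload all layers to GPU (-ngl 99) and split them across the cards roughly
# proportionally to their VRAM. llama.cpp normalizes the --tensor-split ratios,
# so only their proportions matter. Context is set to 4k (-c 4096).
./llama-server -m magnum-v2-123b.Q6_K.gguf \
    -ngl 99 \
    --tensor-split 24,24,24,24,24,16,10 \
    -c 4096
```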

Ok, thanks. I always thought Q8 > Q6.

The main reason I try to run Q8 is due to my supposition that Q8 is smarter than other quants. I need an AI that can pretend it's a human copywriter who is sitting right beside me. I ask it questions and appreciate intelligent responses (e.g., "Does this value proposition have the intended effect of x and y"). Llama 3.1 seems to do this the best so far. But even it falls short sometimes.

p.s. Do you know if there could be any potential difference in quality (or anything else) between this Q6 (https://huggingface.co/anthracite-org/magnum-v2-123b-gguf/tree/main) and your Q8? And I will go ahead and hook up the 4080. Once I get another PCIE cable I will hook up the 3080 as well.

p.s.ssssss...

Apologies—I misread your comment. Will wait for your Q6 with imatrix. I thought they were only doing imatrix for Q4 and below.

Ok, thanks. I always thought Q8 > Q6.

The main reason I try to run Q8 is due to my supposition that Q8 is smarter than other quants. I need an AI that can pretend it's a human copywriter who is sitting right beside me. I ask it questions and appreciate intelligent responses (e.g., "Does this value proposition have the intended effect of x and y"). Llama 3.1 seems to do this the best so far. But even it falls short sometimes.

Beyond i1-Q5_K_S there are diminishing returns, especially for larger models. Over the past year I have seen no meaningful difference between i1-Q6_K and Q8 when measuring perplexity, running popular AI benchmarks, or in my real-world experience. For real-world use, the speed improvement of i1-Q6_K over Q8 almost always matters more than the nearly negligible quality difference. It is like arguing that a 320 kbps MP3 is worse than FLAC: theoretically it is, but 99.9% of humans will not be able to hear the difference.

Here is a table from Qwen1.5-72B-Chat-GGUF (perplexity; lower is better):

| Size | fp16 | q8_0 | q6_k | q5_k_m | q5_0 | q4_k_m | q4_0 | q3_k_m | q2_k |
|------|------|------|------|--------|------|--------|------|--------|------|
| 0.5B | 34.20 | 34.22 | 34.31 | 33.80 | 34.02 | 34.27 | 36.74 | 38.25 | 62.14 |
| 1.8B | 15.99 | 15.99 | 15.99 | 16.09 | 16.01 | 16.22 | 16.54 | 17.03 | 19.99 |
| 4B   | 13.20 | 13.21 | 13.28 | 13.24 | 13.27 | 13.61 | 13.44 | 13.67 | 15.65 |
| 7B   | 14.21 | 14.24 | 14.35 | 14.32 | 14.12 | 14.35 | 14.47 | 15.11 | 16.57 |
| 14B  | 10.91 | 10.91 | 10.93 | 10.98 | 10.88 | 10.92 | 10.92 | 11.24 | 12.27 |
| 32B  | 8.87  | 8.89  | 8.91  | 8.94  | 8.93  | 8.96  | 9.17  | 9.14  | 10.51 |
| 72B  | 7.97  | 7.99  | 7.99  | 7.99  | 8.01  | 8.00  | 8.01  | 8.06  | 8.63  |

As you can see, for 72B the q5_k_m, q6_k and q8_0 results are equal. If you use Q6 with imatrix (i1-Q6_K) on a 123B model, the difference between i1-Q6_K and Q8 will be even smaller, to the point where it likely wouldn't even be measurable unless you run perplexity measurements for days on a large dataset.
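If you ever want to verify this on your own hardware, llama.cpp's perplexity tool is the usual way to do it; a rough sketch (the model path is a placeholder, and wiki.test.raw is the WikiText-2 test file, which you have to download separately):

```bash
# Measure perplexity of a quant over WikiText-2; lower is better. Expect this to
# take a very long time for a 123B model, even fully offloaded to GPU.
./llama-perplexity -m magnum-v2-123b.Q6_K.gguf -f wiki.test.raw -ngl 99
```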

The most important factor is for sure which base model and finetune you use. There are massive differences between different base models and between finetunes of the same base model. The base model defines the general intelligence and capabilities of a model, while the finetune makes it use a certain writing style. If you have a very specific writing style and task in mind, you should consider creating your own finetune based on your specific use-case.

Apologies—I misread your comment. Will wait for your Q6 with imatrix. I thought they were only doing imatrix for Q4 and below.

I-quants have nothing to do with imatrix quants. While it is true that I-quants only go up to Q4, imatrix quants exist for every size up to Q6.
In case you want to check the status of the magnum-v2-123b imatrix quants, you can always check: http://hf.tst.eu/status.html
It has currently completed 9 out of 21 of them. This is quite a big model, so it might take a few days for all of them to be completed. In the meantime, I recommend trying out the Q6_K quants without imatrix just so you can evaluate whether this model best satisfies your use-case.
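If it helps, a quick way to grab them is the Hugging Face CLI; a sketch (the --include pattern is an assumption about the file naming in that repo, so check the file listing first):

```bash
# Download only the Q6_K shards of the non-imatrix GGUF repo into the current folder.
# The --include pattern is a guess at the file names; adjust it to the repo listing.
huggingface-cli download anthracite-org/magnum-v2-123b-gguf \
    --include "*q6_k*" \
    --local-dir .
```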

Great idea and will take you up on that. Will download the Q6_K now. I can report back with my findings if it will help you out? In addition to what I said earlier, I also look for an AI that can follow a very strict set of grammar rules. Honestly, in my mind, it's "basic English grammar 101," but for some reason AI models LOVE to write complex sentences with dependent clauses. They also love to write in passive voice and use duplicate words close together. That is a mortal sin for the content I write.

And no amount of prompt engineering (or character cards/etc.) over the past year and a half has helped. So, I'm currently researching the best way to fine tune and am hopeful that will help. I plan to scrape a few hundred websites in the niche I am in and then use that one tool to format the data so it's ready to be used in fine tuning. I also have ~5 years of content that I wrote. I plan to fine tune the AI on thousands of Google Docs (I work in GDocs...just need to figure out the best way to extract that info and get it in a format that works for fine tuning. And then figure out what software I should use to fine tune it).

As you can see, for 72B the q5_k_m, q6_k and q8_0 results are equal. If you use Q6 with imatrix (i1-Q6_K) on a 123B model, the difference between i1-Q6_K and Q8 will be even smaller, to the point where it likely wouldn't even be measurable unless you run perplexity measurements for days on a large dataset.

Thanks for your detailed response. Will be giving the Q6_K a shot now (which I'm guessing is this one that Nicoboss was referring to: https://huggingface.co/anthracite-org/magnum-v2-123b-gguf/tree/main) and then will monitor i1-Q6_K progress and download it when it's ready and will compare the two.

Edit: There are like 11 files... For Oobabooga, do I need to concatenate via terminal (e.g., "cat magnum-v2-123b-q6_k-00001-of-00011.gguf...blah blah blah 11x > magnum-v2-123b-q6_k.gguf"), or do I just throw them all in the models folder and load up the first one?

I'm currently researching the best way to fine tune and am hopeful that will help. I plan to scrape a few hundred websites in the niche I am in and then use that one tool to format the data so it's ready to be used in fine tuning. I also have ~5 years of content that I wrote. I plan to fine tune the AI on thousands of Google Docs (I work in GDocs...just need to figure out the best way to extract that info and get it in a format that works for fine tuning. And then figure out what software I should use to fine tune it).

There are many tools for finetuning. Here is a list of some I recommend:

But much more important than the tool is your dataset. Ideally your dataset would contain instructions plus the expected text. You can train on raw text alone if you go for a LoRA. If you only care about the writing style, such as getting it to write simpler sentences, a LoRA could be enough, and you don't even need that much data for it; around 20 million characters of text should be enough. If, however, you want to do a full finetune and only have raw text, then you need much more of it and will likely need to first finetune a base model on your text and then instruction-finetune the model yourself, which requires a massive amount of GPU hours for large models. With an instruction-based dataset you can just finetune an already instruction-tuned model, which is way quicker.
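To make the instruction-plus-expected-text idea concrete, a single record in an instruction-style JSONL file could look like the one-liner below (the Alpaca-style field names instruction/input/output are an assumption; match whatever schema your finetuning tool expects):

```bash
# Append one example instruction record to a JSONL training file. The field names
# are assumed Alpaca-style; different finetuning tools expect different schemas.
cat >> instruct_dataset.jsonl <<'EOF'
{"instruction": "Write a 15-word value proposition for a company that sells blue widgets.", "input": "", "output": "Our innovative blue widgets revolutionize business operations by boosting productivity, streamlining complex workflows, and driving efficiency gains."}
EOF
```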

And no amount of prompt engineering (or character cards/etc) over the past year and a half has helped.

Not surprising as the writing style is defined by the finetune/LoRA and not your prompt.
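As for getting the text out of your Google Docs into a trainable format: one low-effort path is to bulk-export them as .docx (for example via Google Takeout) and convert from there. A rough sketch, assuming pandoc and jq are installed and the export lives in ./gdocs-export/:

```bash
# Convert bulk-exported Google Docs (.docx) to plain text, then wrap each document
# into a {"text": ...} JSONL record, a common raw-text format for LoRA training.
mkdir -p txt
for f in gdocs-export/*.docx; do
    pandoc "$f" -t plain -o "txt/$(basename "${f%.docx}").txt"
done
for f in txt/*.txt; do
    jq -Rsc '{text: .}' "$f"
done > raw_text_dataset.jsonl
```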

Will be giving the Q6_K a shot now

It's here. Just download and concatenate those 3 files:
https://huggingface.co/mradermacher/magnum-v2-123b-GGUF/resolve/main/magnum-v2-123b.Q6_K.gguf.part1of3
https://huggingface.co/mradermacher/magnum-v2-123b-GGUF/resolve/main/magnum-v2-123b.Q6_K.gguf.part2of3
https://huggingface.co/mradermacher/magnum-v2-123b-GGUF/resolve/main/magnum-v2-123b.Q6_K.gguf.part3of3
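For example, from a terminal (the cat at the end merges the parts into a single file that Oobabooga/llama.cpp can load):

```bash
# Download the three parts, then concatenate them in order into one GGUF file.
wget https://huggingface.co/mradermacher/magnum-v2-123b-GGUF/resolve/main/magnum-v2-123b.Q6_K.gguf.part1of3
wget https://huggingface.co/mradermacher/magnum-v2-123b-GGUF/resolve/main/magnum-v2-123b.Q6_K.gguf.part2of3
wget https://huggingface.co/mradermacher/magnum-v2-123b-GGUF/resolve/main/magnum-v2-123b.Q6_K.gguf.part3of3
cat magnum-v2-123b.Q6_K.gguf.part1of3 \
    magnum-v2-123b.Q6_K.gguf.part2of3 \
    magnum-v2-123b.Q6_K.gguf.part3of3 > magnum-v2-123b.Q6_K.gguf
```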

Will monitor i1-Q6_K progress and download it when it's ready and will compare the two.

You probably won't notice any difference because Q6_K is already basically identical to the unquantized version, but i1-Q6_K will theoretically be slightly better while offering the same size and performance, so I recommend switching once it is ready.

Hey, I really appreciate you taking the time to help me out today. I would prefer to spend a few months of evenings writing out questions and preferred answers (e.g., Q: "Write a 15-word value proposition for a company that sells blue widgets." A: "Our innovative blue widgets revolutionize business operations by boosting productivity, streamlining complex workflows, and driving efficiency gains.") to create a good dataset that I can properly fine tune an Instruct LLM on (maybe L3.1 8B?). I found quite a few tutorials on Axolotl on YouTube and that should be enough to get me going so I can experiment.

Thanks, downloaded it a few hours ago and just loaded it up. It has a different feel to it for sure vs. Llama 3.1 (duh). Will play around with it for a few days, then download the imatrix quant and see if there's any noticeable difference.

