How to run the code on Colab Free Tier or macOS?

#131
by dounykim - opened

When I run the code example exactly as it appears on the model page, it stops at "Loading checkpoint shards" after downloading the safetensors files, on both Colab and my Mac, and I get a memory error. Does this mean I cannot run Mixtral unless I upgrade Colab or increase the RAM capacity on my Mac?

Indeed, Mixtral is quite a large model. It has around 45B parameters, which means you need around 90GB (or more) just to store the model weights in half precision; for inference you'll need slightly more GPU memory.
To run this model locally you need to use a quantized version of the model.
Here you can see a rough estimate of the amount of RAM needed: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1/discussions/77#659910160c972f4c7789542d - you can now also run the model in 2-bit precision on a free-tier Google Colab instance: https://colab.research.google.com/drive/1-xZmBRXT5Fm3Ghn4Mwa2KRypORXb855X?usp=sharing (see for reference: https://huggingface.co/posts/ybelkada/434200761252287)
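
For reference, here is a minimal sketch of what a quantized load looks like with transformers and bitsandbytes in 4-bit (note: this is not the 2-bit setup from the notebook above, and even in 4-bit Mixtral still needs roughly 25-30GB of memory, so it will not fit on the free Colab tier):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# 4-bit NF4 quantization via bitsandbytes; the weights are quantized on the fly at load time
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread the layers across the available GPU(s) and CPU if needed
)

inputs = tokenizer("Explain Mixture-of-Experts in one sentence.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```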

@ybelkada Thanks for the comment. I checked the links you provided and it says 32GB of RAM is needed for quantized int4. So I guess it is difficult to run it locally. Maybe running it in 2-bit precision is the only option, right? But what is the difference between just running the model normally and running it in 2-bit precision?

Mistral AI_ org

Performance takes a hit, mostly: you are reducing the precision of the weights, after all. To make it short, quantization considerably reduces the weights' precision -> reducing the memory needed to store them -> reducing the RAM needed overall -> faster and easier to run on less powerful hardware. The drawback is that overall quality drops the lower the precision gets. Still, most of the time it's better to take a quantized version of a huge model than a full-precision model of the same file size, which is why quantization remains very popular!
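
To put some rough numbers on that (a back-of-the-envelope sketch that counts weights only and assumes ~46.7B total parameters for Mixtral; activations, the KV cache and framework overhead come on top):

```python
# Weight-only memory estimate for Mixtral-8x7B at different precisions.
num_params = 46.7e9  # approximate total parameter count

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4), ("2-bit", 2)]:
    gigabytes = num_params * bits / 8 / 1e9
    print(f"{name:>5}: ~{gigabytes:.0f} GB for the weights alone")
```

This gives roughly 93GB at fp16, 47GB at int8, 23GB at int4 and 12GB at 2-bit, which is why the precision you pick decides what hardware you can run on.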

@pandora-s Thank you for the simple explanation. I have another question. How do people usually run these huge models without using quantization? Buy a server with high-end GPUs and lots of RAM?

Mistral AI_ org

That's right, there is little choice to be made; Mixtral, for example, needs 2 GPUs if you want to run it as it is.

@pandora-s I am sorry to keep asking. So you mean running a Mixtral 8x7B model requires 2 GPUs rather than just one really expensive GPU? Then I guess I should just use 2-bit precision. Is there any research comparing the accuracy or benchmarks of the base model against different quantizations?

Mistral AI_ org

@dounykim That's right! You need an absurd amount of VRAM, around 100GB; even an A100 falls short with its 80GB! However, some quantized versions (GGUF quantized models) can be run on the CPU while sharing some layers with the GPU. You can check sites such as https://anakin.ai/blog/how-to-run-mixtral-8x7b-locally/ for the results you can expect; here is a screenshot of one of their comparisons:
[Screenshot: comparison table from the linked blog post]

@pandora-s Thank you again for the link. I checked it and it seems the tests were run with 64GB of RAM and an RTX 4090. As far as I know, the RTX 4090 has 24GB of VRAM. But you said 100GB of VRAM is needed?

Mistral AI_ org
edited Feb 19

Let me try to explain again:

GGUF models can be run on a CPU instead of a GPU, using RAM instead of VRAM, but you can also offload some layers of the model to the GPU, which uses VRAM! To put it simply, a GGUF version of the model can be run by the CPU and GPU simultaneously. If you do the math, 64 + 24 = 88GB of RAM+VRAM; not exactly 100, but that was just a rough figure, and they only tested up to Q8 anyway, not F16. You get the idea.
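
As a concrete sketch of that CPU/GPU split, this is roughly what it looks like with llama-cpp-python, assuming you have already downloaded a GGUF file (the file name and layer count below are only illustrative):

```python
from llama_cpp import Llama

# Load a quantized GGUF model and offload part of its layers to the GPU.
# n_gpu_layers controls the split: 0 = CPU only, -1 = offload every layer that fits.
llm = Llama(
    model_path="./mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",  # illustrative path
    n_gpu_layers=20,  # these layers live in VRAM, the rest stay in system RAM
    n_ctx=4096,       # context window
)

output = llm("[INST] What is a Mixture of Experts? [/INST]", max_tokens=64)
print(output["choices"][0]["text"])
```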

Mistral AI_ org
edited Feb 19

If you want to run the full model on GPU, you will need ridiculously powerful GPUs, whereas GGUF quantized models are smaller, more efficient, and can be run on CPU & GPU.

Thank you very much @pandora-s @dounykim for this insightful discussion! Thanks also for sharing that blog post!

@pandora-s Oh, now I understand. Thanks for the kind explanation! I guess it is pretty much necessary to utilize the CPU.

Mistral AI_ org

Well, not necessarily. If you have a powerful GPU, you don't have to opt for GGUF models; you can run a quantized version of the model entirely on the GPU as well. However, for larger models like Mixtral, most of us may not have the budget (or it may not be worth it) to invest in such a powerful GPU. Specifically for this model, if you aim for minimal quantization, going for a GGUF model and utilizing both the CPU and GPU is the best approach from my point of view. That said, I don't know your situation, budget, or specific requirements.

We have mainly discussed Mixtral, but there are also smaller models like Mistral 7B.

You have different approaches:

  • Run it fully on a GPU -> Full model (or Q/smaller versions). (very fast but expensive)
  • Run it fully on CPU -> GGUF version (or GGUF Q/smaller versions). (extremely slow but cheaper)
  • Run it on both CPU & GPU -> GGUF Version (or GGUF Q versions). (a good balance)

I mostly discussed GGUF models, but there are other quantization formats you might want to explore as well, such as GPTQ and AWQ. GPTQ, for example, is more focused on the GPU.
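
For example, a GPU-only GPTQ load with transformers looks roughly like this (a sketch assuming optimum and auto-gptq are installed; the repo name is only illustrative, check TheBloke's profile for the exact one):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPTQ checkpoints are pre-quantized on disk; transformers picks up the
# quantization config stored in the repo automatically.
model_id = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"  # illustrative repo name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```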

I cannot decisively determine what's best for you, but I wanted to explain GGUF as it's a very good approach that lets you run models on most hardware!

In any case, I'm glad I was of help!

Mistral AI_ org

I recommend checking out the models from our beloved TheBloke here: https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF
He also gives some comparisons between the models!
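
If you want to grab one of those GGUF files programmatically, here is a minimal sketch with huggingface_hub (the file name is illustrative; pick one from the repo's file list):

```python
from huggingface_hub import hf_hub_download

# Download a single quantized GGUF file from TheBloke's repo.
gguf_path = hf_hub_download(
    repo_id="TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF",
    filename="mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",  # illustrative file name
)
print(gguf_path)  # local path you can pass to llama-cpp-python as model_path
```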

@pandora-s Your comments really helped me a lot! I really appreciate it. My MacBook Pro is an M3 with 32GB of RAM, and my desktop has a 4070 Ti, 32GB of RAM, and an i5-13500K. I can try GGUF on my desktop, but as far as I know, I can't run GGUF models on my Mac because macOS does not support bitsandbytes?

Mistral AI_ org
edited Feb 19

I'm not sure myself, but GGUF should work, I believe? Don't quote me on that though; Mac is out of my expertise. But bitsandbytes (BNB) shouldn't be required for GGUF models, I think... we do often use it to train models, though... I am not sure whether there would be a compatibility issue. The easiest way is just to try running a small GGUF model; if it works, then all of them should, provided you have enough RAM/VRAM!

Sorry, I'm glad I could help but Mac is out of my league-
