Fine-tuning toolkit for Mixtral 8x7B MoE model
It requires only 28GB of GPU memory to fine-tune the 8x7B model with LLaMA Factory.
We adopt 4-bit quantization, LoRA adapters, and FlashAttention-2 to save GPU memory.
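Roughly, the recipe amounts to loading the model in 4-bit NF4 with FlashAttention-2 and attaching LoRA adapters. A minimal sketch with plain transformers + peft (the rank, alpha, and target modules below are illustrative, not our exact defaults):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization keeps the frozen base weights small in memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# only the small LoRA adapters are trained; values here are illustrative
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()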
This sounds great! Could you kindly provide your command-line parameters and a DeepSpeed config to run it on multiple H100s?
This is great, @hiyouga. I wonder how efficient the training will be, especially with sparse models, and how issues like token dropping will be addressed.
I saw some comments suggesting that quantization is an issue when working with the Mixtral MoE.
Mixtral routes each token to a subset of experts. Quantization can distort the router probabilities for each token, so routing may collapse onto only a small portion of the experts.
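A toy way to see this (not the actual Mixtral routing code, just softmax + top-2 over made-up logits): small perturbations of the router logits, like quantization error, can flip which experts a token is sent to.

import torch

torch.manual_seed(0)
num_tokens, num_experts = 4, 8
router_logits = torch.randn(num_tokens, num_experts)

# full-precision routing: each token goes to its top-2 experts
top2 = torch.softmax(router_logits, dim=-1).topk(2, dim=-1).indices

# simulate quantization error as a small perturbation of the router logits
noisy_logits = router_logits + 0.1 * torch.randn_like(router_logits)
noisy_top2 = torch.softmax(noisy_logits, dim=-1).topk(2, dim=-1).indices

# see which tokens would now be sent to a different set of experts
changed = (top2.sort(dim=-1).values != noisy_top2.sort(dim=-1).values).any(dim=-1)
print("original routing:  ", top2.tolist())
print("perturbed routing: ", noisy_top2.tolist())
print("tokens rerouted:   ", changed.tolist())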
Can you share a mini-guide on the steps needed to run the training, or share the commands and configs you used? Thanks!
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
--stage sft \
--do_train \
--model_name_or_path mistralai/Mixtral-8x7B-v0.1 \
--dataset alpaca_en \
--template mistral \
--finetuning_type lora \
--lora_target q_proj,v_proj \
--output_dir mixtral \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 1000 \
--learning_rate 5e-5 \
--num_train_epochs 1.0 \
--quantization_bit 4 \
--bf16
I'm following the same setup and doing 4-bit LoRA fine-tuning on a custom dataset. I tried switching the template between Alpaca and Mistral, but my training loss diverges after about 1k steps. Any ideas?
Reference Notebook - https://colab.research.google.com/drive/1VDa0lIfqiwm16hBlIlEaabGVTNB3dN1A?usp=sharing
cc - @hiyouga
What are the minimum compute resources required to train the model?
@aigeek0x0
We used the q_proj,v_proj modules just to estimate the minimum resource usage. It is recommended to apply LoRA adapters to all linear layers for better fitting.
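For example, assuming the module names from the Hugging Face Mixtral implementation (attention projections plus the per-expert MLP weights w1/w2/w3), an all-linear setup would look roughly like this in a peft config, which corresponds to --lora_target q_proj,k_proj,v_proj,o_proj,w1,w2,w3 with the command above:

from peft import LoraConfig

# module names assumed from the HF Mixtral implementation; rank/alpha are illustrative
all_linear_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "w1", "w2", "w3"],
    task_type="CAUSAL_LM",
)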
@aigeek0x0
You can apply LoRA to just the linear layers of any LLM. When you print(model) you will see the layer names; some people use only the attention projections (q, k, v, o in the case of Mistral), while others use all linear layers. I'm not exactly sure how it affects performance, but it does reduce the memory footprint of the PEFT model, although only by a small amount.
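For instance, a small helper like this (assuming model is the Mixtral model you loaded) lists every nn.Linear name you could pass as a LoRA target:

import torch.nn as nn

def linear_module_names(model):
    """Collect the short names of all nn.Linear modules, as candidate LoRA targets."""
    names = set()
    for full_name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            # peft matches target_modules by suffix, so the short name is enough,
            # e.g. "q_proj" rather than "model.layers.0.self_attn.q_proj"
            names.add(full_name.split(".")[-1])
    names.discard("lm_head")  # usually left out of the LoRA targets
    return sorted(names)

print(linear_module_names(model))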
It's still not clear to me whether one should also fine-tune the routers. Any resources discussing this?
I was also wondering about the need for fine-tuning the routers. Intuitively it does not make much sense to fine-tune the routers together with the proj layers, because it can make training unstable: you would be fine-tuning both the representations and the routers, and changes in one affect the other and vice versa.
Another way to see it is that if you change the expert(s) for a given token, you lose very valuable information from the base model; changing the routing decision at a given layer, e.g. from experts (1,3) to experts (4,6), would have a much bigger, more sudden impact than a small, gradual update of the proj matrices at every step.
But all of this is just speculation, and it may be task-specific.
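Also just speculation, but one practical note: with LoRA the base weights, including the routers, stay frozen unless you explicitly target them, so "not fine-tuning the routers" mostly amounts to leaving the router module out of the LoRA targets. A quick check, assuming the HF Mixtral naming where each layer's router is a linear module called gate and model is the peft-wrapped model:

# no router parameter should be trainable unless "gate" was added to target_modules
router_trainable = [
    name for name, param in model.named_parameters()
    if ".gate." in name and param.requires_grad
]
print(router_trainable)  # expected: [] when the router is left out of the targets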
Is it possible to run the fine-tuning of Mixtral with LLaMA Factory on CPU only, or on both GPU and CPU (my GPU has 16 GB of VRAM)?
Thanks a lot!