How to train on my own domain

#2 opened by chaochaoli

Is there any code to continue training?

Any framework for dense retriever training will do the job. We use the codebase at https://github.com/microsoft/unilm/tree/master/simlm
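If the SimLM codebase feels heavy, the same idea (in-batch-negative contrastive fine-tuning of a dense retriever) can be sketched with sentence-transformers. This is not the SimLM recipe, just the general pattern, and the example pairs below are made up:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("intfloat/multilingual-e5-base")

# multilingual-e5 models expect "query: " / "passage: " prefixes on the input texts
train_examples = [
    InputExample(texts=["query: how to bake bread", "passage: A step-by-step guide to baking bread at home."]),
    InputExample(texts=["query: python list comprehension", "passage: List comprehensions provide a concise way to build lists."]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch-negative contrastive loss (InfoNCE-style), the standard dense-retriever objective;
# scale is roughly 1 / temperature
train_loss = losses.MultipleNegativesRankingLoss(model, scale=50)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
    output_path="e5-base-finetuned",
)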

Hi friend! That framework (https://github.com/microsoft/unilm/tree/master/simlm) is a little complex to understand. If you have a simpler example that captures the idea, that would be great. I would like to continue training this model on my own domain.
Thanks in advance, bro!

I was able to fine-tune this model quite well on a domain-specific information retrieval task using FlagEmbedding's fine-tuning methods. You can find the fine-tuning examples at https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune.
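In case it helps others: the FlagEmbedding finetune examples expect the training data as JSONL, one object per line with a query, positive passages, and (hard) negative passages. A minimal sketch of preparing such a file (the records are made up, and the exact schema should be double-checked against the linked examples):

import json

# Made-up records: each line pairs a query (e.g. a course description) with positive and
# hard-negative passages (e.g. skill labels).
records = [
    {
        "query": "Introductory course on machine learning with Python",
        "pos": ["Apply supervised learning algorithms to tabular data"],
        "neg": [
            "Prepare financial statements according to accounting standards",
            "Perform basic welding on steel components",
        ],
    },
]

with open("course_competency_alignment_de.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")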

@pascalhuerten, how many GPUs did you use? How long did it take? Is it possible to train it on a Google Colab Pro A100 (40 GB VRAM)?

@wilfoderek Just a single T4 GPU with 15 GB VRAM was enough to fine-tune this model in about 15 minutes on 2000 data points. Fine-tuning on 26000 triplets took about an hour, so it shouldn’t be a problem for an A100. 😊

FYI: My goal was to fine-tune this model for the task of quickly retrieving the most relevant skills for German-language course descriptions from a database of over 13,000 skills. By fine-tuning on the smaller dataset, I was able to increase the Mean Reciprocal Rank (MRR@10) in this specific domain from 0.32 to 0.69, which is a significant improvement! So fine-tuning is definitely recommended. Even a dataset of just 250 triplets showed notable improvements. Additionally, I also fine-tuned bge_reranker_base on the same dataset, which further increased the MRR to 0.74. The only other embedding model that performed even better for me on this dataset was BAAI/bge-m3, but it also takes about four times as long to compute an embedding compared to intfloat/multilingual-e5-base.
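For reference, MRR@10 here is just the average of 1/rank of the first relevant skill within the top 10 results per query. A small sketch with made-up IDs:

def mrr_at_k(ranked_ids_per_query, relevant_ids_per_query, k=10):
    """Mean Reciprocal Rank@k: average of 1/rank of the first relevant hit in the top k."""
    total = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_ids_per_query)

# Two example queries: the model's ranked skill IDs and the gold skill IDs (made up)
ranked = [["s12", "s7", "s3"], ["s40", "s2", "s9"]]
gold = [{"s7"}, {"s9"}]
print(mrr_at_k(ranked, gold))  # (1/2 + 1/3) / 2 ≈ 0.42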

I used the following training parameters:

torchrun --nproc_per_node 1 \
    -m FlagEmbedding.baai_general_embedding.finetune.run \
    --output_dir multilingual_e5_base_finetuned \
    --model_name_or_path intfloat/multilingual-e5-base \
    --train_data ./course_competency_alignment_de.jsonl \
    --learning_rate 1e-5 \
    --fp16 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 4 \
    --dataloader_drop_last True \
    --normlized True \
    --temperature 0.02 \
    --query_max_len 512 \
    --passage_max_len 64 \
    --train_group_size 4 \
    --negatives_cross_device \
    --logging_steps 10 \
    --save_steps 1500 \
    --query_instruction_for_retrieval ""
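And a rough sketch of how the fine-tuned checkpoint could then be used for retrieval (assuming the --output_dir above loads as a plain transformers checkpoint, in which case SentenceTransformer falls back to mean pooling, which matches e5; the texts below are made up):

from sentence_transformers import SentenceTransformer, util

# Load the fine-tuned checkpoint from the --output_dir above.
model = SentenceTransformer("multilingual_e5_base_finetuned")

# The command above fine-tunes with an empty query instruction, so keep inference consistent
# with whatever prefix convention ("query: " / "passage: " or none) was used during training.
course = "Einführung in die Programmierung mit Python für Anfänger"
skills = [
    "Grundlagen der Programmierung anwenden",
    "Bilanzen nach Handelsrecht erstellen",
    "Schweißnähte an Stahlbauteilen ausführen",
]

course_emb = model.encode(course, normalize_embeddings=True)
skill_embs = model.encode(skills, normalize_embeddings=True)

scores = util.cos_sim(course_emb, skill_embs)[0]
best = int(scores.argmax())
print(skills[best], float(scores[best]))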

@pascalhuerten I am so grateful for your help! It is truly very valuable.
