dataset

#1
by prudant - opened

Can you share the small train dataset please? I want to train the same model but on Spanish =)

Hi!
Thanks for showing interest in this model! It was trained as a POC using a small handmade dataset. I uploaded the dataset and made it public here.

Feel free to use it as you please!

THANKS! I was trying to solve follow-up question detection by training a small LLM, but it was a mess (the dataset was too small and too poor). Will try this right now!
regards!

No problem!
The purpose of this model, for me, was to solve the problem of RAG in chat: minimizing the number of tokens used while keeping all of the relevant RAG information. That means if a message is a follow-up question, the previous RAG context is passed along and no new search in the Vector DB is made. I hope this solves your issue. The dataset is still quite small, so I would recommend generating similar data and augmenting it with Gemini, since they have a free API plan; I find it quite useful in cases like these! Something like the sketch below.
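A minimal sketch of that augmentation loop, assuming the google-generativeai Python package; the model name, prompt wording, and the augment helper are my own placeholders, not anything fixed:

```python
import os
import google.generativeai as genai  # pip install google-generativeai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
llm = genai.GenerativeModel("gemini-1.5-flash")  # placeholder model name

def augment(text: str, label: str, n: int = 5) -> list[str]:
    """Ask the LLM for n paraphrases of one labeled training example."""
    prompt = (
        f"Write {n} paraphrases of the following chat message, one per line, "
        f"keeping the same intent ('{label}'):\n{text}"
    )
    response = llm.generate_content(prompt)
    return [line.strip() for line in response.text.splitlines() if line.strip()]

# Example: grow one "follow-up" sample into several synthetic ones.
synthetic = augment("and what did you say about the pricing?", "follow_up")
```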

Same use case =) I'm also training another model for user intent detection, so that chitchat and other intents that don't need a query can be filtered out before searching the Vector DB and sending tokens to the LLM. I had a lot of trouble getting the model into a scaled production environment: the SetFit model head is not really integrated into the body model, which is a problem for serving model artifacts. So I took the body, instantiated an AutoModelForSequenceClassification from it, froze the base model, and let the Trainer do its work on the classification head. That way I got a final model that is Hugging Face/PyTorch compliant for production environments, with the same performance as the original SetFit approach (roughly the sketch below). How did you solve serving the model in a production environment?
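Roughly what I did, as a sketch; the checkpoint name, label count, and toy dataset are stand-ins for my actual setup:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

name = "intfloat/multilingual-e5-base"  # stand-in for the SetFit body
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

# Freeze the whole encoder body so the Trainer only updates the
# freshly initialized classification head.
for param in model.base_model.parameters():
    param.requires_grad = False

# Toy intent data: 0 = chitchat, 1 = needs retrieval.
ds = Dataset.from_dict({
    "text": ["hola, ¿cómo estás?", "¿cuál es la política de devoluciones?"],
    "label": [0, 1],
}).map(lambda x: tokenizer(x["text"], truncation=True, max_length=64))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="intent-clf", num_train_epochs=5),
    train_dataset=ds,
    tokenizer=tokenizer,
)
trainer.train()  # the result is a plain HF/PyTorch model, easy to serve
```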

To be honest, this model was not used in production; it was only tested locally to see if it was a viable option. Even if we had kept it, the model should be small enough to run comfortably, just like the embedding model it was fine-tuned from.
I'm sorry if I didn't understand or answer your question correctly, I'm still a junior trying to figure out all of these production environment things xD

Thanks, no problem, it's perfect!

I took your dataset and passed it to an LLM to generate more similar synthetic data, and the result was a SetFit model with high accuracy. I am using baia/m3 as the embedding model, which is far superior to the other multilingual models on Hugging Face. If you need to improve your model, I can upload the dataset. It is in Spanish, but since these are multilingual models the training language does not matter; it works just as well when predicting across languages. The setup looks roughly like the sketch below.
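As a sketch with a couple of toy rows (the real training data is the augmented Spanish set I mentioned; labels and examples here are just illustrative):

```python
from datasets import Dataset
from setfit import SetFitModel, Trainer, TrainingArguments  # setfit >= 1.0

# Toy rows standing in for the LLM-augmented Spanish dataset.
train_ds = Dataset.from_dict({
    "text": ["hola, ¿todo bien?", "¿y qué me dijiste antes de los precios?"],
    "label": [0, 1],  # 0 = new question / chitchat, 1 = follow-up
})

# Any sentence-transformers checkpoint can serve as the body.
model = SetFitModel.from_pretrained("BAAI/bge-m3")

trainer = Trainer(
    model=model,
    args=TrainingArguments(batch_size=8, num_epochs=1),
    train_dataset=train_ds,
)
trainer.train()

# Multilingual body, so prediction works across languages.
print(model.predict(["what did you say earlier about prices?"]))
```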

prudant changed discussion status to closed
prudant changed discussion status to open

Thanks for the update! That's good to hear. I used e5 as it was the best model for semantic similarity that worked on my dataset in both French and Arabic. I believe the model you mentioned is BAAI/bge-m3, right? Unfortunately I couldn't use it because it was too large for our limitations (constrained to 387 vector dimensions).
I'll definitely give it a try in future projects where I have fewer restrictions!

prudant changed discussion status to closed
