dataset

#1
by prudant - opened

Can you share the small train dataset please? I want to train the same model but on Spanish =)

Hi!
Thanks for showing interest in this model! It was trained as a POC using a small handmade dataset. I uploaded the dataset and made it public here.

Feel free to use it as you please!

THANKS! I was trying to solve follow-up question detection by training a small LLM, but it was a mess (the dataset was too small and too poor). Will try this right now!
regards!

No problem!
The purpose of this model, for me, was to solve the problem of RAG in chat: minimizing the number of tokens used while keeping all of the relevant RAG information. That means if a message is a follow-up question, the previous RAG context is passed along and no new search in the Vector DB is made. I hope this solves your issue. The dataset is still quite small, so I would recommend generating similar data and augmenting it with Gemini, since they have a free API plan; I find it quite useful in cases like these! Something like the sketch below.
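A minimal sketch of that augmentation loop, assuming the google-generativeai Python package; the model name, prompt wording, and the augment helper are my own placeholders, not anything fixed:

```python
import os
import google.generativeai as genai  # pip install google-generativeai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
llm = genai.GenerativeModel("gemini-1.5-flash")  # placeholder model name

def augment(text: str, label: str, n: int = 5) -> list[str]:
    """Ask the LLM for n paraphrases of one labeled training example."""
    prompt = (
        f"Write {n} paraphrases of the following chat message, one per line, "
        f"keeping the same intent ('{label}'):\n{text}"
    )
    response = llm.generate_content(prompt)
    return [line.strip() for line in response.text.splitlines() if line.strip()]

# Example: grow one "follow-up" sample into several synthetic ones.
synthetic = augment("and what did you say about the pricing?", "follow_up")
```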

Same use case =) I'm also training another model for user intent detection, so that chitchat and other intents that don't need a query can be filtered out before searching the Vector DB and sending tokens to the LLM. I had a lot of trouble getting the model into a scaled production environment: the SetFit model head is not really integrated into the body model, which is a problem for serving model artifacts. So I took the body, instantiated an AutoModelForSequenceClassification from it, froze the base model, and let the Trainer do its work on the classification head. That way I got a final model that is Hugging Face/PyTorch compliant for production environments, with the same performance as the original SetFit approach (roughly the sketch below). How did you solve serving the model in a production environment?
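Roughly what I did, as a sketch; the checkpoint name, label count, and toy dataset are stand-ins for my actual setup:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

name = "intfloat/multilingual-e5-base"  # stand-in for the SetFit body
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

# Freeze the whole encoder body so the Trainer only updates the
# freshly initialized classification head.
for param in model.base_model.parameters():
    param.requires_grad = False

# Toy intent data: 0 = chitchat, 1 = needs retrieval.
ds = Dataset.from_dict({
    "text": ["hola, ¿cómo estás?", "¿cuál es la política de devoluciones?"],
    "label": [0, 1],
}).map(lambda x: tokenizer(x["text"], truncation=True, max_length=64))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="intent-clf", num_train_epochs=5),
    train_dataset=ds,
    tokenizer=tokenizer,
)
trainer.train()  # the result is a plain HF/PyTorch model, easy to serve
```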

To be honest, this model was not used in production; it was only tested locally to see if it was a viable option. Even if we had kept it, the model should be small enough to run comfortably, just like the embedding model it was fine-tuned from.
I'm sorry if I didn't understand or answer your question correctly, I'm still a junior trying to figure out all of these production environment things xD

Thanks, no problem, it's perfect!

I took your dataset and passed it to an LLM to generate more similar synthetic data, and the result was a SetFit model with high accuracy. I am using baia/m3 as the embedding model, which is far superior to the other multilingual models on Hugging Face. If you need to improve your model, I can upload the dataset. It is in Spanish, but since these are multilingual models the training language does not matter; it works just as well when predicting across languages. The setup looks roughly like the sketch below.
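As a sketch with a couple of toy rows (the real training data is the augmented Spanish set I mentioned; labels and examples here are just illustrative):

```python
from datasets import Dataset
from setfit import SetFitModel, Trainer, TrainingArguments  # setfit >= 1.0

# Toy rows standing in for the LLM-augmented Spanish dataset.
train_ds = Dataset.from_dict({
    "text": ["hola, ¿todo bien?", "¿y qué me dijiste antes de los precios?"],
    "label": [0, 1],  # 0 = new question / chitchat, 1 = follow-up
})

# Any sentence-transformers checkpoint can serve as the body.
model = SetFitModel.from_pretrained("BAAI/bge-m3")

trainer = Trainer(
    model=model,
    args=TrainingArguments(batch_size=8, num_epochs=1),
    train_dataset=train_ds,
)
trainer.train()

# Multilingual body, so prediction works across languages.
print(model.predict(["what did you say earlier about prices?"]))
```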

prudant changed discussion status to closed
prudant changed discussion status to open

Thanks for the update! That's good to hear. I used e5 as it was the best model for semantic similarity that worked on my dataset in both French and Arabic. I believe the model you mentioned is BAAI/bge-m3, right? Unfortunately I couldn't use it because it was too large for our limitations (constrained to 387 vector dimensions).
I'll definitely give it a try in future projects where I have fewer restrictions!

prudant changed discussion status to closed
