Dataset Example
Hello, what kind of dataset did you use? Could you give some information about the training?
I have a dataset like this, about 40 thousand examples in total. However, I can't get very good results for Turkish. On long texts, for example a 300-character sentence, the model keeps writing the same words.
I tried https://huggingface.co/google/mt5-small and https://huggingface.co/google/mt5-base, but I have problems with long texts on both models. The model corrects one part of the sentence and then starts repeating the same words over and over.
In my humble experience, here are a few things that you might consider:
mT5 is a bit hard to train and still struggles a lot with grammar. I would recommend getting much more training data if possible. I also prefer mt5-base over mt5-small.
This could be due to your decoding strategy; I don't know exactly what you use. Beam search? How many beams? Typical-p sampling? Temperature? I would recommend beam search here for correctness and less hallucination.
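Repeating the same words is a classic degeneration symptom, and one common counter is to forbid the decoder from producing an n-gram it has already generated (in Hugging Face `transformers` this is the `no_repeat_ngram_size` argument of `model.generate`). Here is a minimal pure-Python sketch of that idea over a toy scoring function, just to show the mechanism; the function names and the toy scorer are mine, not from any library:

```python
# Sketch of n-gram blocking, the idea behind `no_repeat_ngram_size`
# in transformers' `model.generate`. The "model" here is just a
# hypothetical next-token scoring function.

def banned_next_tokens(generated, n):
    """Tokens that would complete an n-gram already present in `generated`."""
    if len(generated) < n - 1:
        return set()
    prefix = tuple(generated[-(n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned

def greedy_decode(score_fn, vocab, steps, n=3):
    """Greedy decoding that refuses to repeat any n-gram."""
    out = []
    for _ in range(steps):
        banned = banned_next_tokens(out, n)
        # Fall back to the full vocab if everything is banned
        # (a degenerate case with tiny vocabularies).
        candidates = [t for t in vocab if t not in banned] or list(vocab)
        out.append(max(candidates, key=lambda t: score_fn(out, t)))
    return out
```

With a scorer that always prefers one token, plain greedy decoding would emit that token forever; with blocking, the output can never contain the same trigram twice, which is exactly what suppresses the "same words over and over" failure mode.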
Also, how do you prompt? How did you design your prefix and special characters?
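One thing worth checking: unlike T5, mT5 was not pretrained with supervised task prefixes, so a prefix only carries meaning if you attach it consistently at both training and inference time. A minimal sketch of what I mean, where the `"duzelt: "` prefix is purely a hypothetical choice of mine:

```python
# Hypothetical prefix convention for a grammar-correction fine-tune of mT5.
# "duzelt: " ("correct:" in Turkish) is an assumed prefix, not anything
# mT5 was pretrained with -- it only works if you train with it consistently
# and use the exact same string when you call generate later.

PREFIX = "duzelt: "

def build_example(source: str, target: str) -> dict:
    """Format one training pair the same way at train and inference time."""
    return {"input": PREFIX + source.strip(), "target": target.strip()}
```

A mismatch here (prefix present during training but missing at inference, or vice versa) is a common cause of odd outputs.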
We did the training with happytransformer, but I think we made a mistake there.
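For reference, here is roughly how the data file and training call look with happytransformer's text-to-text interface. The column names and the `TTTrainArgs` parameters are my recollection of that library's API, so please verify them against its documentation; only the CSV-writing part below is plain stdlib code:

```python
# Sketch of a happytransformer fine-tuning setup (assumed API; check the
# happytransformer docs). Its text-to-text trainer expects a CSV with
# "input" and "target" columns -- getting the header wrong is an easy
# mistake to make.
import csv

def write_train_csv(pairs, path="train.csv"):
    """pairs: iterable of (broken_sentence, corrected_sentence) tuples."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["input", "target"])  # header the trainer expects
        for src, tgt in pairs:
            writer.writerow([src, tgt])

# Assumed training call (requires happytransformer and a model download):
# from happytransformer import HappyTextToText, TTTrainArgs
# happy_tt = HappyTextToText("MT5", "google/mt5-base")
# happy_tt.train("train.csv", args=TTTrainArgs(batch_size=8, num_train_epochs=3))
```

If the mistake was in the data formatting rather than the hyperparameters, it would show up right here.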