MMLU
Just curious, does the training dataset include the MMLU dataset that is used during evaluation? If so, is it still fair to use MMLU as a metric to evaluate this model? Compared with other 13B (or even 30B) models, its MMLU score is very high.
Hi,
Yes, our model was trained on MMLU*, as we stated in the readme description and in the results table we provided. This model was originally meant to be private, but the company decided to release it as well. Our main goal was improving the model's performance in Polish, and our internal evaluation didn't include any of the leaderboard datasets, as we focused on testing the model in Polish. We originally planned to release a model trained without MMLU, but that training is still in progress. We will release it as soon as it finishes, though that will take another few weeks. Our 7B model's results were obtained without MMLU in the training set.
We never planned to evaluate trurl-13b on MMLU in the leaderboard; unfortunately, anyone can submit a model for evaluation.
Since it is already on the leaderboard, on Monday we will send a request to the HF leaderboard team to remove our 13B model from the MMLU ranking.
I hope this answers your question.
*It was only a part of MMLU, modified to plain text (no ABCD answer choices), but MMLU data was still in the training set.
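To illustrate what "modified to only text (no ABCD answers)" could mean, here is a minimal sketch of turning a multiple-choice MMLU item into a plain question/answer pair. The item structure and the `mmlu_to_text` helper are assumptions for illustration, not VoiceLab's actual preprocessing pipeline:

```python
# Hypothetical sketch: convert one MMLU multiple-choice item into plain
# text by dropping the A/B/C/D labels and keeping only the question and
# the text of the correct answer. Not the authors' actual code.

def mmlu_to_text(question: str, choices: list[str], answer_index: int) -> str:
    """Return a plain-text question/answer pair for one MMLU item."""
    return f"Question: {question}\nAnswer: {choices[answer_index]}"

item = {
    "question": "Which planet is closest to the Sun?",
    "choices": ["Venus", "Mercury", "Earth", "Mars"],
    "answer": 1,  # index of the correct choice
}

print(mmlu_to_text(item["question"], item["choices"], item["answer"]))
```

Even in this stripped form, the questions and answers themselves overlap with the evaluation set, which is why the model's MMLU score is inflated.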
Hello,
Just a quick update. Our model has been excluded from the leaderboard, and we asked whether it would be possible to add a new parameter to the model card that would block submissions of the model to the leaderboard, so a similar situation (a third party submitting it again) won't happen.
We will release the model trained without MMLU when training ends, and that one will be submitted to the leaderboard by us. However, we expect its results will not be much higher than Llama 2's (if not slightly lower), as the model was trained to perform better in Polish, especially on business-related tasks, and no effort was made toward beating other models on HF benchmarks.
Cheers
Just a quick update. The new Trurl is available here: https://huggingface.co/Voicelab/trurl-2-13b-academic