Comparison with the distilled model
Great work! 🤗
I’m curious about how the fine-tuned model compares to the distilled one. Would it be possible to add it to your evaluation results table?
Yes, of course I can add the distilled version too.
Since the turbo version trained with the new recipe is even better than the original large model, and the distilled version is not as good as the large one, I haven't compared it directly yet.
Yep, it definitely makes sense from an accuracy point of view, but since the distilled version is still faster (2 vs. 4 decoder layers), one might be interested in the speed/accuracy tradeoff you get with the fine-tuned large-v3-turbo vs. the distilled large-v3. You could even add a generation speed metric (see this gist).
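For reference, a generation speed metric is often reported as a real-time factor (RTF): wall-clock seconds of compute per second of audio, lower being faster. This is just a minimal sketch of that idea, not the gist linked above; the pipeline call in the comment is a hypothetical example of how you might plug in a real model:

```python
import time

def benchmark_generation(generate_fn, audio_duration_s):
    """Time a transcription call and report the real-time factor (RTF):
    wall-clock seconds of compute per second of audio. Lower is faster."""
    start = time.perf_counter()
    text = generate_fn()
    elapsed = time.perf_counter() - start
    return {"text": text, "seconds": elapsed, "rtf": elapsed / audio_duration_s}

# Hypothetical usage with a transformers ASR pipeline (names are illustrative):
# pipe = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3-turbo")
# result = benchmark_generation(lambda: pipe("sample.wav")["text"], audio_duration_s=30.0)
# print(f'RTF: {result["rtf"]:.3f}')
```

Averaging over several files (and a warm-up run to exclude model loading) would make the numbers more stable.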
Idea: before publishing, I will use the new recipe to train another distilled version with 2 decoder layers.
Does that sound good to you? :)