Benchmark Evals?
Just discovered this model, and I agree: its writing and reasoning depth seem greatly improved.
Are you going to submit this to the Hugging Face leaderboard? I'm interested in seeing its benchmarks.
Nice work!
I just tried comparing another 7B model (not this one) with its extended version (made using the same config) on the Open LLM Leaderboard. Here is what I get:
| Metric | Diff | Extended (10.7B) | Original (7B) |
|---|---|---|---|
| Avg. | -3.76 | 69.75 | 73.51 |
| AI2 Reasoning Challenge (25-shot) | -3.07 | 68.09 | 71.16 |
| HellaSwag (10-shot) | -0.66 | 87.10 | 87.76 |
| MMLU (5-shot) | -0.34 | 64.43 | 64.77 |
| TruthfulQA (0-shot) | -0.97 | 64.28 | 65.25 |
| Winogrande (5-shot) | -0.31 | 82.72 | 83.03 |
| GSM8k (5-shot) | -17.21 | 51.86 | 69.07 |
But the behaviour in chat seems good and stable. Thanks for this great config!
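For anyone wondering what such an "extended" config typically looks like: below is a minimal mergekit passthrough sketch, assuming the usual SOLAR-style depth upscaling of a 32-layer Mistral-based 7B. The model name and exact layer ranges are placeholders, not the actual config used here.

```yaml
# Sketch of a SOLAR-style depth-upscaling ("extended") self-merge.
# Assumption: a 32-layer Mistral-style 7B base. The overlapping
# slices duplicate layers 8-23, giving 48 layers (~10.7B params).
slices:
  - sources:
      - model: your-org/your-7b-model   # placeholder, not the actual model
        layer_range: [0, 24]
  - sources:
      - model: your-org/your-7b-model
        layer_range: [8, 32]
merge_method: passthrough
dtype: float16
```

Running `mergekit-yaml config.yaml ./output-model` on a config of this shape is what produces the 10.7B "extended" variant.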
@seyf1elislam Interesting, thanks for sharing. Nothing a little fine-tuning couldn't fix, and it could potentially raise the ceiling on evals like MMLU.
@senseable Exactly: we have the potential to build some amazing larger models with the great Mistral-7B as a base, and your fine-tune is the perfect starting point. I think the process should go: fine-tune > self-merge > fine-tune > self-merge > and so on.
After each self-merge, reapplying the original fine-tune should help realign the layers and remove the errors the self-merge introduces. It should also produce a new model that can itself be further self-merged. If you would like to try reapplying your WestLake fine-tune to this 10.7B self-merge, I would like to see how far we can push it. I expect the next good self-merge could yield a 16-20B model, and maybe it is possible to push it all the way to 34B (a rough sketch of what that next-stage config might look like is below, after the table).
Here is the Hugging Face Open LLM Leaderboard comparison:
| Metric | Diff | WestLake-10.7B-v2 | WestLake-7B-v2 |
|---|---|---|---|
| Avg. | -5.14 | 70.28 | 75.42 |
| AI2 Reasoning Challenge (25-shot) | -1.88 | 71.16 | 73.04 |
| HellaSwag (10-shot) | -0.72 | 87.93 | 88.65 |
| MMLU (5-shot) | -0.90 | 63.81 | 64.71 |
| TruthfulQA (0-shot) | -2.15 | 64.91 | 67.06 |
| Winogrande (5-shot) | -1.58 | 85.40 | 86.98 |
| GSM8k (5-shot) | -19.18 | 48.45 | 67.63 |
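To make that next step concrete, here is a purely illustrative mergekit passthrough sketch for the follow-up self-merge, assuming the 10.7B has 48 layers and has been re-fine-tuned first. The checkpoint name and layer ranges are my guesses, not a tested recipe.

```yaml
# Hypothetical next-stage self-merge: takes a re-fine-tuned
# 48-layer 10.7B model to 72 layers (roughly 16B parameters).
slices:
  - sources:
      - model: WestLake-10.7B-v2-finetuned   # hypothetical re-fine-tuned checkpoint
        layer_range: [0, 36]
  - sources:
      - model: WestLake-10.7B-v2-finetuned
        layer_range: [12, 48]
merge_method: passthrough
dtype: float16
```

Note the duplicated middle slice (layers 12-35): those are exactly the layers the subsequent fine-tune would need to realign before attempting the next merge toward 34B.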