Synthetic evaluation hypothesis
#6
by DmitriSS
Q* and Microsoft's Orca 2 used synthetic data (generated with GPT-4) to build efficient LLMs that outcompete larger ones.
Given that,
Hypothesis:
Could we utilise an LLM for synthetic evaluation of other LLMs?
- Perhaps choose the most dominant LLM (GPT-4) as the judge, or screen the generated challenge through several models.
- Use GPT-4 (or a human) to generate a challenge or request for two or more LLMs.
- Use GPT-4 to evaluate, rate, and choose the best responses, then compile synthetic leaderboards (see the sketch after this list).
- Consider making the models critique each other and argue why a given choice was superior, and even evaluate that reasoning.
- Consider fine-tuning a model for this purpose, or at least pre-prompting it for specific evaluation functions.
- Use the collected data to fine-tune new models?
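To make the idea concrete, here is a minimal sketch of one round of that loop, assuming the OpenAI Python client (v1+); the model IDs, helper names, and prompts are placeholders I made up, not a tested setup:

```python
# Minimal LLM-as-judge round: generate a challenge, collect answers from
# two contestant models, have a judge model pick a winner with a critique.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
JUDGE = "gpt-4"    # assumed judge model ID

def ask(model: str, prompt: str) -> str:
    """Send a single-turn prompt to `model` and return the text reply."""
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def generate_challenge() -> str:
    """Step 1: have the judge model invent a challenge prompt."""
    return ask(JUDGE, "Write one hard, self-contained question that tests "
                      "reasoning. Return only the question.")

def judge(challenge: str, answers: dict[str, str]) -> dict:
    """Steps 2-3: show the judge both answers (anonymised as A/B) and ask
    for a verdict plus a critique of why the winner was superior."""
    names = list(answers)
    prompt = (
        f"Question:\n{challenge}\n\n"
        f"Answer A:\n{answers[names[0]]}\n\n"
        f"Answer B:\n{answers[names[1]]}\n\n"
        'Reply as JSON only: {"winner": "A" or "B", "critique": "..."}'
    )
    # A real run would need retries/validation; the judge may emit
    # malformed JSON.
    verdict = json.loads(ask(JUDGE, prompt))
    verdict["winner"] = names[0] if verdict["winner"] == "A" else names[1]
    return verdict

# One example round between two hypothetical contestant model IDs:
contestants = ["gpt-3.5-turbo", "gpt-4"]
challenge = generate_challenge()
answers = {m: ask(m, challenge) for m in contestants}
result = judge(challenge, answers)
print(result["winner"], "-", result["critique"])
```

Tallying many such verdicts (e.g. with Elo-style updates) would give the synthetic leaderboard, and the (challenge, answers, verdict) triples are exactly the data one could later reuse for fine-tuning.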
Downsides: setup effort, experimentation time, token cost.
I apologise; I wrote this on a whim before I had read the actual study...