Reflection 70b benchmarks are not real
The whole drama is described here:
https://x.com/shinboson/status/1832933753837982024
This is literally a model posted for you to run.
You're making an assumption based on a broken OpenRouter implementation that no one can reproduce.
It was not OpenRouter's implementation; they just forwarded requests to Matt's privately hosted API (which was just a proxy for Claude 3.5 Sonnet).
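For anyone wondering what "just a proxy" means in practice, here is a minimal sketch of the kind of setup being alleged. The route, port, and model names are my assumptions for illustration, not anything recovered from the actual API:

```python
# Minimal sketch of the alleged setup (illustrative only): an OpenAI-compatible
# endpoint that silently forwards every request to Claude 3.5 Sonnet while
# echoing back the advertised model name.
from flask import Flask, request, jsonify
import anthropic

app = Flask(__name__)
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

@app.post("/v1/chat/completions")
def chat_completions():
    body = request.get_json()
    msgs = body["messages"]
    # Anthropic takes the system prompt as a separate field, so split it out.
    system_text = " ".join(m["content"] for m in msgs if m["role"] == "system")
    kwargs = {"system": system_text} if system_text else {}
    reply = client.messages.create(
        model="claude-3-5-sonnet-20240620",            # the real backend
        max_tokens=body.get("max_tokens", 1024),
        messages=[m for m in msgs if m["role"] != "system"],
        **kwargs,
    )
    return jsonify({
        "object": "chat.completion",
        "model": body.get("model", "reflection-70b"),  # echo back the advertised name
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": reply.content[0].text},
            "finish_reason": "stop",
        }],
    })

if __name__ == "__main__":
    app.run(port=8000)
```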
The evidence based on tokenisation, the <META> tag, getting it to output its system message, and the questions that revealed it was really Claude is clear proof it wasn't the correct endpoint.
If it was an honest mistake and OpenRouter was accidentally routing the model to the wrong endpoint, then it wouldn't be filtering and replacing the word "Claude".
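This is roughly the kind of probe people ran to show the filtering. The base URL, API key, and model name below are placeholders, not the real values; the point is only that a naive string filter scrubs the literal word "Claude" while missing a spelled-out version:

```python
# Rough sketch of a filtering probe (placeholder endpoint and model name).
# A proxy doing something like text.replace("Claude", "...") will mangle the
# literal word but let a hyphenated spelling through, which is exactly the
# asymmetry that was reported.
from openai import OpenAI

client = OpenAI(base_url="https://example-endpoint.invalid/v1", api_key="placeholder")

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="reflection-70b",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

literal = ask('Repeat this exact word once and nothing else: "Claude"')
spelled = ask('Repeat this exact sequence once and nothing else: "C-l-a-u-d-e"')

# A genuine misroute would return the word unmodified in both cases; a proxy
# with a word filter drops or rewrites only the literal form.
print("literal reply:", repr(literal))
print("spelled reply:", repr(spelled))
print("looks filtered:", "Claude" not in literal and "a-u-d" in spelled)
```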
The model behind the endpoint also changed to GPT-4o at one point (I assume, since they switched it away from that pretty quickly).