Is this based on the "Update (5/3)" version?
Gradient uploaded a new version 18 hours ago claiming "Update (5/3): We further fine-tuned our model to strengthen its assistant-like chat ability as well. The NIAH result is updated." (also for their 1048k model btw)
Your quants were uploaded 5 hours ago, so maybe you used the latest source, but it's so close to their update that it could very well have been the previous version.
Yeah, this is why I dislike in-place model updates lmao. Yes, this is using the version from 18 hours ago.
You were fast then! Which is good, but also...
I kind of secretly hoped it wasn't the latest, so there was a chance the new version would be better in chat quality. This version makes formatting errors and is far less detailed/nuanced in its answers compared to the 8k base model. Then, as you already know, there's the odd `</s>` token.
I've made them aware of the `</s>` issue, but for the rest I'm not sure what would be needed:
https://huggingface.co/gradientai/Llama-3-8B-Instruct-262k/discussions/20#66372ca74b43ab85e5ba5dbb
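For anyone who wants to verify this on their own copy, here's a minimal sketch (assuming you have the `transformers` library installed; the repo id is the one from the link above). Llama 3 doesn't use `</s>` at all, its end-of-turn token is `<|eot_id|>`, so if `</s>` encodes to a single added token that suggests something leaked in from a Llama 2 style tokenizer config:

```python
from transformers import AutoTokenizer

# Load the tokenizer straight from the HF repo under discussion
tok = AutoTokenizer.from_pretrained("gradientai/Llama-3-8B-Instruct-262k")

# Llama 3's EOS should be <|eot_id|> / <|end_of_text|>, never </s>
print("eos_token:", tok.eos_token)
print("extra special tokens:", tok.additional_special_tokens)

# If </s> encodes to a single id, it was registered as a special/added
# token; on a clean Llama 3 tokenizer it splits into several pieces.
ids = tok.encode("</s>", add_special_tokens=False)
print("</s> ->", ids, tok.convert_ids_to_tokens(ids))
```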
Ah well, at least we know it's not caused by the way you/llama.cpp do the quantisation.
It is a bit strange that issues like the non-capitalisation and the odd additional stop token were not caught by Gradient in their testing of the model.
I suppose most of their testing was automated and targeted at context retrieval rather than output quality.