Context length plus feedback

#1
by MarinaraSpaghetti - opened

Howdy! I'm just here to report that the model breaks down completely after passing the 16k context-length mark. When tested at 32k and 64k respectively, it spat out nonsense, as seen below.

Screenshot 2024-08-22 at 14.11.48.png

Screenshot 2024-08-22 at 14.13.35.png

At 16k context it still holds up, but I wouldn't say the quality is anything special.

Screenshot 2024-08-22 at 14.19.14.png

In comparison, here's a screenshot produced by NemoMix Unleashed on 64k context.

Screenshot 2024-08-22 at 13.18.37.png

If you're a size queen like I am, then it's most likely a skip for you. It's a shame, really, since in an empty-context test the model produced quite nice prose and seems capable of getting into character well. The writing style, however, seems to be lacking; a little stiff and soulless? I wish it were a bit more 'unhinged' and more 'human', if that makes sense, but that's just my personal opinion and preference. I understand most people probably prefer stability. Here's the screenshot.

Screenshot 2024-08-22 at 14.23.41.png

I was testing the model with nothing but Temperature at 1.25, Top A at 0.1, and standard DRY. As for the format, classic ChatML. Here is my Story String with the prompt: https://files.catbox.moe/586gtt.json.
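For reproducibility, here's roughly what that setup looks like in code. It's a minimal sketch assuming a KoboldCpp-style /api/v1/generate endpoint; the parameter names, the DRY values (just the commonly cited defaults), and the system prompt are stand-ins rather than my exact SillyTavern config, which lives in the Story String linked above.

```python
import requests

# Classic ChatML formatting, as used in my tests.
system = "You are {{char}} in a roleplay with {{user}}."  # stand-in for the Story String
user_msg = "Howdy! What brings you to this tavern?"

prompt = (
    f"<|im_start|>system\n{system}<|im_end|>\n"
    f"<|im_start|>user\n{user_msg}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# Sampler values I tested with; key names assume a KoboldCpp-style API
# and may differ on other backends.
payload = {
    "prompt": prompt,
    "max_length": 300,
    "temperature": 1.25,
    "top_a": 0.1,
    "dry_multiplier": 0.8,      # "standard DRY", roughly the usual defaults
    "dry_base": 1.75,
    "dry_allowed_length": 2,
}

resp = requests.post("http://localhost:5001/api/v1/generate", json=payload, timeout=300)
print(resp.json()["results"][0]["text"])
```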

Regardless, thank you for the model and for your hard work! Fine-tuning NeMo is not an easy thing!

Chronos Gold 12B was trained at 16k. If you try the Instruct model, it seems to deteriorate past 16k as well; I believe this comes from the model's pretraining. It's definitely not going to hit 128k with coherency. Also, I should mention that NeMo does weird things at high temperatures. Try lowering it to 0.7 and re-evaluate. I don't use the DRY sampler or Top A, so I can't comment on those.

My settings are:
Temp - 0.7
Presence Penalty - 1.0
Repetition Penalty range - 2800
Min P - 0.10
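If it helps, here's roughly how those values map onto llama-cpp-python. It's a sketch under assumptions: the GGUF filename is hypothetical, and the repetition penalty range corresponds (as far as I know) to `last_n_tokens_size` there, so double-check against your own backend.

```python
from llama_cpp import Llama

# Hypothetical quant filename; the model was trained at 16k, so stay at or below that.
llm = Llama(
    model_path="chronos-gold-12b-1.0.Q5_K_M.gguf",
    n_ctx=16384,
    last_n_tokens_size=2800,   # repetition penalty range
)

prompt = (
    "<|im_start|>user\nDescribe the tavern in two sentences.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

output = llm(
    prompt,
    max_tokens=300,
    temperature=0.7,
    min_p=0.10,
    presence_penalty=1.0,
)
print(output["choices"][0]["text"])
```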

What I find weird is that I have definitely seen people have success using NeMo, and some of the models based on it, at contexts higher than the 16k it was likely trained on; but yeah, they all deteriorate (see e.g. the RULER benchmark, or refer to this terrible MS Paint drawing I made):

image.png
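If you want to see the cliff for yourself, a crude needle-in-a-haystack probe is enough. Below is a rough sketch against an OpenAI-compatible local completions endpoint (llama-server style); the port, the filler prose, and the 4-characters-per-token estimate are all assumptions, and it's nowhere near as rigorous as RULER, just a quick sanity check.

```python
import requests

SENTENCE = "The caravan rolled on through dust and rain, day after day. "
NEEDLE = "The secret passphrase is 'marinara-964'."

def probe(target_tokens: int) -> str:
    # Roughly 4 characters per token; bury the needle in the middle of the context.
    filler = (SENTENCE * 20000)[: target_tokens * 4]
    middle = len(filler) // 2
    context = filler[:middle] + " " + NEEDLE + " " + filler[middle:]
    prompt = f"{context}\n\nQuestion: What is the secret passphrase? Answer briefly.\nAnswer:"
    resp = requests.post(
        "http://localhost:8080/v1/completions",  # adjust for your backend
        json={"prompt": prompt, "max_tokens": 32, "temperature": 0.0},
        timeout=600,
    )
    return resp.json()["choices"][0]["text"].strip()

for ctx in (8_000, 16_000, 32_000, 64_000):
    print(ctx, "->", probe(ctx))
```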

Actually, I have a feeling what happened is:
Base was trained at 16k for most of pretraining, then they did some ~128k training towards the end.
Instruct was then fine-tuned at ~16k, which atrophied the 128k ability it got from the long-context annealing.
All models fine-tuned on base (especially LoRAs, which forget less) work great at longer contexts, but ones with NeMo Instruct in them fall into a pit past 16k.
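For what it's worth, the advertised numbers are easy to read straight from the configs; a small sketch below, assuming the official mistralai repo IDs on the Hub (if the repos are gated you may need to log in first). Of course, what the config claims says nothing about the RULER-style effective context, which is the whole problem.

```python
from transformers import AutoConfig

# Compare what each checkpoint *claims* to support; effective context is another story.
for repo in ("mistralai/Mistral-Nemo-Base-2407", "mistralai/Mistral-Nemo-Instruct-2407"):
    cfg = AutoConfig.from_pretrained(repo)
    print(repo, "->", cfg.max_position_embeddings, "positions, rope_theta =", cfg.rope_theta)
```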

Um, actually…

actually.gif

Models like Gutenberg, which were trained on top of Instruct, work surprisingly well at higher contexts! I think the only model trained on top of Instruct that did not work well at higher contexts was Rocinante. But I agree that most of the models working well at higher contexts are those trained on base (like Shuttle). Honestly, the whole context situation with NeMo is just weird, and I pray for the MistralAI team to release an updated version that fixes those issues. If an 8B model can handle higher contexts, I don't see why a 12B model shouldn't.

Also, thanks for the advice, @elinas! I hope my review didn't come off as too harsh!!! I'm positively sure the model is great!
