Context length plus feedback

#1
by MarinaraSpaghetti - opened

Howdy! I'm just here to report that the model breaks down completely after passing the 16k context-length mark. When tested at 32k and 64k respectively, it spat out nonsense, as seen below.

Screenshot 2024-08-22 at 14.11.48.png

Screenshot 2024-08-22 at 14.13.35.png

At 16k context it still holds up, but I wouldn't say the quality is anything special.

Screenshot 2024-08-22 at 14.19.14.png

In comparison, here's a screenshot produced by NemoMix Unleashed on 64k context.

Screenshot 2024-08-22 at 13.18.37.png

If you're a size queen like I am, then it's most likely a skip for you. It's a shame, really, since in an empty-context test the model produced quite nice prose and seems capable of getting into character well. The writing style, however, seems to be lacking; a little stiff and soulless? I wish it were a bit more 'unhinged' and more 'human', if that makes sense, but that's just my personal opinion and preference. I understand most people probably prefer stability. Here's the screenshot.

Screenshot 2024-08-22 at 14.23.41.png

I was testing the model with nothing but Temperature at 1.25, Top A at 0.1, and standard DRY. As for the format, classic ChatML. Here is my Story String with the prompt: https://files.catbox.moe/586gtt.json.
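For reproducibility, here's roughly what that setup looks like in code. It's a minimal sketch assuming a KoboldCpp-style /api/v1/generate endpoint; the parameter names, the DRY values (just the commonly cited defaults), and the system prompt are stand-ins rather than my exact SillyTavern config, which lives in the Story String linked above.

```python
import requests

# Classic ChatML formatting, as used in my tests.
system = "You are {{char}} in a roleplay with {{user}}."  # stand-in for the Story String
user_msg = "Howdy! What brings you to this tavern?"

prompt = (
    f"<|im_start|>system\n{system}<|im_end|>\n"
    f"<|im_start|>user\n{user_msg}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# Sampler values I tested with; key names assume a KoboldCpp-style API
# and may differ on other backends.
payload = {
    "prompt": prompt,
    "max_length": 300,
    "temperature": 1.25,
    "top_a": 0.1,
    "dry_multiplier": 0.8,      # "standard DRY", roughly the usual defaults
    "dry_base": 1.75,
    "dry_allowed_length": 2,
}

resp = requests.post("http://localhost:5001/api/v1/generate", json=payload, timeout=300)
print(resp.json()["results"][0]["text"])
```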

Regardless, thank you for the model and for your hard work! Fine-tuning NeMo is not an easy thing!

Chronos Gold 12B was trained at 16k. If you try the Instruct model, it seems to deteriorate past 16k as well; I believe this comes from the model's pretraining. It's definitely not going to hit 128k with coherency. Also, I should mention that NeMo does weird things at high temperatures. Try lowering it to 0.7 and re-evaluate. I don't use the DRY sampler or Top A, so I can't comment on those.

My settings are:
Temp - 0.7
Presence Penalty - 1.0
Repetition Penalty range - 2800
Min P - 0.10
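If it helps, here's roughly how those values map onto llama-cpp-python. It's a sketch under assumptions: the GGUF filename is hypothetical, and the repetition penalty range corresponds (as far as I know) to `last_n_tokens_size` there, so double-check against your own backend.

```python
from llama_cpp import Llama

# Hypothetical quant filename; the model was trained at 16k, so stay at or below that.
llm = Llama(
    model_path="chronos-gold-12b-1.0.Q5_K_M.gguf",
    n_ctx=16384,
    last_n_tokens_size=2800,   # repetition penalty range
)

prompt = (
    "<|im_start|>user\nDescribe the tavern in two sentences.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

output = llm(
    prompt,
    max_tokens=300,
    temperature=0.7,
    min_p=0.10,
    presence_penalty=1.0,
)
print(output["choices"][0]["text"])
```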

What I find weird is that I have definitely seen people have success using NeMo, and some of the models based on it, at contexts higher than the 16k it was likely trained on; but yeah, they all deteriorate (see e.g. the RULER benchmark, or refer to this terrible MS Paint drawing I made):

image.png
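If you want to see the cliff for yourself, a crude needle-in-a-haystack probe is enough. Below is a rough sketch against an OpenAI-compatible local completions endpoint (llama-server style); the port, the filler prose, and the 4-characters-per-token estimate are all assumptions, and it's nowhere near as rigorous as RULER, just a quick sanity check.

```python
import requests

SENTENCE = "The caravan rolled on through dust and rain, day after day. "
NEEDLE = "The secret passphrase is 'marinara-964'."

def probe(target_tokens: int) -> str:
    # Roughly 4 characters per token; bury the needle in the middle of the context.
    filler = (SENTENCE * 20000)[: target_tokens * 4]
    middle = len(filler) // 2
    context = filler[:middle] + " " + NEEDLE + " " + filler[middle:]
    prompt = f"{context}\n\nQuestion: What is the secret passphrase? Answer briefly.\nAnswer:"
    resp = requests.post(
        "http://localhost:8080/v1/completions",  # adjust for your backend
        json={"prompt": prompt, "max_tokens": 32, "temperature": 0.0},
        timeout=600,
    )
    return resp.json()["choices"][0]["text"].strip()

for ctx in (8_000, 16_000, 32_000, 64_000):
    print(ctx, "->", probe(ctx))
```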

Actually, I have a feeling what happened is:
Base was trained at 16k for most of pretraining, then they did some ~128k training towards the end.
Instruct was then fine-tuned at ~16k, which atrophied the 128k ability it got from the long-context annealing.
All models fine-tuned on base (especially LoRAs, which forget less) work great at longer contexts, but ones with NeMo Instruct in them fall into a pit past 16k.
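For what it's worth, the advertised numbers are easy to read straight from the configs; a small sketch below, assuming the official mistralai repo IDs on the Hub (if the repos are gated you may need to log in first). Of course, what the config claims says nothing about the RULER-style effective context, which is the whole problem.

```python
from transformers import AutoConfig

# Compare what each checkpoint *claims* to support; effective context is another story.
for repo in ("mistralai/Mistral-Nemo-Base-2407", "mistralai/Mistral-Nemo-Instruct-2407"):
    cfg = AutoConfig.from_pretrained(repo)
    print(repo, "->", cfg.max_position_embeddings, "positions, rope_theta =", cfg.rope_theta)
```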

Um, actually…

actually.gif

Models like Gutenberg, which were trained on top of Instruct, work surprisingly well at higher contexts! I think the only model trained on top of Instruct that did not work well at higher contexts was Rocinante. But I agree that most of the models working well at higher contexts are those trained on base (like Shuttle). Honestly, the whole context situation with NeMo is just weird, and I pray for the MistralAI team to release an updated version that fixes those issues. If an 8B model can handle higher contexts, I don't see why a 12B model shouldn't.

Also, thanks for the advice, @elinas! I hope my review didn't come off as too harsh!!! I'm positively sure the model is great!
