Results
Unsurprisingly, it is totally demented. It was worth a shot for science's sake, but watching the per-token perplexity and seeing WHERE it fails... I've come to the conclusion that this line of experimentation really is a dead end.
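For anyone wanting to reproduce the "seeing WHERE it fails" part: per-token perplexity is just `exp` of the negative log-probability the model assigned to each token, and the failure points show up as isolated spikes. A minimal sketch (the function names and threshold here are my own, not from any particular eval harness):

```python
import math

def per_token_perplexity(token_logprobs):
    """Turn each token's log-probability (natural log) into a per-token
    perplexity; higher values mean the model was more surprised."""
    return [math.exp(-lp) for lp in token_logprobs]

def find_spikes(token_logprobs, threshold=100.0):
    """Return (position, perplexity) pairs wherever the per-token
    perplexity exceeds a chosen threshold."""
    ppls = per_token_perplexity(token_logprobs)
    return [(i, p) for i, p in enumerate(ppls) if p > threshold]

# Illustrative log-probs: mostly confident tokens, one severe spike.
logprobs = [-0.5, -1.2, -0.3, -9.2, -0.8]
print(find_spikes(logprobs))  # the -9.2 token stands out as a spike
```

With a real model you would gather `token_logprobs` from the shifted logits of a forward pass; the spike positions are what reveal which sequences break after the merge joins.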
7b models are just too small per layer to have the kind of redundancy needed for multiple slices like this, leaving 11b merges as the only really viable enlarged Mistral. Even then, the problems seen here are scaled down but still apparent at 11b, right down to the pattern of which sequences cause massive perplexity spikes.
Perhaps, if one toyed with the layer placement just right, you could get a “solid” >7b Mistral merge. Even then, it would be smaller than I really want to work with. 70b models and merges like Venus and Goliath prove what seems intuitive: higher-parameter-count models (when executed sanely) will outperform a smaller model at certain tasks.
My last foray into this will be a single-join merge that eats a little more into the layers at the beginning and end; hopefully my hypothesis that you can bleed further into the last few layers with Mistral is correct. But multiple joins are a dead end.
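A single-join self-merge of the kind described above can be expressed as a mergekit passthrough config with two overlapping slices of the same base. The base model name and layer ranges below are illustrative only, not the exact ones planned:

```yaml
# Hypothetical single-join frankenmerge: two overlapping slices of the
# same 7b base, joined once. Layer ranges are illustrative only.
slices:
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [0, 24]
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [8, 32]   # bleeds further into the final layers
merge_method: passthrough
dtype: float16
```

Pushing the second slice's range closer to layer 32 is what "bleeding into the last few layers" means here; the single join sits at the boundary between the two slices.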