Interesting observations about the strides - I also found that striding back to less than layer 10 causes incoherency.
- Goliath-120B is still a good standard for coherency below 4096 context. A few miqu-1 merges are comparable, but testing found a small amount of coherency could be sacrificed for notable creativity improvements.
I've never had much luck with getting miqu to write stuff. His writing always seems a bit boring/dry and he also always writes something akin to "Quatrains" where every paragraph is short and of similar length.
Can you try your model on these 2 prompts:
Write me the opening chapter of a Grimdark trilogy in the style of Joe Abercrombie and Rob J Hayes. Use third person personal and feature internal monologues of the characters. The POV character for chapter 1 is a cultist who has just escaped his cult. He is dressed in dirty yellow robes and his only possession is a mysterious small mirror. The story starts with him arriving at an apparently deserted ghost town on the edge of a desert.
I found goliath-120b seems the most able to write Grimdark stories of any model I've tried, and when questioned he seems to have a deep knowledge of the subject, e.g. he knows the differences between, say, Joe Abercrombie and Rob J Hayes (who on the surface appear to share very similar writing styles).
Write a new Sherlock Holmes story set in Hull in 1890 about a whaling ship. The story should feature a pocket watch. # Chapter 1: The Letter
I also found goliath-120b
can write almost perfect pastiches of Sherlock Holmes, but the logic of his stories makes no sense most of the time. The addition of the "pocket watch" detail is to test this: sometimes he will even realise himself that he started off with this being an important clue, but then in the end it wasn't used and will actually write this into a discussion between Watson and Holmes in the ending!!!
They are quite funny even when they don't make sense: one time I gave him this prompt and he had Holmes go incognito onto a boat in Hull harbour using the name "Bond, James Bond" but it fit perfectly with the story! :D
I'm now trying to get goliath-120b to write the start of a story and then move it to command-r-plus to have it refined and continued.
writing always seems a bit boring/dry
I was seeing the same. This was the first merge I've made that seems to be fairly creative.
A lot of it hinges on how Senku gets interleaved. I found that base Miqu and the various MiquMaid were all very smart - but they were also sometimes boring and preferred short responses or even showed too much character bias (way too much ERP) for a good writing model.
Senku is capable of very good long and somewhat inventive prose, but it prefers to skirt around conflict / adult writing themes. PlayMaid was the most creative of the MiquMaid models I tested, and interleaving it with Senku seems to have done the trick of making it less restricted and removed some of Senku's AI assistant behavior.
I didn't rate command-r-plus particularly high. It has good recall and an excellent unrestricted writing style.
But it's not a particularly clever model - many of the llama2-derived models I tested outperformed it in intelligence tests. Keep in mind these tests were all done in the testing DnD scenarios I made. I'm not surprised command-r-plus scores better on traditional tests.
From my testing I found goliath is still the best writing model with small writing prompts. Its perplexity score on tests was significantly better than other popular frankenmerges. Tenebris got relatively close on perplexity tests, but Tenebris seems to benefit a lot more from more detailed prompts or examples - if anything I've found it can be too unhinged if you don't give it some example text. Some of my merges actually scored better on perplexity than Tenebris, but they seemed to inherit the stifled writing style from MiquMaid / base Miqu.
Can you try your model on these 2 prompts
- sure, here they are (min_p 0.3, temp 0.6)
The sun was a merciless bastard, thought Eldred as he staggered through the dusty streets of the ghost town. His yellow robes clung to his sweat-soaked skin like a second, suffocating layer of flesh. The relentless heat of the desert was a constant reminder of the price he had paid to escape the suffocating embrace of the cult.
Eldred had been born into the cult, raised on their twisted teachings and indoctrinated into their twisted beliefs. But he had always harbored a seed of doubt, a small, rebellious voice that whispered in his ear, urging him to question the cult's dogma. And now, after years of planning and subterfuge, he had finally managed to slip away from their clutches.
He had stolen away in the dead of night, leaving behind everything he had ever known, everything he had ever loved. He had taken only the small mirror his mother had given him before she had been taken by the cult's dark rituals. It was a talisman of sorts, a reminder of the life he had left behind, and the life he now sought to reclaim.
The mirror was small and unassuming, no larger than his palm, its surface clouded with age and use. But when he held it up to the light, he could see his reflection, distorted and warped, but still his own. It was a reminder of who he had been before the cult had claimed him, before they had twisted his mind and soul to their own dark ends.
As he stumbled through the empty streets of the ghost town, Eldred couldn't help but wonder if he had made the right decision. The cult would be searching for him, of that he had no doubt. They would stop at nothing to reclaim their lost sheep, to drag him back into their fold and punish him for his betrayal.
But he couldn't go back. Not now. Not after everything he had seen, everything he had done in the name of their twisted faith. He had to find a new life, a new purpose, away from the suffocating embrace of the cult.
As he trudged through the dusty streets, Eldred couldn't shake the feeling that he was being watched. The ghost town seemed to hold its breath, as if waiting for something to happen, for the first sign of life to break the oppressive silence.
He quickened his pace, his heart pounding in his chest, his breath coming in ragged gasps. He had to find shelter, somewhere to rest and regroup before continuing his journey.
As he rounded a corner, he spotted a dilapidated building, its once-grand facade now crumbling and decayed. It looked like it had been abandoned for years, but it was the only refuge he could see for miles around.
He hesitated for a moment, his instincts screaming at him to keep moving, to put as much distance between himself and the cult as possible. But his body was spent, his legs trembling with exhaustion. He had no choice but to trust his instincts and hope that the building was as abandoned as it seemed.
He crept towards the building, his senses heightened, his eyes scanning the surrounding area for any sign of movement. The silence was oppressive, the only sound the crunch of his boots on the dusty ground.
As he reached the building, he paused for a moment, his hand hovering over the door handle. He took a deep breath, steeling himself for whatever lay within.
And then, with a quiet prayer to whatever gods might still be listening, he pushed open the door and stepped inside.
The interior of the building was as decrepit as its exterior, the once-grandiose foyer now reduced to a crumbling shell of its former glory. Cobwebs hung from the ceiling, their delicate strands catching the dust motes that danced in the shafts of sunlight that filtered through the broken windows.
Eldred's footsteps echoed through the empty space, the sound bouncing off the walls and reverberating through his skull. He winced, cursing himself for his carelessness. He couldn't afford to attract attention, not here, not now.
He moved cautiously through the foyer, his eyes scanning the shadows for any sign of movement. The building seemed abandoned, but he couldn't afford to take any chances.
The year was 1890, and the city of Hull, Yorkshire, was bustling with activity. The cobbled streets echoed with the clatter of horse hooves and the distant cries of street vendors hawking their wares. The salty sea breeze carried the scent of fish and the faint whiff of tar from the docks.
In the heart of the city, at 221 Kingston Street, resided the renowned detective, Mr. Sherlock Holmes, and his loyal companion, Dr. John Watson. They had recently moved to Hull from London, drawn by the promise of new cases and a quieter life. However, as Holmes often said, "There is no such thing as a quiet life for a consulting detective."
One crisp autumn morning, a knock sounded at their door. Mrs. Hudson, their landlady, shuffled down the stairs to answer it. A moment later, she appeared in the doorway of Holmes's study, bearing a letter on a silver tray.
"A telegram for you, Mr. Holmes," she said, her voice tinged with curiosity.
Holmes, who had been engrossed in his violin, set it aside with a sigh. "Thank you, Mrs. Hudson. You may leave it here."
As the door closed behind her, Watson, who had been perusing the morning paper, looked up with interest. "A telegram, Holmes? That must be urgent."
Holmes took the envelope and slit it open with a letter opener. His keen eyes scanned the contents, his brow furrowing as he read.
"What is it, Holmes?" Watson asked, unable to contain his curiosity.
"It seems we have a new case, Watson," Holmes replied, folding the telegram and tucking it into his waistcoat pocket. "A whaling ship has gone missing off the coast of Hull, and the owner suspects foul play."
# Chapter 2: The Missing Whaler.
Within the hour, Holmes and Watson had donned their coats and hats and were making their way to the docks. The air was crisp and the sky overcast, a fitting atmosphere for the grim task ahead.
As they approached the wharf, they were met by a burly, bearded man in his sixties, dressed in a thick woolen coat and a battered hat. He introduced himself as Captain James McAllister, owner of the missing whaling ship, The Sea Serpent.
"I'm glad ye've come, Mr. Holmes," he said, his voice gruff with emotion. "The authorities are no' taking this seriously, but I ken in my bones that something's amiss."
"Tell us what you know, Captain," Holmes said, his keen eyes taking in every detail of the man's appearance and demeanor.
McAllister led them to a nearby tavern, where he ordered three mugs of strong, dark ale. As they settled around a worn wooden table, he began his tale.
"The Sea Serpent set sail from Hull three months ago, bound for the Arctic waters. She was under the command of my son, Angus, and carried a crew of twenty-five skilled whalers and harpooners. They were due back two weeks ago, but there's been no sign of them."
"Have you received any communication from the ship during its voyage?" Watson asked, taking notes in his journal.
"Aye, we received regular reports from Angus via the telegraph office in Greenland. The last message we received was two weeks ago, just before they were due to head home. They'd had a successful season, and their hold was full of whale oil and blubber. Angus said they were eager to return to their families."
"And what makes you suspect foul play, Captain?" Holmes interjected, his interest piqued.
McAllister reached into his coat pocket and produced a small, tarnished pocket watch. "This belonged to my grandfather, and it's been passed down through the McAllister family for generations. Angus always wore it on his chain, and he swore he'd never lose it. But three days ago, this washed up on the shore near the mouth of the Humber."
Holmes took the watch, examining it closely. "Interesting. And you're certain this belonged to your son?"
"Aye, there's no doubt about it. The watch is engraved with the McAllister family crest on the back."
"And have you shown this to the authorities?" Watson asked.
McAllister scoffed. "Aye, I have, but they just said it was a coincidence. They think The Sea Serpent must've been caught in a storm and sunk with all hands. But I cannae believe that. Angus was an experienced captain, and his crew were the best in the business. Something else must've happened to them."
Holmes handed the pocket watch back to McAllister, his mind already racing with possibilities. "I agree, Captain. There's more to this story than meets the eye. Watson, we'll need to start our investigation immediately. We mustn't lose any more time."
From my testing I found goliath is still the best writing model with small writing prompts. Its perplexity score on tests was significantly better than other popular frankenmerges.
Yeah, my biggest gripe (other than the 4k context) is once he gets going he won't stop writing and ignores instructions.
He also has a lot of trouble switching between different POV characters and will latch onto a single one and/or disregard one character's POV in light of the actions of another's POV he just wrote about (e.g. the attacker and defender of a keep).
As for this model:
You can see in the first Grimdark story above the "positivity" creeping in with the way he frames the POV character:
indoctrinated into their twisted beliefs. But he had always harbored a seed of doubt
Goliath can write completely unhinged dark characters without any of this.
In the heart of the city, at 221 Kingston Street
This looks a bit like some information has been lost or scrambled, as 99% of models this size will know the address of Sherlock Holmes (as it's out of copyright and likely used in pre-training).
I didn't rate command-r-plus particularly high
I didn't to start with, but if you go back and forth with him for 3-4 prompts he loses all the positivity bias and will write quite well.
My hope is to use Goliath's 1-shot ability and command-r-plus's refinement ability to speed this up.
In the heart of the city, at 221 Kingston Street
This looks a bit like some information has been lost or scrambled, as 99% of models this size will know the address of Sherlock Holmes (as it's out of copyright and likely used in pre-training).
Also there is some chance the "attenuation" could help this: if you look at the mergekit github thread, I asked a question about a paper from the early 90s that I knew miqu-1 knew and had been trained on, but my previous merges had all lost this or had it scrambled; the attenuated version of miqu-1-120b seemed to recover the correct information, so it's possible the merge isn't actually losing information when the layers get split/moved and it's simply the theorised "overshooting" problem.
Some information loss is probably inevitable when layer skipping.
What I don't quite understand is how the layer skipping can drastically improve the model's prose and intelligence. My testing has pretty much confirmed this isn't always the case either - sometimes not skipping works better - the only requirement seems to be that the interleave offset is roughly regular.
I intentionally avoid tests that require more niche prior knowledge (i.e. detailed knowledge of niche training material).
Will probably start finetuning soon. Frankly a lot of this is a learning exercise in anticipation of long-context llama3 releasing. I reckon it will be much easier to train out the remaining positivity bias in a Miqu merge than it is to make commandr+ comparatively smarter. commandr+ being a shallower model is possibly part of the problem.
Some information loss is probably inevitable when layer skipping.
What I don't quite understand is how the layer skipping can drastically improve the model's prose and intelligence. My testing has pretty much confirmed this isn't always the case either - sometimes not skipping works better - the only requirement seems to be that the interleave offset is roughly regular.
Yeah, I'd like to see if there is anything that can be analysed about where the overlap is likely to cause the model to go crazy: with coding models I found anything less than the first 8-10 layers couldn't be overlapped, but only the last 2-3 layers were important (and 2 layers mostly worked even then).
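To make the overlap/interleave patterns concrete, this is roughly how I generate the slice lists - a minimal sketch assuming mergekit's passthrough slices/layer_range config schema, with placeholder model names and illustrative layer counts/offsets:

```python
# Minimal sketch: build a regular interleaved passthrough slice pattern for two
# donor models, leaving the first/last layers unsplit (overlapping those is what
# seems to break coherency).  Model names, layer counts and offsets are placeholders.
import yaml

def interleave_slices(model_a, model_b, n_layers=80, block=16, overlap=8,
                      keep_front=10, keep_back=3):
    """Alternate `block`-layer slices from each model, stepping forward by
    `block - overlap` layers so consecutive slices overlap by `overlap` layers."""
    slices = [{"sources": [{"model": model_a, "layer_range": [0, keep_front]}]}]
    start, end, use_b = keep_front, keep_front, False
    while start + block <= n_layers - keep_back:
        end = start + block
        model = model_b if use_b else model_a
        slices.append({"sources": [{"model": model, "layer_range": [start, end]}]})
        start += block - overlap
        use_b = not use_b
    # finish with an unsplit tail so the final layers only appear once
    slices.append({"sources": [{"model": model_a, "layer_range": [end, n_layers]}]})
    return {"merge_method": "passthrough", "slices": slices, "dtype": "float16"}

print(yaml.safe_dump(interleave_slices("placeholder/model-A-70b",
                                       "placeholder/model-B-70b"), sort_keys=False))
```

The keep_front=10 / keep_back=3 defaults just mirror the 8-10 front / 2-3 back figures above.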
Just saw your reddit post about reducing perplexity in frankenmerges, but I don't have a reddit account anymore...
I actually went all out and ran coordinate descent on the scale factors for k_proj and q_proj (with equal values; optimised as one variable) and out_proj and down_proj (again with equal values; optimised as one variable):
I found that with:
k_proj and q_proj = 0.8
out_proj and down_proj = 0.5
I could almost replicate the base model's PPL, but by doing this the writing style was just terrible!
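The coordinate descent itself was nothing fancy - something like the sketch below, where merge_and_eval_ppl is a stand-in for whatever re-merge + perplexity pipeline you have (it's not a mergekit function):

```python
# Sketch of the coordinate descent over the two tied scale factors.
# merge_and_eval_ppl is a placeholder: re-run the merge with the given
# (q/k, out/down) scales, then measure perplexity on held-out text.
def coordinate_descent(merge_and_eval_ppl, qk=1.0, od=1.0, step=0.1, rounds=4):
    best = merge_and_eval_ppl(qk, od)
    for _ in range(rounds):
        for which in ("qk", "od"):                  # optimise one variable at a time
            for delta in (-step, +step):
                cand_qk = qk + delta if which == "qk" else qk
                cand_od = od + delta if which == "od" else od
                ppl = merge_and_eval_ppl(cand_qk, cand_od)
                if ppl < best:
                    best, qk, od = ppl, cand_qk, cand_od
        step /= 2                                   # halve the step each round
    return qk, od, best

# usage: qk, od, ppl = coordinate_descent(my_merge_and_eval_fn)
```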
I also found that:
k_proj and q_proj = sqrt(sqrt(1/2)) = 0.84
down_proj = sqrt(1/2) = 0.71
can make almost any frankenmerge almost 100% coherent, so long as you don't start eating into the early and final layers where the model seems to be transforming into its internal embedding space. These numbers were derived by assuming the layers are producing random i.i.d. vectors and looking at how to transform the sum of their norms back (these are actually the upper bounds for uncorrelated vectors, and I don't really understand why out_proj shouldn't also be scaled; possibly it breaks the distribution entering the MLP layers and/or is already accounted for by the transformer block mixing more values of V by attenuation of the score matrix?).
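To illustrate the i.i.d. assumption behind those numbers (a toy numpy check, nothing model-specific): if a duplicated layer adds a second, roughly uncorrelated copy of a same-magnitude residual update, the summed vector's norm grows by about sqrt(2), so sqrt(1/2) undoes it, and since the attention scores depend on the product of q and k, that correction gets split as sqrt(sqrt(1/2)) on each of q_proj and k_proj:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8192                                    # hidden size - value doesn't matter
a = rng.standard_normal(d)                  # residual update from the original layer
b = rng.standard_normal(d)                  # update from the duplicated layer (i.i.d.)

print(np.linalg.norm(a + b) / np.linalg.norm(a))                    # ~ sqrt(2) ~ 1.41
print(np.linalg.norm(np.sqrt(0.5) * (a + b)) / np.linalg.norm(a))   # ~ 1.0 again

# the attention score uses q.k, which scales with the *product* of the two scales,
# hence sqrt(sqrt(1/2)) on each of q_proj and k_proj:
print(np.sqrt(np.sqrt(0.5)) ** 2, np.sqrt(0.5))                     # both 0.7071...
```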
This scaling seems to work on almost any pattern of repeated or interleaved layers too. The only exception was when the interleaved blocks were very large and stories would have strange "backward time skips".
I say "almost 100% coherent" as you still do sometimes get slightly strange stuff, but probably about as often as the very best frankenmerge. Also mixing 10k and 1M base RoPE seems to work but not models with the RoPE scale set to 4.
I really wish there was something like PPL we could optimise against, but sadly it looks a lot like it isn't really linked to the frankenmerge quality and more just a useful tool to weed out completely broken models :/
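For reference, the PPL numbers I'm quoting are just a standard sliding-window perplexity over a fixed chunk of text - roughly the sketch below (model path, eval file, window and stride are all placeholders):

```python
# Rough sliding-window perplexity of the kind quoted above.
# model_id, eval.txt, window and stride are all placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/frankenmerge"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16,
                                             device_map="auto")

ids = tok(open("eval.txt").read(), return_tensors="pt").input_ids.to(model.device)

window, stride, nlls = 4096, 2048, []
for start in range(0, ids.size(1) - 1, stride):
    chunk = ids[:, start:start + window]
    labels = chunk.clone()
    labels[:, :-stride] = -100              # mask the overlapping prefix so tokens are (mostly) scored once
    with torch.no_grad():
        nlls.append(model(chunk, labels=labels).loss)

print("perplexity:", torch.exp(torch.stack(nlls).mean()).item())
```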
Have just started looking at llama3 merges and it seemed like a good time to revisit this idea of normalizing repeated proj layers.
To keep the variable dimensionality down I've only been testing proj scaling on a 16 layer interleave - no skips - between migtissera/Tess-2.0-Llama-3-70B-v0.2 and NeverSleep/Llama-3-Lumimaid-70B-v0.1-alt.
Think I can confidently say the technique consistently improves both the perplexity score and writing style when applied correctly. Perplexity improvement isn't huge, but big enough for me to think it is not noise. Improvement in the writing prose and coherency is noticeable too. I also saw that the scaling works best when not applied anywhere near the outermost layers.
My optimal values have been a bit higher than sqrt(sqrt(1/2)) and sqrt(1/2) - but I've not done an exhaustive search yet. k_proj / q_proj has been ~0.9 and down_proj has been ~0.85. Not touched out_proj yet either. Annoyingly this is ballooning the number of tweakable variables to the point where automated grid search is pretty hard.
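For now I'm just generating one config per grid point and testing by hand - something like this sketch (it assumes mergekit's passthrough "scale" parameter with per-tensor filters behaves the way I'm using it here, and the exact slice pattern, which blocks get scaled, and the grid values are only illustrative):

```python
# Sketch: emit one mergekit config per (q/k, down_proj) grid point for the
# 16-layer interleave.  The "scale" filter schema and the choice to leave the
# outermost blocks unscaled are assumptions for illustration.
import itertools
import yaml

MODELS = ["migtissera/Tess-2.0-Llama-3-70B-v0.2",
          "NeverSleep/Llama-3-Lumimaid-70B-v0.1-alt"]

def scaled_slice(model, lo, hi, qk, down):
    return {"sources": [{"model": model, "layer_range": [lo, hi],
                         "parameters": {"scale": [
                             {"filter": "q_proj", "value": qk},
                             {"filter": "k_proj", "value": qk},
                             {"filter": "down_proj", "value": down},
                             {"value": 1.0}]}}]}

def plain_slice(model, lo, hi):
    return {"sources": [{"model": model, "layer_range": [lo, hi]}]}

def config(qk, down, n_layers=80, block=16):
    slices = []
    for lo in range(0, n_layers, block):
        interior = 0 < lo < n_layers - block        # don't scale the outermost blocks
        for model in MODELS:                        # 16-layer interleave, no skips
            slices.append(scaled_slice(model, lo, lo + block, qk, down) if interior
                          else plain_slice(model, lo, lo + block))
    return {"merge_method": "passthrough", "slices": slices, "dtype": "bfloat16"}

for qk, down in itertools.product([0.85, 0.90, 0.95], [0.80, 0.85, 0.90]):
    with open(f"merge_qk{qk}_down{down}.yaml", "w") as f:
        yaml.safe_dump(config(qk, down), f, sort_keys=False)
```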
I'm also messing around with context extension - SLERPing against the context-extended giraffe and gradient models. abacusai/Llama-3-Giraffe-70B looks very promising.
[edit]
Interestingly I am seeing slightly better results using variable proj scaling, with the lowest value being in the middle of the model.
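i.e. picking the per-block value from a schedule rather than one global number - the cosine shape and endpoints below are purely illustrative, not values I've settled on:

```python
import math

def variable_scale(layer, n_layers, outer=1.0, middle=0.85):
    """Ramp the proj scale from `outer` at the first/last layers down to `middle`
    at the centre of the stack (illustrative cosine schedule, not tuned values)."""
    t = layer / (n_layers - 1)                      # 0.0 .. 1.0 through the stack
    return middle + (outer - middle) * 0.5 * (1.0 + math.cos(2.0 * math.pi * t))

for layer in (0, 20, 40, 60, 79):
    print(layer, round(variable_scale(layer, 80), 3))
```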
I also found that: k_proj and q_proj = sqrt(sqrt(1/2)) = 0.84, down_proj = sqrt(1/2) = 0.71 can make almost any frankenmerge almost 100% coherent
Hey Juk, thanks for all the great info and insights. :)
I just have a question: would this method apply to an already frankenmerged model?
i.e. I want to do a self-merge of Undi95/PsyMedRP-v1-20B to expand its creative qualities, and I'm wondering if your methods would work for it too.