Why is there an MLP in the Mamba Layer?

#28
by naston - opened

I enjoyed the Jamba paper but was left with one lingering question. The original Mamba paper shows that interleaving Mamba blocks with MLP blocks actually degrades overall performance, a finding that seems to run counter to the design of the Mamba Layer in Jamba. Why was this design chosen instead of a Mamba Layer with two Mamba components, as the authors' findings would suggest?
(The Mamba MoE Layer, on the other hand, agrees with what I have seen in the literature.)
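
To make the comparison concrete, here is a rough sketch of the two layouts I have in mind. This is only my reading of the figure and of the Mamba paper, not code from either one: the function names are made up for illustration, and norms and residual connections are left out.

```python
# Hypothetical sketch: each "layer" is just a list of sublayer names.
# None of these names come from the Jamba or Mamba codebases.

def jamba_style_mamba_layer():
    # As I read the Jamba figure: a Mamba mixer followed by an MLP.
    return ["mamba", "mlp"]


def two_mamba_layer():
    # The alternative I am asking about: two Mamba components, no MLP.
    return ["mamba", "mamba"]


def stack(layer_fn, n_layers):
    # Flatten n_layers copies of a layer into one sequence of sublayers.
    return [sub for _ in range(n_layers) for sub in layer_fn()]


if __name__ == "__main__":
    print(stack(jamba_style_mamba_layer, 2))
    print(stack(two_mamba_layer, 2))
```

Stacking the first layout gives the interleaved Mamba/MLP pattern that I understood the Mamba paper to argue against, while the second gives a homogeneous Mamba stack.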

jamba.PNG
