Why is there an MLP in the Mamba Layer?
#28
Opened by naston
I enjoyed the Jamba paper but was left with one lingering question. In the original Mamba paper, the authors show that interleaving Mamba blocks with MLP blocks actually degrades overall performance, a finding that seems to run counter to the design of the Mamba Layer in Jamba. Why was this decision made, instead of building a Mamba Layer from two Mamba components, as the authors' findings would suggest?
(The Mamba MoE Layer, on the other hand, agrees with what I have seen in the literature.)
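
To make the comparison concrete, here is a rough sketch of the two layouts I have in mind. It only models the order of sublayers, not real modules: `LayerSpec`, `JAMBA_MAMBA_LAYER`, `DOUBLE_MAMBA_LAYER`, and `stack` are hypothetical names I made up for illustration, and the strings `"mamba"` and `"mlp"` are just labels. Normalization and residual connections are left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LayerSpec:
    """Hypothetical description of one layer as an ordered tuple of sublayer labels."""
    name: str
    sublayers: Tuple[str, ...]


# Layout as I understand it from the Jamba paper: a Mamba mixer followed by an MLP.
JAMBA_MAMBA_LAYER = LayerSpec("mamba_then_mlp", ("mamba", "mlp"))

# Alternative I am asking about: two Mamba components and no MLP.
DOUBLE_MAMBA_LAYER = LayerSpec("mamba_then_mamba", ("mamba", "mamba"))


def stack(layer: LayerSpec, n_layers: int) -> List[str]:
    """Flatten n_layers copies of a layer into a single sublayer sequence."""
    return [kind for _ in range(n_layers) for kind in layer.sublayers]


if __name__ == "__main__":
    for spec in (JAMBA_MAMBA_LAYER, DOUBLE_MAMBA_LAYER):
        print(spec.name, stack(spec, 2))
```

Stacking the first spec gives the interleaved Mamba/MLP pattern, while the second gives a homogeneous run of Mamba blocks, which is the arrangement I understood the Mamba paper to favor.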