Is the 7B trained on 1.5 trillion tokens, but the *40B* on only 1 trillion?
Is this a typo or was there a reasoning behind this decision?
Most likely it's because training a 40B model is significantly more expensive than training a 7B model.
I'm interested in this question too. Looking forward to an official explanation from the authors.
BTW, according to the leaderboard, falcon-7b outperforms mpt-7b by only 0.2, which could also be attributed to the fact that falcon-7b was trained on just 1T tokens of unrefined web data.
Hey!
This is a purely arbitrary decision :). We iterate a lot on internal models, and Falcon-40B was our first serious foray into this scale--so we wanted to validate infra, codebase, data, etc. That's why we stuck to 1T.
The 7B came later, when we had 384 GPUs unscheduled for two weeks, so 1.5T was a good match.
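
For anyone curious how 384 GPUs for two weeks lines up with a ~1.5T-token budget, here is a rough back-of-envelope; the per-GPU peak throughput and utilisation figures below are illustrative assumptions, not official training stats:

```python
# Back-of-envelope: tokens trainable in a fixed GPU allocation.
# All hardware numbers here are assumptions for illustration only.
n_gpus = 384
days = 14
peak_flops = 312e12           # assumed bf16 peak of an A100-class GPU, FLOP/s
mfu = 0.45                    # assumed model FLOPs utilisation
params = 7e9                  # 7B parameters
flops_per_token = 6 * params  # standard ~6N training FLOPs per token

gpu_seconds = n_gpus * days * 24 * 3600
tokens = gpu_seconds * peak_flops * mfu / flops_per_token
print(f"~{tokens / 1e12:.2f}T tokens")  # ≈ 1.55T under these assumptions
```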
Regarding the difference with MPT-7B being smaller than expected, we believe this is due to a combination of three factors: (1) we are approaching the limits of what can be done with a 7B pretrained model; (2) multiquery attention with a head size of 64 improves inference scalability, but at the cost of some task performance; (3) for the 7B, we experimented with a very large batch size.
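
To illustrate the trade-off in (2): in multiquery attention, all query heads share a single key/value head, which shrinks the KV cache at inference time. Below is a minimal PyTorch sketch of the idea (not the actual Falcon code); the d_model and head-count defaults are just illustrative, only the 64-dim head size comes from point (2) above:

```python
import math
import torch
import torch.nn as nn

class MultiQueryAttention(nn.Module):
    """Multi-query attention: many query heads, one shared key/value head.
    The KV cache shrinks by a factor of n_heads, which helps inference
    scalability, at some cost in task performance."""

    def __init__(self, d_model: int = 4544, n_heads: int = 71, head_dim: int = 64):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        self.q_proj = nn.Linear(d_model, n_heads * head_dim, bias=False)  # per-head queries
        self.kv_proj = nn.Linear(d_model, 2 * head_dim, bias=False)       # a single K and a single V head
        self.out_proj = nn.Linear(n_heads * head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)  # (b, h, t, d)
        k, v = self.kv_proj(x).split(self.head_dim, dim=-1)                         # (b, t, d) each
        k, v = k.unsqueeze(1), v.unsqueeze(1)                                       # (b, 1, t, d): shared across heads
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)                  # broadcasts over the head dim
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        att = att.masked_fill(causal, float("-inf")).softmax(dim=-1)
        out = (att @ v).transpose(1, 2).reshape(b, t, self.n_heads * self.head_dim)
        return self.out_proj(out)
```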