Abstract
We break the linear link between a layer's size and its inference cost by introducing the fast feedforward (FFF) architecture, a log-time alternative to feedforward networks. We demonstrate that FFFs are up to 220x faster than feedforward networks, up to 6x faster than mixture-of-experts networks, and exhibit better training properties than mixtures of experts thanks to noiseless conditional execution. Pushing FFFs to the limit, we show that they can use as little as 1% of layer neurons for inference in vision transformers while preserving 94.2% of predictive performance.
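To make the log-time claim concrete, below is a minimal PyTorch sketch of the idea, assuming the tree-based conditional execution the abstract alludes to: a depth-d binary tree of routing neurons whose leaves are small feedforward blocks, so inference touches only one root-to-leaf path. The class name `FastFeedforward` and the parameter `leaf_width` are illustrative, not the paper's API, and only the hard inference-time routing is shown (training in the paper uses a soft, noiseless mixture of the children instead).

```python
import torch
import torch.nn as nn


class FastFeedforward(nn.Module):
    """Illustrative FFF layer: a depth-d binary tree whose leaves are
    small feedforward blocks. At inference, each input is routed down
    a single root-to-leaf path, so only one of 2**d leaves executes,
    giving cost logarithmic in the number of leaves."""

    def __init__(self, width: int, depth: int, leaf_width: int):
        super().__init__()
        self.depth = depth
        # One routing neuron per internal tree node (2**depth - 1 nodes).
        self.node = nn.Linear(width, 2 ** depth - 1)
        # One small feedforward block per leaf (2**depth leaves).
        self.leaves = nn.ModuleList(
            nn.Sequential(
                nn.Linear(width, leaf_width),
                nn.GELU(),
                nn.Linear(leaf_width, width),
            )
            for _ in range(2 ** depth)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Hard inference-time routing: descend the tree, picking the
        # left or right child from the sign of each node's activation.
        logits = self.node(x)  # shape: (batch, 2**depth - 1)
        out = torch.empty_like(x)
        for i in range(x.size(0)):
            node = 0
            for _ in range(self.depth):
                go_right = int(logits[i, node] > 0)
                node = 2 * node + 1 + go_right  # heap-style child index
            leaf = node - (2 ** self.depth - 1)
            out[i] = self.leaves[leaf](x[i])  # only this leaf runs
        return out


# Usage: a layer with 2**4 = 16 leaves evaluates just one per input.
layer = FastFeedforward(width=64, depth=4, leaf_width=32)
y = layer(torch.randn(8, 64))
```

The per-sample Python loop is for clarity only; a practical implementation would batch the descent, but the key point stands: the work per input scales with the tree depth, not with the total number of leaf neurons.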