A Loss Curvature Perspective on Training Instability in Deep Learning Paper • 2110.04369 • Published Oct 8, 2021
Small-scale proxies for large-scale Transformer training instabilities Paper • 2309.14322 • Published Sep 25, 2023
Transformers Can Navigate Mazes With Multi-Step Prediction Paper • 2412.05117 • Published 22 days ago