Diving into the MiniMax-01 456B MoE
Published January 15, 2025
This post dives into the technical details of the MiniMax-Text-01 model. Let's break it down:
tl;dr:
- Very (very) nice paper/model with lots of details and experimental insights.
- Hybrid attention: 7/8 Lightning Attention (linear) + 1/8 softmax.
- Unique MoE strategy, different from DeepSeek.
- Incorporates DeepNorm and a WSD learning-rate schedule.
- Training details: ~2000 H800 GPUs, ~12T tokens.
MoE Specifications (vs DeepSeek v3):
- Token-drop strategy: Uses an auxiliary load-balancing loss, unlike DeepSeek v3's dropless, auxiliary-loss-free balancing (a minimal routing sketch follows this list).
- Global router: Optimized to balance the number of tokens per EP group.
- Top k: 2 (vs 8 + 1 shared in DeepSeek).
- Expert (MoE FFN) hidden size: 9216 vs 2048 (DeepSeek).
- Roughly the same activated expert capacity per layer: 9 × 2048 = 2 × 9216 = 18432 (total activated FFN intermediate width).
- Expert count: 32 vs 256 (+1 shared).
- Layers: 80 vs 61. Linear attention benefits more from depth than width.
- Hidden size: 6144 vs 7168.
- No shared expert: Unique design choice.
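To make the routing setup concrete, here is a minimal PyTorch sketch of top-2 routing over 32 experts with a Switch/GShard-style auxiliary load-balancing loss. The loss coefficient, the renormalization of the two gate weights, and everything about the paper's global router / EP-group balancing are assumptions or simplifications, not the actual implementation.

```python
import torch
import torch.nn.functional as F

def top2_route(hidden, router_weight, num_experts=32, aux_coef=0.01):
    # hidden: (num_tokens, hidden_size); router_weight: (hidden_size, num_experts)
    logits = hidden @ router_weight
    probs = logits.softmax(dim=-1)              # (tokens, experts)
    top_p, top_idx = probs.topk(2, dim=-1)      # top-2 experts per token

    # Switch-style auxiliary load-balancing loss:
    # (fraction of tokens routed to each expert) x (mean router prob per expert)
    counts = F.one_hot(top_idx, num_experts).float().sum(dim=(0, 1))
    frac_tokens = counts / counts.sum()
    frac_probs = probs.mean(dim=0)
    aux_loss = aux_coef * num_experts * (frac_tokens * frac_probs).sum()

    # Renormalize the two gate weights so they sum to 1 (a common choice).
    gates = top_p / top_p.sum(dim=-1, keepdim=True)
    return top_idx, gates, aux_loss

# Toy usage
x = torch.randn(16, 6144)            # 16 tokens, hidden size 6144
w = torch.randn(6144, 32) * 0.02     # router for 32 experts
idx, gates, aux = top2_route(x, w)
```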
Hybrid Modeling:
7/8 Linear Attention (Lightning Attention-2):
- Essentially the "FlashAttention" of NormAttention: an I/O-aware, tiled implementation of linear attention.
- Big advantage: Complexity O(d^2n) instead of O(n^2d), making very long contexts feasible.
- Mechanism (a runnable sketch follows below):
- Input X
- Q, K, V = SiLU(X W_q), SiLU(X W_k), SiLU(X W_v)
- Y = Q * (K^T * V) (right-product form; complexity O(d^2) per token)
- Output = RMSNorm(Y) * sigmoid(X W_g) (learned output gate)
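Here is a minimal, non-causal, unoptimized PyTorch sketch of that block. The projection names W_q/W_k/W_v/W_g and the RMSNorm epsilon are conventional placeholders; the real Lightning Attention kernel is a tiled, causal, I/O-aware implementation.

```python
import torch
import torch.nn.functional as F

def linear_attention_block(x, W_q, W_k, W_v, W_g, eps=1e-6):
    # x: (seq_len, d); all projection matrices: (d, d)
    q = F.silu(x @ W_q)
    k = F.silu(x @ W_k)
    v = F.silu(x @ W_v)

    # "Right product" trick: build the (d, d) state K^T V first,
    # so the cost is O(n * d^2) instead of the O(n^2 * d) of softmax attention.
    kv = k.transpose(-2, -1) @ v        # (d, d)
    y = q @ kv                          # (seq_len, d)

    # RMSNorm followed by a learned sigmoid output gate.
    y = y * torch.rsqrt(y.pow(2).mean(dim=-1, keepdim=True) + eps)
    return y * torch.sigmoid(x @ W_g)

# Toy usage
d = 64
x = torch.randn(128, d)
out = linear_attention_block(x, *(torch.randn(d, d) * 0.02 for _ in range(4)))
```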
1/8 Softmax Attention:
- Applies RoPE to only half of each head's dimensions (claimed to allow length extrapolation without performance degradation); a rough sketch follows below.
- The RoPE base of 10k seems unconventional compared to other models.
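A rough sketch of what "RoPE on half the dimensions" can look like per head. The split point, the interleaved rotation layout, and the function interface are assumptions for illustration, not the model's actual implementation.

```python
import torch

def rope_half(x, base=10_000.0):
    # x: (seq_len, n_heads, head_dim); rotate only the first half of head_dim.
    seq, _, head_dim = x.shape
    rot_dim = head_dim // 2
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]

    # Standard RoPE applied to the rotated slice only.
    inv_freq = 1.0 / (base ** (torch.arange(0, rot_dim, 2).float() / rot_dim))
    t = torch.arange(seq).float()
    freqs = torch.outer(t, inv_freq)                 # (seq, rot_dim / 2)
    cos, sin = freqs.cos()[:, None, :], freqs.sin()[:, None, :]

    x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    rotated = rotated.flatten(-2)                    # interleave pairs back

    # The second half passes through without any positional rotation.
    return torch.cat([rotated, x_pass], dim=-1)

# Toy usage
q = torch.randn(1024, 8, 128)
q_rot = rope_half(q)
```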
Comparisons:
- cosformer2: Very bad at NIAH (needle in a haystack); overall slower and lower performance.
- hgrn2: Slightly better, but slower (plus a small gap on NIAH).
- SWA (sliding-window attention) softmax: Similar TGS (tokens per GPU per second), but overall lower performance and a big gap on NIAH.
- Limitations: No complex benchmarks for long contexts beyond NIAH.
Model Design Thinking:
Goal: Build the best model that fits on a single H100 node (8×80GB) with:
- 1M sequence length.
- 8-bit quantization.
- Good balance of softmax and linear attention (see the layout sketch after this list).
- Optimal depth/width ratio (deeper models need more softmax layers).
- Proper memory size vs hidden size ratio.
- Effective FFN size vs model dimension.
- Dimensional considerations for RoPE use in softmax attention.
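To make the attention mix concrete: assuming the 7/8 vs 1/8 split is realized as one softmax-attention block after every seven lightning-attention blocks (consistent with the ratios above, though the exact placement is an assumption), an 80-layer stack looks like this:

```python
NUM_LAYERS = 80  # per the spec above

# One softmax block after every seven lightning (linear) blocks.
layout = ["softmax" if (i + 1) % 8 == 0 else "lightning" for i in range(NUM_LAYERS)]

assert layout.count("softmax") == NUM_LAYERS // 8                  # 10 softmax layers (1/8)
assert layout.count("lightning") == NUM_LAYERS - NUM_LAYERS // 8   # 70 linear layers (7/8)
```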
Scaling Laws and Experiments:
- MoE vs Dense: At 1T tokens, MoE (2B active, 24B total params) significantly outperforms dense models (7B) at the same FLOPs.
(Be cautious about Fig. 4's axis; scores are similar across all benchmarks.)
- Linear vs Softmax vs Hybrid Attention:
- Range: 70M → 7B params on 300B tokens.
- Softmax > Linear for NIAH (big gap).
- Hybrid performs better than softmax overall.
- Caveat: The scaling-law runs use fixed learning rates without a fast final decay, which may skew comparisons in favor of the hybrid.
Training Data:
- Used a previous MoE (5B active, 60B total) for data labeling; likely a classifier was trained on those labels afterwards (details unclear).
- Metrics: Knowledge, practical helpfulness, categorical distribution.
- Data formatting balances QA format and natural distribution (might improve MMLU).
- Deduplication: High-quality data deduped 4x, low-quality 2x.
- Tracks metrics with acc_norm (byte normalization); a small scoring sketch follows this list.
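For reference, here is a small sketch of what byte-normalized multiple-choice scoring (acc_norm-style, as in lm-evaluation-harness) typically looks like; the function name and interface are made up for illustration.

```python
def pick_choice_byte_normalized(loglikelihoods, choices):
    # Score each candidate completion by log-likelihood per UTF-8 byte,
    # so longer answers aren't penalized simply for containing more tokens.
    scores = [
        ll / len(choice.encode("utf-8"))
        for ll, choice in zip(loglikelihoods, choices)
    ]
    return max(range(len(choices)), key=lambda i: scores[i])

# Toy usage: the log-likelihoods come from the model's scoring of each choice.
best = pick_choice_byte_normalized([-12.3, -9.8, -15.1], ["Paris", "London", "Berlin"])
```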
Training Hyperparameters:
- ~12T tokens.
- WSD-like schedule: Reduces the LR to 10% of the peak; no final fast decay (a toy schedule sketch follows this list).
- Initialization: Xavier with DeepNorm modifications.
- Optimizer: AdamW (0.9, 0.95).
- Batch-size warmup from 16M to 128M tokens, with an explanation tied to the critical batch size: unique and insightful (included in the sketch below).
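A toy sketch of the two schedules described above. The warmup length, the point where the decay to 10% of peak begins, and the doubling pattern of the batch-size ramp are all made-up values for illustration, not the paper's settings.

```python
def wsd_like_lr(step, total_steps, peak_lr=2e-4, warmup_steps=2000, decay_start_frac=0.8):
    # Warmup -> stable plateau at peak -> decay to 10% of peak (no final fast decay).
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_start = int(decay_start_frac * total_steps)
    if step < decay_start:
        return peak_lr
    frac = min(1.0, (step - decay_start) / max(1, total_steps - decay_start))
    return peak_lr * (1.0 - 0.9 * frac)   # ends (and stays) at 0.1 * peak

def tokens_per_batch(step, ramp_steps=10_000):
    # Batch-size warmup from 16M to 128M tokens (doubling at fixed intervals here;
    # the paper ties the ramp to the critical batch size).
    sizes = [16e6, 32e6, 64e6, 128e6]
    return sizes[min(len(sizes) - 1, step // ramp_steps)]
```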
Long Context Training:
Three phases:
- Main training (8k context, RoPE base 10k).
- 128k context, 300B tokens total, RoPE base 5M, mix of short (<32k) and medium (<128k) contexts.
- 512k → 1M context, RoPE base 10M, balanced across short, medium, and long contexts.
Key technique: Linear interpolation of the weights across phases to mitigate distribution shift (sketched below):
W_t = alpha * W_prev + (1 - alpha) * W_current.
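A minimal sketch of that interpolation applied over model state dicts. How alpha is scheduled across steps isn't specified here, so it is left as a user-chosen constant, and the sketch assumes all entries are floating-point tensors.

```python
import torch

@torch.no_grad()
def interpolate_weights(prev_state, curr_state, alpha):
    # W_t = alpha * W_prev + (1 - alpha) * W_current, applied parameter-wise.
    return {
        name: alpha * prev_state[name] + (1.0 - alpha) * curr_state[name]
        for name in curr_state
    }
```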
Post Training and Infra:
Post-training steps: Iterative SFT → RL (offline: DPO, online: GRPO).
- Short-context SFT → Long-context SFT → Short-context RL → Long-context RL.
(Critical for achieving great long-context performance.)
Infrastructure:
- ~1500–2500 GPUs.
- Efficient MoE and tensor parallelism, with optimized ring attention.
- Improvements in linear attention sequence parallelism and padding optimization.