Diving into MiniMax-01 456B MoE

Community Article Published January 15, 2025

This post dives into the technical details of the MiniMax-Text-01 model. Let's break it down:

tl;dr:

  • Very (very) nice paper/model with lots of details and experimental insights.
  • Hybrid attention: 7/8 Lightning Attention (linear) + 1/8 softmax.
  • Unique MoE strategy, different from DeepSeek.
  • Incorporates DeepNorm and a WSD-like learning-rate schedule.
  • Training details: ~2000 H800 GPUs, ~12T tokens.

MoE Specifications (vs DeepSeek v3):

  • Token-drop strategy: Uses an auxiliary loss for load balancing, unlike DeepSeek v3's dropless, auxiliary-loss-free balancing (see the routing sketch after this list).
  • Global router: Optimized to balance the number of tokens per EP group.
  • Top k: 2 (vs 8 + 1 shared in DeepSeek).
  • Expert hidden size: 9216 vs 2048 (DeepSeek).
  • Same total activated expert width per layer: 9 * 2048 = 2 * 9216 = 18432.
  • Expert count: 32 vs 256 (+1 shared).
  • Layers: 80 vs 61. Linear attention benefits more from depth than width.
  • Hidden size: 6144 vs 7168.
  • No shared expert: Unique design choice.
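
As a rough illustration of the routing described above, here is a minimal sketch of a top-2 router with a Switch-style auxiliary load-balancing loss. All names (`top2_route`, `router_weight`) and the exact loss form are assumptions; the paper's global router additionally balances token counts across EP groups, which is not shown here.

```python
import torch
import torch.nn.functional as F

def top2_route(x, router_weight, num_experts=32, top_k=2):
    # x: (tokens, hidden), router_weight: (hidden, num_experts)
    logits = x @ router_weight                    # (tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_idx = probs.topk(top_k, dim=-1)

    # Auxiliary load-balancing loss (Switch-style): pushes expert usage
    # toward uniform. f = fraction of tokens hitting each expert,
    # p = mean router probability per expert.
    assignment = F.one_hot(topk_idx, num_experts).float().sum(dim=1)  # (tokens, E)
    f = assignment.mean(dim=0)
    p = probs.mean(dim=0)
    aux_loss = num_experts * (f * p).sum()

    return topk_idx, topk_probs, aux_loss

# Example with MiniMax-Text-01-like sizes: hidden 6144, 32 experts, top-2.
x = torch.randn(16, 6144)
w = torch.randn(6144, 32) * 0.02
idx, weights, aux = top2_route(x, w)
```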

Hybrid Modeling:

7/8 Linear Attention (Lightning Attention-2):

  • Plays the same role for NormAttention that FlashAttention plays for softmax attention.
  • Big advantage: Complexity O(d^2n) instead of O(n^2d), making very long contexts feasible.
  • Mechanism (sketched in code right after this list):
    1. Input X.
    2. Q, K, V = SiLU(X W_q), SiLU(X W_k), SiLU(X W_v) (SiLU-activated projections of X).
    3. Y = Q * (K^T * V) (O(d^2) per token, O(n d^2) overall instead of O(n^2 d)).
    4. Output = RMSNorm(Y) * sigmoid(X W_g) (gated output).
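
A minimal, non-causal sketch of the computation above (single head, no decay term; `W_q`, `W_k`, `W_v`, `W_g` are hypothetical projection names). The real Lightning Attention-2 kernel is causal and tiled, but the matmul ordering below is what yields the O(n d^2) cost.

```python
import torch
import torch.nn.functional as F

def linear_attention(x, W_q, W_k, W_v, W_g, eps=1e-6):
    # x: (n, d); each projection weight: (d, d)
    q = F.silu(x @ W_q)
    k = F.silu(x @ W_k)
    v = F.silu(x @ W_v)
    kv = k.transpose(0, 1) @ v              # (d, d): costs O(n * d^2)
    y = q @ kv                              # (n, d): costs O(n * d^2)
    y = y * torch.rsqrt(y.pow(2).mean(-1, keepdim=True) + eps)  # RMSNorm (no learned scale)
    return y * torch.sigmoid(x @ W_g)       # gated output

n, d = 4096, 128
x = torch.randn(n, d)
W_q, W_k, W_v, W_g = (torch.randn(d, d) / d**0.5 for _ in range(4))
y = linear_attention(x, W_q, W_k, W_v, W_g)  # (n, d)
```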

1/8 Softmax Attention:

  • Applies RoPE to only half of each head's dimensions (claimed to allow length extrapolation without performance degradation); a small sketch follows below.
  • A RoPE base of 10k seems unconventional compared to other recent models.
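
A minimal sketch of partial RoPE, i.e. rotating only the first half of each head dimension and leaving the rest position-free. The shapes, the split point, and the rotate-half convention are illustrative assumptions.

```python
import torch

def partial_rope(x, base=10_000):
    # x: (seq_len, num_heads, head_dim); rotate only the first half of head_dim.
    seq_len, _, head_dim = x.shape
    rot_dim = head_dim // 2

    inv_freq = 1.0 / (base ** (torch.arange(0, rot_dim, 2).float() / rot_dim))
    pos = torch.arange(seq_len).float()
    freqs = torch.outer(pos, inv_freq)                    # (seq_len, rot_dim // 2)
    cos = torch.cat([freqs.cos(), freqs.cos()], dim=-1)[:, None, :]
    sin = torch.cat([freqs.sin(), freqs.sin()], dim=-1)[:, None, :]

    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]
    x1, x2 = x_rot[..., : rot_dim // 2], x_rot[..., rot_dim // 2 :]
    rotated = torch.cat([-x2, x1], dim=-1)                # usual rotate-half trick
    x_rot = x_rot * cos + rotated * sin
    return torch.cat([x_rot, x_pass], dim=-1)             # second half left untouched

q = torch.randn(8192, 64, 128)          # 8k tokens, 64 heads, head_dim 128
q_rope = partial_rope(q, base=10_000)   # same shape; only the first 64 dims are rotated
```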

Comparisons:

  • cosformer2: Very bad at NIAH (needle-in-a-haystack), overall slower and lower performance.
  • hgrn2: Slightly better but slower (+ small gap on NIAH).
  • SWA (sliding-window attention) softmax: Similar TGS (tokens per GPU per second), but overall lower performance and a big gap on NIAH.
  • Limitations: No complex benchmarks for long contexts beyond NIAH.

Model Design Thinking:

Goal: Build the best model that fits on a single H100-class node (8x80 GB) with:

  • 1M sequence length.
  • 8-bit quantization.
  • Good balance of softmax and linear attention.
  • Optimal depth/width ratio (deeper models need more softmax layers).
  • Proper memory size vs hidden size ratio.
  • Effective FFN size vs model dimension.
  • Dimensional considerations for rope use in softmax attention.

Scaling Laws and Experiments:

  • MoE vs Dense: At 1T tokens, an MoE (2B active, 24B total params) significantly outperforms a dense 7B model at the same FLOPs.
    (Be cautious about Fig. 4's axes; scores are similar across all benchmarks.)
  • Linear vs Softmax vs Hybrid Attention:
    • Range: 70M → 7B params on 300B tokens.
    • Softmax > Linear for NIAH (big gap).
    • Hybrid performs better than softmax overall.
  • Caveats: Fixed learning rates for scaling laws without fast decay may skew comparisons in favor of hybrid.

Training Data:

  • Used the previous MoE (5B active, 60B total) for data labeling; likely trained a classifier afterwards (details unclear).
  • Metrics: Knowledge, practical helpfulness, categorical distribution.
  • Data formatting balances QA format and natural distribution (might improve MMLU).
  • Deduplication: High-quality data deduped 4x, low-quality 2x.
  • Tracks downstream metrics with acc_norm^2 (byte-normalized accuracy); see the sketch below.
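
For reference, byte-normalized multiple-choice accuracy (in the spirit of lm-eval's `acc_norm`) picks the option with the highest summed log-likelihood divided by its byte length; the exact "^2" variant is not detailed in the post, so this sketch shows only the plain form.

```python
def byte_normalized_prediction(choice_logprobs, choices):
    """Pick the choice with the highest log-likelihood per byte of its text."""
    scores = [lp / len(c.encode("utf-8")) for lp, c in zip(choice_logprobs, choices)]
    return max(range(len(choices)), key=lambda i: scores[i])

# Hypothetical summed log-likelihoods; normalization keeps longer options
# from being penalized purely for their length.
choices = ["Paris", "London is the capital of France"]
logprobs = [-2.0, -20.0]
print(byte_normalized_prediction(logprobs, choices))  # -> 0
```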

Training Hyperparameters:

  • ~12T tokens.
  • WSD-like schedule: Reduces the LR to 10% of the peak, with no final decay phase (see the sketch after this list).
  • Initialization: Xavier with deepnorm modifications.
  • Optimizer: AdamW (0.9, 0.95).
  • Critical batch size warmup explanation (16M → 128M): Unique and insightful.
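
A minimal sketch of a WSD-style schedule matching the shape described above (linear warmup, a long stable phase at the peak LR, then a decay down to 10% of the peak rather than to ~0). The warmup length and decay fraction are assumptions.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps=2000, decay_frac=0.1):
    """Warmup-Stable-Decay learning rate; the decay ends at 10% of the peak."""
    decay_start = int(total_steps * (1 - decay_frac))  # assume decay over the last 10% of steps
    if step < warmup_steps:                            # linear warmup
        return peak_lr * step / warmup_steps
    if step < decay_start:                             # stable phase at peak
        return peak_lr
    progress = (step - decay_start) / (total_steps - decay_start)
    return peak_lr * (1.0 - 0.9 * progress)            # linear decay: peak -> 0.1 * peak

lrs = [wsd_lr(s, total_steps=100_000, peak_lr=2e-4) for s in range(0, 100_001, 10_000)]
```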

Long Context Training:

Three phases:

  1. Main training (8k tokens, rope 10k).
  2. 128k tokens, 300B tokens total, rope base 5M, mix of short (<32k) and medium (<128k) contexts.
  3. 512k → 1M tokens, rope base 10M, balanced across short, medium, and long contexts.

Key Technique: Linear interpolation to mitigate distribution shift:
W_t = alpha * W_prev + (1-alpha) * W_current.
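
A minimal sketch of that interpolation applied parameter-wise when moving from one context-extension phase to the next; the value of alpha and how often the blend is applied are not specified in the post.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def interpolate_weights(model_current, model_prev, alpha=0.5):
    # W_t = alpha * W_prev + (1 - alpha) * W_current, for every parameter.
    for p_cur, p_prev in zip(model_current.parameters(), model_prev.parameters()):
        p_cur.mul_(1.0 - alpha).add_(p_prev, alpha=alpha)
    return model_current

# Toy usage: two identically shaped modules standing in for the
# previous-phase and current-phase checkpoints.
prev_phase = nn.Linear(16, 16)
curr_phase = nn.Linear(16, 16)
interpolate_weights(curr_phase, prev_phase, alpha=0.3)
```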


Post Training and Infra:

  • Post-training steps: Iterative SFT → RL (offline: DPO, online: GRPO).
    • Short-context SFT → Long-context SFT → Short-context RL → Long-context RL.
      (Critical for achieving great long-context performance.)
  • Infrastructure:
    • ~1500–2500 GPUs.
    • Efficient MoE tensor/expert parallelism with optimized ring attention.
    • Improvements in linear-attention sequence parallelism and padding optimization.
