Diving into MiniMax-01 456B MoE

Community Article Published January 15, 2025

This post dives into the technical details of the MiniMax-Text-01 model. Let's break it down:

tl;dr:

  • Very (very) nice paper/model with lots of details and experimental insights.
  • Hybrid attention: 7/8 Lightning Attention (linear) + 1/8 softmax.
  • Unique MoE strategy, different from DeepSeek.
  • Incorporates DeepNorm and a WSD-like learning-rate schedule.
  • Training details: ~2000 H800 GPUs, ~12T tokens.

MoE Specifications (vs DeepSeek v3):

  • Token-drop strategy: Uses an auxiliary loss for load balancing, unlike DeepSeek v3's dropless, auxiliary-loss-free balancing (see the routing sketch after this list).
  • Global router: Optimized to balance the number of tokens per EP group.
  • Top k: 2 (vs 8 + 1 shared in DeepSeek).
  • Expert hidden size: 9216 vs 2048 (DeepSeek).
  • Same total activated expert width per layer: 9 * 2048 = 2 * 9216 = 18432.
  • Expert count: 32 vs 256 (+1 shared).
  • Layers: 80 vs 61. Linear attention benefits more from depth than width.
  • Hidden size: 6144 vs 7168.
  • No shared expert: Unique design choice.
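
As a rough illustration of the routing described above, here is a minimal sketch of a top-2 router with a Switch-style auxiliary load-balancing loss. All names (`top2_route`, `router_weight`) and the exact loss form are assumptions; the paper's global router additionally balances token counts across EP groups, which is not shown here.

```python
import torch
import torch.nn.functional as F

def top2_route(x, router_weight, num_experts=32, top_k=2):
    # x: (tokens, hidden), router_weight: (hidden, num_experts)
    logits = x @ router_weight                    # (tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_idx = probs.topk(top_k, dim=-1)

    # Auxiliary load-balancing loss (Switch-style): pushes expert usage
    # toward uniform. f = fraction of tokens hitting each expert,
    # p = mean router probability per expert.
    assignment = F.one_hot(topk_idx, num_experts).float().sum(dim=1)  # (tokens, E)
    f = assignment.mean(dim=0)
    p = probs.mean(dim=0)
    aux_loss = num_experts * (f * p).sum()

    return topk_idx, topk_probs, aux_loss

# Example with MiniMax-Text-01-like sizes: hidden 6144, 32 experts, top-2.
x = torch.randn(16, 6144)
w = torch.randn(6144, 32) * 0.02
idx, weights, aux = top2_route(x, w)
```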

Hybrid Modeling:

7/8 Linear Attention (Lightning Attention-2):

  • Plays the same role for NormAttention that FlashAttention plays for softmax attention.
  • Big advantage: Complexity O(d^2n) instead of O(n^2d), making very long contexts feasible.
  • Mechanism (sketched in code right after this list):
    1. Input X.
    2. Q, K, V = SiLU(X W_q), SiLU(X W_k), SiLU(X W_v) (SiLU-activated projections of X).
    3. Y = Q * (K^T * V) (O(d^2) per token, O(n d^2) overall instead of O(n^2 d)).
    4. Output = RMSNorm(Y) * sigmoid(X W_g) (gated output).
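
A minimal, non-causal sketch of the computation above (single head, no decay term; `W_q`, `W_k`, `W_v`, `W_g` are hypothetical projection names). The real Lightning Attention-2 kernel is causal and tiled, but the matmul ordering below is what yields the O(n d^2) cost.

```python
import torch
import torch.nn.functional as F

def linear_attention(x, W_q, W_k, W_v, W_g, eps=1e-6):
    # x: (n, d); each projection weight: (d, d)
    q = F.silu(x @ W_q)
    k = F.silu(x @ W_k)
    v = F.silu(x @ W_v)
    kv = k.transpose(0, 1) @ v              # (d, d): costs O(n * d^2)
    y = q @ kv                              # (n, d): costs O(n * d^2)
    y = y * torch.rsqrt(y.pow(2).mean(-1, keepdim=True) + eps)  # RMSNorm (no learned scale)
    return y * torch.sigmoid(x @ W_g)       # gated output

n, d = 4096, 128
x = torch.randn(n, d)
W_q, W_k, W_v, W_g = (torch.randn(d, d) / d**0.5 for _ in range(4))
y = linear_attention(x, W_q, W_k, W_v, W_g)  # (n, d)
```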

1/8 Softmax Attention:

  • Applies RoPE to only half of each head's dimensions (claimed to allow length extrapolation without performance degradation); a small sketch follows below.
  • A RoPE base of 10k seems unconventional compared to other recent models.
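
A minimal sketch of partial RoPE, i.e. rotating only the first half of each head dimension and leaving the rest position-free. The shapes, the split point, and the rotate-half convention are illustrative assumptions.

```python
import torch

def partial_rope(x, base=10_000):
    # x: (seq_len, num_heads, head_dim); rotate only the first half of head_dim.
    seq_len, _, head_dim = x.shape
    rot_dim = head_dim // 2

    inv_freq = 1.0 / (base ** (torch.arange(0, rot_dim, 2).float() / rot_dim))
    pos = torch.arange(seq_len).float()
    freqs = torch.outer(pos, inv_freq)                    # (seq_len, rot_dim // 2)
    cos = torch.cat([freqs.cos(), freqs.cos()], dim=-1)[:, None, :]
    sin = torch.cat([freqs.sin(), freqs.sin()], dim=-1)[:, None, :]

    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]
    x1, x2 = x_rot[..., : rot_dim // 2], x_rot[..., rot_dim // 2 :]
    rotated = torch.cat([-x2, x1], dim=-1)                # usual rotate-half trick
    x_rot = x_rot * cos + rotated * sin
    return torch.cat([x_rot, x_pass], dim=-1)             # second half left untouched

q = torch.randn(8192, 64, 128)          # 8k tokens, 64 heads, head_dim 128
q_rope = partial_rope(q, base=10_000)   # same shape; only the first 64 dims are rotated
```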

Comparisons:

  • cosformer2: Very bad at NIAH (needle-in-a-haystack), overall slower and lower performance.
  • hgrn2: Slightly better but slower (+ small gap on NIAH).
  • SWA (sliding-window attention) softmax: Similar TGS (tokens per GPU per second), but overall lower performance and a big gap on NIAH.
  • Limitations: No complex benchmarks for long contexts beyond NIAH.

Model Design Thinking:

Goal: Build the best model that fits on a single H100-class node (8x80 GB) with:

  • 1M sequence length.
  • 8-bit quantization.
  • Good balance of softmax and linear attention.
  • Optimal depth/width ratio (deeper models need more softmax layers).
  • Proper memory size vs hidden size ratio.
  • Effective FFN size vs model dimension.
  • Dimensional considerations for rope use in softmax attention.

Scaling Laws and Experiments:

  • MoE vs Dense: At 1T tokens, an MoE (2B active, 24B total params) significantly outperforms a dense 7B model at the same FLOPs.
    (Be cautious about Fig. 4's axes; scores are similar across all benchmarks.)
  • Linear vs Softmax vs Hybrid Attention:
    • Range: 70M → 7B params on 300B tokens.
    • Softmax > Linear for NIAH (big gap).
    • Hybrid performs better than softmax overall.
  • Caveats: Fixed learning rates for scaling laws without fast decay may skew comparisons in favor of hybrid.

Training Data:

  • Used the previous MoE (5B active, 60B total) for data labeling; likely trained a classifier afterwards (details unclear).
  • Metrics: Knowledge, practical helpfulness, categorical distribution.
  • Data formatting balances QA format and natural distribution (might improve MMLU).
  • Deduplication: High-quality data deduped 4x, low-quality 2x.
  • Tracks downstream metrics with acc_norm^2 (byte-normalized accuracy); see the sketch below.
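
For reference, byte-normalized multiple-choice accuracy (in the spirit of lm-eval's `acc_norm`) picks the option with the highest summed log-likelihood divided by its byte length; the exact "^2" variant is not detailed in the post, so this sketch shows only the plain form.

```python
def byte_normalized_prediction(choice_logprobs, choices):
    """Pick the choice with the highest log-likelihood per byte of its text."""
    scores = [lp / len(c.encode("utf-8")) for lp, c in zip(choice_logprobs, choices)]
    return max(range(len(choices)), key=lambda i: scores[i])

# Hypothetical summed log-likelihoods; normalization keeps longer options
# from being penalized purely for their length.
choices = ["Paris", "London is the capital of France"]
logprobs = [-2.0, -20.0]
print(byte_normalized_prediction(logprobs, choices))  # -> 0
```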

Training Hyperparameters:

  • ~12T tokens.
  • WSD-like schedule: Reduces the LR to 10% of the peak, with no final decay phase (see the sketch after this list).
  • Initialization: Xavier with deepnorm modifications.
  • Optimizer: AdamW (0.9, 0.95).
  • Critical batch size warmup explanation (16M → 128M): Unique and insightful.
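
A minimal sketch of a WSD-style schedule matching the shape described above (linear warmup, a long stable phase at the peak LR, then a decay down to 10% of the peak rather than to ~0). The warmup length and decay fraction are assumptions.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps=2000, decay_frac=0.1):
    """Warmup-Stable-Decay learning rate; the decay ends at 10% of the peak."""
    decay_start = int(total_steps * (1 - decay_frac))  # assume decay over the last 10% of steps
    if step < warmup_steps:                            # linear warmup
        return peak_lr * step / warmup_steps
    if step < decay_start:                             # stable phase at peak
        return peak_lr
    progress = (step - decay_start) / (total_steps - decay_start)
    return peak_lr * (1.0 - 0.9 * progress)            # linear decay: peak -> 0.1 * peak

lrs = [wsd_lr(s, total_steps=100_000, peak_lr=2e-4) for s in range(0, 100_001, 10_000)]
```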

Long Context Training:

Three phases:

  1. Main training (8k tokens, rope 10k).
  2. 128k tokens, 300B tokens total, rope base 5M, mix of short (<32k) and medium (<128k) contexts.
  3. 512k → 1M tokens, rope base 10M, balanced across short, medium, and long contexts.

Key Technique: Linear interpolation to mitigate distribution shift:
W_t = alpha * W_prev + (1-alpha) * W_current.
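
A minimal sketch of that interpolation applied parameter-wise when moving from one context-extension phase to the next; the value of alpha and how often the blend is applied are not specified in the post.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def interpolate_weights(model_current, model_prev, alpha=0.5):
    # W_t = alpha * W_prev + (1 - alpha) * W_current, for every parameter.
    for p_cur, p_prev in zip(model_current.parameters(), model_prev.parameters()):
        p_cur.mul_(1.0 - alpha).add_(p_prev, alpha=alpha)
    return model_current

# Toy usage: two identically shaped modules standing in for the
# previous-phase and current-phase checkpoints.
prev_phase = nn.Linear(16, 16)
curr_phase = nn.Linear(16, 16)
interpolate_weights(curr_phase, prev_phase, alpha=0.3)
```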


Post Training and Infra:

  • Post-training steps: Iterative SFT → RL (offline: DPO, online: GRPO).
    • Short-context SFT → Long-context SFT → Short-context RL → Long-context RL.
      (Critical for achieving great long-context performance.)
  • Infrastructure:
    • ~1500–2500 GPUs.
    • Efficient MoE tensor/expert parallelism with optimized ring attention.
    • Improvements in linear-attention sequence parallelism and padding optimization.
