## 2. Model Architecture

The architecture of MiniMax-Text-01 is briefly described as follows:

- Total Parameters: 456B
- Activated Parameters per Token: 45.9B
- Number of Layers: 80
- Hybrid Attention: a softmax attention layer is positioned after every 7 lightning attention layers.
  - Number of attention heads: 64
  - Attention head dimension: 128
- Mixture of Experts:
  - Number of experts: 32
  - Expert hidden dimension: 9216
  - Top-2 routing strategy
- Positional Encoding: Rotary Position Embedding (RoPE) applied to half of the attention head dimension with a base frequency of 10,000,000
- Hidden Size: 6144
- Vocab Size: 200,064

## 3. Evaluation

### Core Academic Benchmarks

| **Tasks** | **GPT-4o (11-20)** | **Claude-3.5-Sonnet (10-22)** | **Gemini-1.5-Pro (002)** | **Gemini-2.0-Flash (exp)** | **Qwen2.5-72B-Inst.** | **DeepSeek-V3** | **Llama-3.1-405B-Inst.** | **MiniMax-Text-01** |
|---|---|---|---|---|---|---|---|---|
| **General** | | | | | | | | |
| MMLU* | 85.7 | 88.3 | 86.8 | 86.5 | 86.1 | 88.5 | **88.6** | 88.5 |
| MMLU-Pro* | 74.4 | **78.0** | 75.8 | 76.4 | 71.1 | 75.9 | 73.3 | 75.7 |
| SimpleQA | **39.0** | 28.1 | 23.4 | 26.6 | 10.3 | 24.9 | 23.2 | 23.7 |
| C-SimpleQA | 64.6 | 56.8 | 59.4 | 63.3 | 52.2 | 64.8 | 54.7 | **67.4** |
| IFEval _(avg)_ | 84.1 | **90.1** | 89.4 | 88.4 | 87.2 | 87.3 | 86.4 | 89.1 |
| Arena-Hard | **92.4** | 87.6 | 85.3 | 72.7 | 81.2 | 91.4 | 63.5 | 89.1 |
| **Reasoning** | | | | | | | | |
| GPQA* _(diamond)_ | 46.0 | **65.0** | 59.1 | 62.1 | 49.0 | 59.1 | 50.7 | 54.4 |
| DROP* _(F1)_ | 89.2 | 88.8 | 89.2 | 89.3 | 85.0 | 91.0 | **92.5** | 87.8 |
| **Mathematics** | | | | | | | | |
| GSM8k* | 95.6 | **96.9** | 95.2 | 95.4 | 95.8 | 96.7 | 96.7 | 94.8 |
| MATH* | 76.6 | 74.1 | **84.6** | 83.9 | 81.8 | **84.6** | 73.8 | 77.4 |
| **Coding** | | | | | | | | |
| MBPP+ | 76.2 | 75.1 | 75.4 | 75.9 | 77.0 | **78.8** | 73.0 | 71.7 |
| HumanEval | 90.2 | **93.7** | 86.6 | 89.6 | 86.6 | 92.1 | 89.0 | 86.9 |

\* Evaluated following a _0-shot CoT_ setting.

### Long Benchmarks

#### 4M Needle In A Haystack Test

#### Ruler

| Model | 4k | 8k | 16k | 32k | 64k | 128k | 256k | 512k | 1M |
|-------|----|----|-----|-----|-----|------|------|------|----|
| **GPT-4o (11-20)** | **0.970** | 0.921 | 0.890 | 0.888 | 0.884 | - | - | - | - |
| **Claude-3.5-Sonnet (10-22)** | 0.965 | 0.960 | 0.957 | 0.950 | **0.952** | 0.938 | - | - | - |
| **Gemini-1.5-Pro (002)** | 0.962 | 0.960 | **0.960** | **0.958** | 0.938 | 0.917 | 0.916 | 0.861 | 0.850 |
| **Gemini-2.0-Flash (exp)** | 0.960 | 0.960 | 0.951 | 0.957 | 0.937 | 0.860 | 0.797 | 0.709 | - |
| **MiniMax-Text-01** | 0.963 | **0.961** | 0.953 | 0.954 | 0.943 | **0.947** | **0.945** | **0.928** | **0.910** |

#### LongBench v2

| **Model** | **overall** | **easy** | **hard** | **short** | **medium** | **long** |
|---|---|---|---|---|---|---|
| Human | 53.7 | 100.0 | 25.1 | 47.2 | 59.1 | 53.7 |
| **w/ CoT** | | | | | | |
| GPT-4o (11-20) | 51.4 | 54.2 | 49.7 | 59.6 | 48.6 | 43.5 |
| Claude-3.5-Sonnet (10-22) | 46.7 | 55.2 | 41.5 | 53.9 | 41.9 | 44.4 |
| DeepSeek-V3 | - | - | - | - | - | - |
| Qwen2.5-72B-Inst. | 43.5 | 47.9 | 40.8 | 48.9 | 40.9 | 39.8 |
| **MiniMax-Text-01** | **56.5** | **66.1** | **50.5** | **61.7** | **56.7** | **47.2** |
| **w/o CoT** | | | | | | |
| GPT-4o (11-20) | 50.1 | 57.4 | 45.6 | 53.3 | 52.4 | 40.2 |
| Claude-3.5-Sonnet (10-22) | 41.0 | 46.9 | 37.3 | 46.1 | 38.6 | 37.0 |
| DeepSeek-V3 | 48.7 | - | - | - | - | - |
| Qwen2.5-72B-Inst. | 42.1 | 42.7 | 41.8 | 45.6 | 38.1 | **44.4** |
| **MiniMax-Text-01** | **52.9** | **60.9** | **47.9** | **58.9** | **52.6** | 43.5 |

#### MTOB

| **Context Type** | **no context** | **half book** | **full book** | **Δ half book** | **Δ full book** |
|---|---|---|---|---|---|
| **eng → kalam (ChrF)** | | | | | |
| GPT-4o (11-20) | 9.90 | **54.30** | - | 44.40 | - |
| Claude-3.5-Sonnet (10-22) | 20.22 | 53.62 | 55.65 | 33.39 | 35.42 |
| Gemini-1.5-Pro (002) | 16.79 | 53.68 | **57.90** | 36.89 | 41.11 |
| Gemini-2.0-Flash (exp) | 12.20 | 49.50 | 53.30 | 37.30 | 41.10 |
| Qwen-Long | 16.55 | 48.48 | 45.94 | 31.92 | 29.39 |
| **MiniMax-Text-01** | 6.0 | 51.74 | 51.60 | **45.7** | **45.6** |
| **kalam → eng (BLEURT)** | | | | | |
| GPT-4o (11-20) | 33.20 | 58.30 | - | 25.10 | - |
| Claude-3.5-Sonnet (10-22) | 31.42 | 59.70 | 62.30 | 28.28 | 30.88 |
| Gemini-1.5-Pro (002) | 32.02 | **61.52** | **63.09** | **29.50** | **31.07** |
| Gemini-2.0-Flash (exp) | 33.80 | 57.50 | 57.00 | 23.70 | 23.20 |
| Qwen-Long | 30.13 | 53.14 | 32.15 | 23.01 | 2.02 |
| **MiniMax-Text-01** | 33.65 | 57.10 | 58.00 | 23.45 | 24.35 |

## 4. Quickstart

Here we provide a simple example of loading the tokenizer and model to generate content.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig, QuantoConfig, GenerationConfig

# load hf config
hf_config = AutoConfig.from_pretrained("MiniMax-Text-01", trust_remote_code=True)

# quantization config, int8 is recommended
quantization_config = QuantoConfig(
    weights="int8",
    modules_to_not_convert=[
        "lm_head",
        "embed_tokens",
    ]
    + [f"model.layers.{i}.coefficient" for i in range(hf_config.num_hidden_layers)]
    + [f"model.layers.{i}.block_sparse_moe.gate" for i in range(hf_config.num_hidden_layers)],
)

# assume 8 GPUs (must be defined before the device map that refers to it)
world_size = 8
layers_per_device = hf_config.num_hidden_layers // world_size

# set device map: embeddings on the first GPU, final norm and lm_head on the last
device_map = {
    'model.embed_tokens': 'cuda:0',
    'model.norm': f'cuda:{world_size - 1}',
    'lm_head': f'cuda:{world_size - 1}',
}
# spread the decoder layers evenly over the GPUs in contiguous blocks
for i in range(world_size):
    for j in range(layers_per_device):
        device_map[f'model.layers.{i * layers_per_device + j}'] = f'cuda:{i}'

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained("MiniMax-Text-01")
prompt = "Hello!"
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant created by MiniMax based on MiniMax-Text-01 model."}]},
    {"role": "user", "content": [{"type": "text", "text": prompt}]},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# tokenize and move to device
model_inputs = tokenizer(text, return_tensors="pt").to("cuda")

# load bfloat16 model, move to device, and apply quantization
quantized_model = AutoModelForCausalLM.from_pretrained(
    "MiniMax-Text-01",
    torch_dtype="bfloat16",
    device_map=device_map,
    quantization_config=quantization_config,
    trust_remote_code=True,
    offload_buffers=True,
)

# generate response
generation_config = GenerationConfig(
    max_new_tokens=20,
    eos_token_id=200020,
    use_cache=True,
)
generated_ids = quantized_model.generate(**model_inputs, generation_config=generation_config)
print(f"generated_ids: {generated_ids}")

# strip the prompt tokens so only the newly generated tokens are decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

## 5. Chatbot & API

For general use and evaluation, we provide a [Chatbot](https://www.hailuo.ai/) with online search capabilities and an [online API](https://intl.minimaxi.com) for developers.

Contact us at [model@minimaxi.com](mailto:model@minimaxi.com).
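
As a supplementary note to Sections 2 and 4, the following minimal sketch illustrates two points in plain Python, without loading any weights: how the "one softmax attention layer after every 7 lightning attention layers" pattern plays out over 80 layers, and how the Quickstart's per-GPU layer assignment could be generalized to layer counts that do not divide evenly across GPUs. The helper names `attention_schedule` and `build_device_map` are hypothetical and not part of the released code. The 0-indexed layer numbering and the remainder handling are assumptions made for illustration; the model's own configuration remains the authoritative source for the layer types.

```python
from collections import Counter


def attention_schedule(num_layers: int = 80, period: int = 8) -> list[str]:
    """Hypothetical helper: label each layer's attention type.

    Assumes 0-indexed layers in which every `period`-th layer uses softmax
    attention and the `period - 1` layers before it use lightning attention.
    """
    return [
        "softmax" if (i + 1) % period == 0 else "lightning"
        for i in range(num_layers)
    ]


def build_device_map(num_layers: int, world_size: int) -> dict[str, str]:
    """Hypothetical helper: assign contiguous blocks of layers to GPUs.

    Mirrors the Quickstart layout (embeddings on the first GPU, final norm
    and lm_head on the last). Unlike plain floor division, the first
    `num_layers % world_size` GPUs take one extra layer, so no layer is
    left without a device.
    """
    device_map = {
        "model.embed_tokens": "cuda:0",
        "model.norm": f"cuda:{world_size - 1}",
        "lm_head": f"cuda:{world_size - 1}",
    }
    base, extra = divmod(num_layers, world_size)
    layer = 0
    for device in range(world_size):
        for _ in range(base + (1 if device < extra else 0)):
            device_map[f"model.layers.{layer}"] = f"cuda:{device}"
            layer += 1
    return device_map


# 80 layers with a period of 8 give 70 lightning and 10 softmax layers
print(Counter(attention_schedule()))

# 80 layers over 8 GPUs: 10 layers per GPU, as in the Quickstart
device_map = build_device_map(num_layers=80, world_size=8)
print(device_map["model.layers.0"], device_map["model.layers.79"])
```

With 80 layers and 8 GPUs this produces the same layer placement as the Quickstart loop. The remainder handling only matters for other combinations, for example a hypothetical 10-layer model on 4 GPUs, where floor division alone would leave the last two layers unassigned.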