language:
  - en
license: apache-2.0
tags:
  - axolotl
  - generated_from_trainer
base_model: pszemraj/Mistral-7B-v0.3-prune6
datasets:
  - BEE-spoke-data/knowledge-inoc-concat-v1
model-index:
  - name: Mistral-v0.3-6B
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 45.14
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=pszemraj/Mistral-v0.3-6B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 71.65
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=pszemraj/Mistral-v0.3-6B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 51.83
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=pszemraj/Mistral-v0.3-6B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 45.64
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=pszemraj/Mistral-v0.3-6B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 72.77
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=pszemraj/Mistral-v0.3-6B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 8.34
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=pszemraj/Mistral-v0.3-6B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: IFEval (0-Shot)
          type: HuggingFaceH4/ifeval
          args:
            num_few_shot: 0
        metrics:
          - type: inst_level_strict_acc and prompt_level_strict_acc
            value: 24.54
            name: strict accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=pszemraj/Mistral-v0.3-6B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: BBH (3-Shot)
          type: BBH
          args:
            num_few_shot: 3
        metrics:
          - type: acc_norm
            value: 13.52
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=pszemraj/Mistral-v0.3-6B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MATH Lvl 5 (4-Shot)
          type: hendrycks/competition_math
          args:
            num_few_shot: 4
        metrics:
          - type: exact_match
            value: 0.83
            name: exact match
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=pszemraj/Mistral-v0.3-6B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GPQA (0-shot)
          type: Idavidrein/gpqa
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 2.01
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=pszemraj/Mistral-v0.3-6B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MuSR (0-shot)
          type: TAUR-Lab/MuSR
          args:
            num_few_shot: 0
        metrics:
          - type: acc_norm
            value: 6.61
            name: acc_norm
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=pszemraj/Mistral-v0.3-6B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU-PRO (5-shot)
          type: TIGER-Lab/MMLU-Pro
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 12.7
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=pszemraj/Mistral-v0.3-6B
          name: Open LLM Leaderboard

Mistral-v0.3-6B

Brief continued pretraining at a context length of 4096 tokens to 'heal' the model after layer pruning.

Model description

This model is a fine-tuned version of pszemraj/Mistral-7B-v0.3-prune6 on the BEE-spoke-data/knowledge-inoc-concat-v1 dataset (smorgasbord-tb-quality config). It achieves the following results on the evaluation set:

  • Loss: 1.2860
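
Since this is a base model produced by continued pretraining, plain text completion is the natural way to try it. The following is a minimal sketch, assuming the weights are published under the repo ID `pszemraj/Mistral-v0.3-6B` (the ID used in the leaderboard URLs of this card; the training config pushes to `pszemraj/Mistral-v0.3-6B-ii`) and that `torch` and `transformers` are installed. The function name `generate_text` and its defaults are illustrative, not part of the original card.

```python
MODEL_ID = "pszemraj/Mistral-v0.3-6B"  # assumption: repo ID taken from the leaderboard URLs


def generate_text(prompt: str, max_new_tokens: int = 64) -> str:
    """Load the model and return a greedy continuation of `prompt`."""
    # Imported lazily so that defining this helper does not require the GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # bfloat16 matches the dtype used for training and the quick eval in this card.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )
    # The decoded string contains the prompt followed by the continuation.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_text("The capital of France is")` downloads the weights on first use. No chat template is applied, because nothing in this card indicates instruction tuning.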

Built with Axolotl

See axolotl config

axolotl version: 0.4.0

base_model: pszemraj/Mistral-7B-v0.3-prune6
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer

strict: false
seed: 80085
max_steps: 2000
# dataset
datasets:
    - path: BEE-spoke-data/knowledge-inoc-concat-v1
      name: smorgasbord-tb-quality
      type: completion 
      field: text 
val_set_size: 0.01

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: false
train_on_inputs: false
group_by_length: false

# WANDB
wandb_project: llama3-pruning
wandb_entity: pszemraj
wandb_watch: gradients
wandb_name: Mistral-6B-v0.3-v0.1-ii
hub_model_id: pszemraj/Mistral-v0.3-6B-ii
hub_strategy: every_save

gradient_accumulation_steps: 16
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_32bit
weight_decay: 0.1
lr_scheduler: cosine
learning_rate: 2e-5
warmup_ratio: 0.1

load_in_8bit: false
load_in_4bit: false
bfloat16: true
tf32: true

flash_attention: true
torch_compile: true 
torch_compile_backend: inductor 
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false

# hyperparams for freq of evals, saving, etc
evals_per_epoch: 5
saves_per_epoch: 5
save_safetensors: true
save_total_limit: 1
output_dir: /workspace/output-axolotl/output-model-6b
logging_steps: 6

deepspeed:

special_tokens:

Quick eval

Quick eval for: pszemraj/Mistral-v0.3-6B-ii

bootstrapping for stddev: perplexity

hf (pretrained=pszemraj/Mistral-v0.3-6B-ii,trust_remote_code=True,dtype=bfloat16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 2

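| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| arc_easy | 1 | none | 0 | acc | 0.7109 | ± 0.0093 |
| | | none | 0 | acc_norm | 0.6654 | ± 0.0097 |
| boolq | 2 | none | 0 | acc | 0.7930 | ± 0.0071 |
| lambada_openai | 1 | none | 0 | perplexity | 4.9892 | ± 0.1269 |
| | | none | 0 | acc | 0.6746 | ± 0.0065 |
| openbookqa | 1 | none | 0 | acc | 0.2460 | ± 0.0193 |
| | | none | 0 | acc_norm | 0.3700 | ± 0.0216 |
| piqa | 1 | none | 0 | acc | 0.7350 | ± 0.0103 |
| | | none | 0 | acc_norm | 0.7350 | ± 0.0103 |
| winogrande | 1 | none | 0 | acc | 0.6930 | ± 0.0130 |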
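
To help read the Stderr column, the sketch below checks that the reported standard errors for the `acc` rows are consistent with a simple binomial standard error, sqrt(p(1-p)/n). The evaluation-split sizes in `N_EXAMPLES` are not stated in this card. They are assumptions based on the commonly used splits for these tasks, and the helper name `binomial_stderr` is hypothetical.

```python
import math

# (accuracy, reported stderr) for the `acc` rows of the quick eval table.
REPORTED = {
    "arc_easy": (0.7109, 0.0093),
    "boolq": (0.7930, 0.0071),
    "lambada_openai": (0.6746, 0.0065),
    "openbookqa": (0.2460, 0.0193),
    "piqa": (0.7350, 0.0103),
    "winogrande": (0.6930, 0.0130),
}

# Assumed number of evaluated examples per task (not given in this card).
N_EXAMPLES = {
    "arc_easy": 2376,
    "boolq": 3270,
    "lambada_openai": 5153,
    "openbookqa": 500,
    "piqa": 1838,
    "winogrande": 1267,
}


def binomial_stderr(acc: float, n: int) -> float:
    """Standard error of an accuracy estimated from n independent examples."""
    return math.sqrt(acc * (1.0 - acc) / n)


if __name__ == "__main__":
    for task, (acc, reported) in REPORTED.items():
        estimate = binomial_stderr(acc, N_EXAMPLES[task])
        print(f"{task:15s} reported={reported:.4f} estimated={estimate:.4f}")
```

A practical consequence: differences between checkpoints smaller than roughly two standard errors (for example under 0.02 accuracy on winogrande) are hard to distinguish from noise.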

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 80085
  • gradient_accumulation_steps: 16
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 200
  • training_steps: 2000

Training results

| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| No log | 0.0002 | 1 | 1.5980 |
| 1.578 | 0.0955 | 400 | 1.4028 |
| 1.5828 | 0.1911 | 800 | 1.3809 |
| 1.4355 | 0.2866 | 1200 | 1.3152 |
| 1.4618 | 0.3822 | 1600 | 1.2877 |
| 1.4551 | 0.4777 | 2000 | 1.2860 |
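
To put the validation losses on a more intuitive scale, they can be converted to per-token perplexity. The sketch below assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention for causal language model training but is not stated explicitly in this card. The resulting numbers are per token on the validation split and are not comparable to the word-level `lambada_openai` perplexity in the quick eval.

```python
import math

# Validation loss by training step, copied from the training results.
VAL_LOSS = {
    1: 1.5980,
    400: 1.4028,
    800: 1.3809,
    1200: 1.3152,
    1600: 1.2877,
    2000: 1.2860,
}


def perplexity(loss: float) -> float:
    """Per-token perplexity for a mean cross-entropy loss given in nats."""
    return math.exp(loss)


def relative_loss_reduction(start: float, end: float) -> float:
    """Fraction by which the loss dropped between two evaluations."""
    return (start - end) / start


if __name__ == "__main__":
    for step, loss in VAL_LOSS.items():
        print(f"step {step:4d}: loss={loss:.4f} perplexity={perplexity(loss):.3f}")
    drop = relative_loss_reduction(VAL_LOSS[1], VAL_LOSS[2000])
    print(f"relative loss reduction: {drop:.1%}")
```

Most of the improvement arrives by step 1200. The change between steps 1600 and 2000 is small, which is what one would expect as the cosine schedule approaches zero.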

Framework versions

  • Transformers 4.40.2
  • Pytorch 2.3.0+cu118
  • Datasets 2.19.1
  • Tokenizers 0.19.1

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric | Value |
|---|---|
| Avg. | 49.23 |
| AI2 Reasoning Challenge (25-Shot) | 45.14 |
| HellaSwag (10-Shot) | 71.65 |
| MMLU (5-Shot) | 51.83 |
| TruthfulQA (0-shot) | 45.64 |
| Winogrande (5-shot) | 72.77 |
| GSM8k (5-shot) | 8.34 |

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric | Value |
|---|---|
| Avg. | 10.03 |
| IFEval (0-Shot) | 24.54 |
| BBH (3-Shot) | 13.52 |
| MATH Lvl 5 (4-Shot) | 0.83 |
| GPQA (0-shot) | 2.01 |
| MuSR (0-shot) | 6.61 |
| MMLU-PRO (5-shot) | 12.70 |
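
The "Avg." rows in the two leaderboard result sets are consistent with an unweighted mean of the six benchmark scores in each set. The sketch below recomputes both. The dictionary names `FIRST_RESULT_SET` and `SECOND_RESULT_SET` are illustrative labels, and how the leaderboard rounds its averages is not specified here.

```python
# Scores copied from the two leaderboard result tables in this card.
FIRST_RESULT_SET = {
    "AI2 Reasoning Challenge (25-Shot)": 45.14,
    "HellaSwag (10-Shot)": 71.65,
    "MMLU (5-Shot)": 51.83,
    "TruthfulQA (0-shot)": 45.64,
    "Winogrande (5-shot)": 72.77,
    "GSM8k (5-shot)": 8.34,
}

SECOND_RESULT_SET = {
    "IFEval (0-Shot)": 24.54,
    "BBH (3-Shot)": 13.52,
    "MATH Lvl 5 (4-Shot)": 0.83,
    "GPQA (0-shot)": 2.01,
    "MuSR (0-shot)": 6.61,
    "MMLU-PRO (5-shot)": 12.70,
}


def mean_score(scores: dict[str, float]) -> float:
    """Unweighted mean of the benchmark scores."""
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    print(f"first set mean:  {mean_score(FIRST_RESULT_SET):.3f}")
    print(f"second set mean: {mean_score(SECOND_RESULT_SET):.3f}")
```

The two averages are not directly comparable, since the result sets use different benchmarks and the listed scores are on different scales.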