metadata

license: cc
library_name: transformers
datasets:
  - jondurbin/truthy-dpo-v0.1
model-index:
  - name: MBX-7B-v3-DPO
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 73.55
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/MBX-7B-v3-DPO
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 89.11
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/MBX-7B-v3-DPO
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 64.91
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/MBX-7B-v3-DPO
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 74
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/MBX-7B-v3-DPO
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 85.56
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/MBX-7B-v3-DPO
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 69.67
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/MBX-7B-v3-DPO
          name: Open LLM Leaderboard

MBX-7B-v3-DPO

This model is a DPO fine-tune of flemmingmiguel/MBX-7B-v3, trained on the jondurbin/truthy-dpo-v0.1 dataset.

Code Example

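from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("macadeliccc/MBX-7B-v3-DPO")
model = AutoModelForCausalLM.from_pretrained("macadeliccc/MBX-7B-v3-DPO")

messages = [
    {"role": "system", "content": "Respond to the user's request like a pirate"},
    {"role": "user", "content": "Can you write me a quicksort algorithm?"}
]
# Render the chat with the model's template and append the assistant prompt
gen_input = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
)

# Generate a reply, then decode only the newly generated tokens
output = model.generate(**gen_input, max_new_tokens=256)
print(tokenizer.decode(output[0][gen_input["input_ids"].shape[-1]:], skip_special_tokens=True))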

GGUF

Available here

Exllamav2

Quants are available from bartowski; check them out here.

Download the size you want below. VRAM figures are estimates.

Branch	Bits	lm_head bits	VRAM (4k)	VRAM (16k)	VRAM (32k)	Description
8_0	8.0	8.0	8.4 GB	9.8 GB	11.8 GB	Maximum quality that ExLlamaV2 can produce, near unquantized performance.
6_5	6.5	8.0	7.2 GB	8.6 GB	10.6 GB	Very similar to 8.0, good tradeoff of size vs performance, recommended.
5_0	5.0	6.0	6.0 GB	7.4 GB	9.4 GB	Slightly lower quality vs 6.5, but usable on 8GB cards.
4_25	4.25	6.0	5.3 GB	6.7 GB	8.7 GB	GPTQ equivalent bits per weight, slightly higher quality.
3_5	3.5	6.0	4.7 GB	6.1 GB	8.1 GB	Lower quality, only use if you have to.
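
As a rough aid for reading the table, the sketch below picks the highest-bitrate branch whose estimated VRAM fits a given budget. The function name `pick_branch` and the data layout are illustrative assumptions, not part of any library, and the figures are the estimates from the table, so leave some headroom in practice.

```python
# VRAM estimates in GB, copied from the table above, ordered from highest to lowest quality.
QUANTS = [
    ("8_0",  {"4k": 8.4, "16k": 9.8, "32k": 11.8}),
    ("6_5",  {"4k": 7.2, "16k": 8.6, "32k": 10.6}),
    ("5_0",  {"4k": 6.0, "16k": 7.4, "32k": 9.4}),
    ("4_25", {"4k": 5.3, "16k": 6.7, "32k": 8.7}),
    ("3_5",  {"4k": 4.7, "16k": 6.1, "32k": 8.1}),
]


def pick_branch(vram_gb, context="4k"):
    """Return the first (highest-quality) branch whose estimate fits, or None.

    Hypothetical helper for illustration only.
    """
    for branch, usage in QUANTS:
        if usage[context] <= vram_gb:
            return branch
    return None


print(pick_branch(8.0, "4k"))   # 6_5
print(pick_branch(8.0, "16k"))  # 5_0
```

The returned branch name could then be passed as the `revision` argument of `huggingface_hub.snapshot_download`. The repository ID of the quantized weights is not given in this text, so it is left out here.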

Evaluations

EQ-Bench Comparison

----Benchmark Complete----
2024-01-30 15:22:18
Time taken: 145.9 mins
Prompt Format: ChatML
Model: macadeliccc/MBX-7B-v3-DPO
Score (v2): 74.32
Parseable: 166.0
---------------
Batch completed
Time taken: 145.9 mins
---------------

Original Model

----Benchmark Complete----
2024-01-31 01:26:26
Time taken: 89.1 mins
Prompt Format: Mistral
Model: flemmingmiguel/MBX-7B-v3
Score (v2): 73.87
Parseable: 168.0
---------------
Batch completed
Time taken: 89.1 mins
---------------
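
To compare the two runs programmatically, the scores can be pulled out of the log text. The following is a minimal sketch that assumes the log layout shown above. The helper `parse_eq_bench` is hypothetical, and only the fields it needs are reproduced in the sample strings.

```python
import re

# Relevant lines copied from the two logs above.
DPO_LOG = """Prompt Format: ChatML
Model: macadeliccc/MBX-7B-v3-DPO
Score (v2): 74.32
Parseable: 166.0"""

BASE_LOG = """Prompt Format: Mistral
Model: flemmingmiguel/MBX-7B-v3
Score (v2): 73.87
Parseable: 168.0"""


def parse_eq_bench(log):
    """Extract model name, prompt format, score, and parseable count from one log."""
    def field(label):
        # Match "<label>: <value>" on its own line.
        return re.search(rf"^{re.escape(label)}: (.+)$", log, re.M).group(1)

    return {
        "model": field("Model"),
        "prompt_format": field("Prompt Format"),
        "score": float(field("Score (v2)")),
        "parseable": float(field("Parseable")),
    }


dpo, base = parse_eq_bench(DPO_LOG), parse_eq_bench(BASE_LOG)
delta = round(dpo["score"] - base["score"], 2)
print(delta)  # 0.45
```

Note that the two runs used different prompt formats (ChatML and Mistral) and parsed a different number of answers, so the 0.45-point gap is not a strictly like-for-like comparison.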

Model	AGIEval	GPT4All	TruthfulQA	Bigbench	Average
MBX-7B-v3-DPO	45.16	77.73	74.62	48.83	61.58

AGIEval

Task	Version	Metric	Value		Stderr
agieval_aqua_rat	0	acc	27.95	±	2.82
		acc_norm	26.77	±	2.78
agieval_logiqa_en	0	acc	41.01	±	1.93
		acc_norm	40.55	±	1.93
agieval_lsat_ar	0	acc	25.65	±	2.89
		acc_norm	23.91	±	2.82
agieval_lsat_lr	0	acc	50.78	±	2.22
		acc_norm	52.94	±	2.21
agieval_lsat_rc	0	acc	66.54	±	2.88
		acc_norm	65.80	±	2.90
agieval_sat_en	0	acc	77.67	±	2.91
		acc_norm	77.67	±	2.91
agieval_sat_en_without_passage	0	acc	43.20	±	3.46
		acc_norm	43.20	±	3.46
agieval_sat_math	0	acc	32.27	±	3.16
		acc_norm	30.45	±	3.11

Average: 45.16%

GPT4All

Task	Version	Metric	Value		Stderr
arc_challenge	0	acc	68.43	±	1.36
		acc_norm	68.34	±	1.36
arc_easy	0	acc	87.54	±	0.68
		acc_norm	82.11	±	0.79
boolq	1	acc	88.20	±	0.56
hellaswag	0	acc	69.76	±	0.46
		acc_norm	87.40	±	0.33
openbookqa	0	acc	40.20	±	2.19
		acc_norm	49.60	±	2.24
piqa	0	acc	83.68	±	0.86
		acc_norm	85.36	±	0.82
winogrande	0	acc	83.11	±	1.05

Average: 77.73%
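
The reported average is consistent with taking acc_norm where it is listed and acc otherwise. That rule is inferred from the numbers rather than stated in this card, so treat the sketch below as a plausibility check, not as the evaluation harness's actual code.

```python
# (task, acc, acc_norm) copied from the GPT4All table; None where acc_norm is not reported.
GPT4ALL = [
    ("arc_challenge", 68.43, 68.34),
    ("arc_easy", 87.54, 82.11),
    ("boolq", 88.20, None),
    ("hellaswag", 69.76, 87.40),
    ("openbookqa", 40.20, 49.60),
    ("piqa", 83.68, 85.36),
    ("winogrande", 83.11, None),
]


def suite_average(rows):
    # Assumed rule: prefer acc_norm, fall back to acc when acc_norm is missing.
    picked = [acc if acc_norm is None else acc_norm for _, acc, acc_norm in rows]
    return sum(picked) / len(picked)


print(round(suite_average(GPT4ALL), 2))  # 77.73
```

Applying the same rule to the acc_norm column of the AGIEval table gives 45.16, which matches the figure reported there.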

TruthfulQA

Task	Version	Metric	Value		Stderr
truthfulqa_mc	1	mc1	58.87	±	1.72
		mc2	74.62	±	1.44

Average: 74.62%

Bigbench

Task	Version	Metric	Value		Stderr
bigbench_causal_judgement	0	multiple_choice_grade	60.00	±	3.56
bigbench_date_understanding	0	multiple_choice_grade	63.14	±	2.51
bigbench_disambiguation_qa	0	multiple_choice_grade	47.67	±	3.12
bigbench_geometric_shapes	0	multiple_choice_grade	22.56	±	2.21
		exact_str_match	0.84	±	0.48
bigbench_logical_deduction_five_objects	0	multiple_choice_grade	33.20	±	2.11
bigbench_logical_deduction_seven_objects	0	multiple_choice_grade	23.00	±	1.59
bigbench_logical_deduction_three_objects	0	multiple_choice_grade	59.67	±	2.84
bigbench_movie_recommendation	0	multiple_choice_grade	47.40	±	2.24
bigbench_navigate	0	multiple_choice_grade	56.10	±	1.57
bigbench_reasoning_about_colored_objects	0	multiple_choice_grade	71.25	±	1.01
bigbench_ruin_names	0	multiple_choice_grade	56.47	±	2.35
bigbench_salient_translation_error_detection	0	multiple_choice_grade	35.27	±	1.51
bigbench_snarks	0	multiple_choice_grade	73.48	±	3.29
bigbench_sports_understanding	0	multiple_choice_grade	75.46	±	1.37
bigbench_temporal_sequences	0	multiple_choice_grade	52.10	±	1.58
bigbench_tracking_shuffled_objects_five_objects	0	multiple_choice_grade	22.64	±	1.18
bigbench_tracking_shuffled_objects_seven_objects	0	multiple_choice_grade	19.83	±	0.95
bigbench_tracking_shuffled_objects_three_objects	0	multiple_choice_grade	59.67	±	2.84

Average: 48.83%

Average score: 61.58%
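
The overall score appears to be the unweighted mean of the four suite averages. A quick check under that assumption:

```python
# Suite averages as reported above.
SUITES = {"AGIEval": 45.16, "GPT4All": 77.73, "TruthfulQA": 74.62, "Bigbench": 48.83}

# Unweighted mean across suites (assumed aggregation rule).
overall = sum(SUITES.values()) / len(SUITES)
print(f"{overall:.3f}")
```

The rounded inputs give roughly 61.585, which agrees with the reported 61.58% to within rounding. The reported figure was presumably computed from unrounded suite averages.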

Elapsed time: 02:37:39

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric	Value
Avg.	76.13
AI2 Reasoning Challenge (25-Shot)	73.55
HellaSwag (10-Shot)	89.11
MMLU (5-Shot)	64.91
TruthfulQA (0-shot)	74.00
Winogrande (5-shot)	85.56
GSM8k (5-shot)	69.67
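
The Avg. row can be reproduced as the plain mean of the six benchmark scores. The sketch below uses only the values from the table above.

```python
# Benchmark scores copied from the leaderboard table.
LEADERBOARD = {
    "AI2 Reasoning Challenge (25-Shot)": 73.55,
    "HellaSwag (10-Shot)": 89.11,
    "MMLU (5-Shot)": 64.91,
    "TruthfulQA (0-shot)": 74.00,
    "Winogrande (5-shot)": 85.56,
    "GSM8k (5-shot)": 69.67,
}

# Unweighted mean over the six benchmarks.
leaderboard_avg = sum(LEADERBOARD.values()) / len(LEADERBOARD)
print(round(leaderboard_avg, 2))  # 76.13
```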