File size: 7,375 Bytes

---
license: cc-by-nc-4.0
model-index:
- name: Kunoichi-DPO-v2-7B
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 69.62
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-v2-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 87.44
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-v2-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 64.94
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-v2-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 66.06
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-v2-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 80.82
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-v2-7B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 65.88
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=SanjiWatsuki/Kunoichi-DPO-v2-7B
      name: Open LLM Leaderboard
---

| Model                | MT Bench | EQ Bench | MMLU   | Logic Test |
|----------------------|----------|----------|---------|-------------|
| GPT-4-Turbo         | 9.32     | -        | -       | -           |
| GPT-4               | 8.99     | 62.52    | 86.4    | 0.86        |
| **Kunoichi-DPO-v2-7B** | **8.51**     | **42.18**    | **64.94**| **0.58**        |
| Mixtral-8x7B-Instruct| 8.30     | 44.81    | 70.6    | 0.75        |
| **Kunoichi-DPO-7B** | **8.29**     | **41.60**    | **64.83**    | **0.59**        |
| **Kunoichi-7B**     | **8.14**     | **44.32**    | **64.9**    | **0.58**            |
| Starling-7B         | 8.09     | -        | 63.9    | 0.51        |
| Claude-2            | 8.06     | 52.14    | 78.5    | -           |
| Silicon-Maid-7B     | 7.96     | 40.44    | 64.7    | 0.54           |
| Loyal-Macaroni-Maid-7B | 7.95     | 38.66    | 64.9   | 0.57        |
| GPT-3.5-Turbo       | 7.94     | 50.28    | 70      | 0.57        |
| Claude-1            | 7.9       | -        | 77      | -           |
| Openchat-3.5        | 7.81     | 37.08    | 64.3    | 0.39        |
| Dolphin-2.6-DPO     | 7.74     | 42.88    | 61.9    | 0.53        |
| Zephyr-7B-beta      | 7.34     | 38.71    | 61.4    | 0.30        |
| Llama-2-70b-chat-hf | 6.86     | 51.56    | 63      | -           |
| Neural-chat-7b-v3-1 | 6.84     | 43.61    | 62.4    | 0.30        |

| Model | Average | AGIEval | GPT4All | TruthfulQA | Bigbench |
|---|---:|---:|---:|---:|---:|
| **Kunoichi-DPO-7B**|**58.4**|  45.08 |  74|     66.99|   47.52|
| **Kunoichi-DPO-v2-7B**|**58.31**|  44.85|  75.05|     65.69|   47.65|
| [Kunoichi-7B](https://huggingface.co/SanjiWatsuki/Kunoichi-7B)|57.54|  44.99|  74.86|     63.72|   46.58|
| [OpenPipe/mistral-ft-optimized-1218](https://huggingface.co/OpenPipe/mistral-ft-optimized-1218)| 56.85 | 44.74 | 75.6 | 59.89 | 47.17 |
| [Silicon-Maid-7B](https://huggingface.co/SanjiWatsuki/Silicon-Maid-7B) | 56.45|  44.74|  74.26|      61.5|   45.32|
| [mlabonne/NeuralHermes-2.5-Mistral-7B](https://huggingface.co/mlabonne/NeuralHermes-2.5-Mistral-7B) | 53.51 | 43.67 | 73.24 | 55.37 | 41.76 |
| [teknium/OpenHermes-2.5-Mistral-7B](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B)  | 52.42 | 42.75 | 72.99 | 52.99 | 40.94 |
| [openchat/openchat_3.5](https://huggingface.co/openchat/openchat_3.5) | 51.34 | 42.67 | 72.92 | 47.27 | 42.51 |
| [berkeley-nest/Starling-LM-7B-alpha](https://huggingface.co/berkeley-nest/Starling-LM-7B-alpha) | 51.16 | 42.06 | 72.72 | 47.33 | 42.53 |
| [HuggingFaceH4/zephyr-7b-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) | 50.99 | 37.33 | 71.83 | 55.1 | 39.7 |

| Model                       | AlpacaEval2 | Length |
| --------------------------- | ----------- | ------ |
| GPT-4                       | 23.58%      | 1365   |
| GPT-4 0314                  | 22.07%      | 1371   |
| Mistral Medium              | 21.86%      | 1500   |
| Mixtral 8x7B v0.1           | 18.26%      | 1465   |
| **Kunoichi-DPO-v2**         | **17.19%**  | 1785   |
| Claude 2                    | 17.19%      | 1069   |
| Claude                      | 16.99%      | 1082   |
| Gemini Pro                  | 16.85%      | 1315   |
| GPT-4 0613                  | 15.76%      | 1140   |
| Claude 2.1                  | 15.73%      | 1096   |
| Mistral 7B v0.2             | 14.72%      | 1676   |
| GPT 3.5 Turbo 0613          | 14.13%      | 1328   |
| LLaMA2 Chat 70B             | 13.87%      | 1790   |
| LMCocktail-10.7B-v1         | 13.15%      | 1203   |
| WizardLM 13B V1.1           | 11.23%      | 1525   |
| Zephyr 7B Beta              | 10.99%      | 1444   |
| OpenHermes-2.5-Mistral (7B) | 10.34%      | 1107   |
| GPT 3.5 Turbo 0301          | 9.62%       | 827    |
| **Kunoichi-7B**             | **9.38%**   | 1492   |
| GPT 3.5 Turbo 1106          | 9.18%       | 796    |
| GPT-3.5                     | 8.56%       | 1018   |
| Phi-2 DPO                   | 7.76%       | 1687   |
| LLaMA2 Chat 13B             | 7.70%       | 1513   |
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_SanjiWatsuki__Kunoichi-DPO-v2-7B)

|             Metric              |Value|
|---------------------------------|----:|
|Avg.                             |72.46|
|AI2 Reasoning Challenge (25-Shot)|69.62|
|HellaSwag (10-Shot)              |87.44|
|MMLU (5-Shot)                    |64.94|
|TruthfulQA (0-shot)              |66.06|
|Winogrande (5-shot)              |80.82|
|GSM8k (5-shot)                   |65.88|