metadata
license: llama2
model-index:
  - name: XwinCoder-34B
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 51.02
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Xwin-LM/XwinCoder-34B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 74.02
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Xwin-LM/XwinCoder-34B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 49.53
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Xwin-LM/XwinCoder-34B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 43.82
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Xwin-LM/XwinCoder-34B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 68.35
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Xwin-LM/XwinCoder-34B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 39.35
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Xwin-LM/XwinCoder-34B
          name: Open LLM Leaderboard

XwinCoder

We are glad to introduce XwinCoder, our instruction-finetuned code generation models based on CodeLLaMA. We release the model weights and the evaluation code.

Repository: https://github.com/Xwin-LM/Xwin-LM/tree/main/Xwin-Coder

Models:

| Model | 🤗hf link | HumanEval pass@1 | MBPP pass@1 | APPS-intro pass@5 |
|---|---|---|---|---|
| XwinCoder-7B | link | 63.8 | 57.4 | 31.5 |
| XwinCoder-13B | link | 68.8 | 60.1 | 35.4 |
| XwinCoder-34B | link | 74.2 | 64.8 | 43.0 |
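
The model card does not say how the pass@k columns are computed. As a minimal sketch, assuming the widely used unbiased estimator introduced with HumanEval (n samples per problem, c of them passing the tests), the metric could be computed as below. The function names `pass_at_k` and `mean_pass_at_k` are illustrative and are not taken from the XwinCoder evaluation code, which may differ.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: samples that pass all tests,
    k: the k in pass@k. Assumed estimator: 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With a single greedy sample per problem (n = 1, k = 1), this reduces to the fraction of problems solved.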

Updates

  • 💥 We released XwinCoder-7B, XwinCoder-13B, and XwinCoder-34B. XwinCoder-34B reaches 74.2 on HumanEval and achieves performance comparable to GPT-3.5-turbo on 6 benchmarks.

  • We support evaluating instruction-finetuned models on HumanEval, MBPP, APPS, DS1000, and MT-Bench. See our GitHub repository.

Overview


  • To demonstrate our model's coding capabilities in real-world usage scenarios, we evaluated it on several mainstream coding benchmarks rather than only on HumanEval, currently the most popular one.
  • As shown in the radar chart, our 34B model achieves performance comparable to GPT-3.5-turbo on coding tasks.
  • To keep the visualization accurate, the axes of the radar chart are not rescaled, only translated. The one exception is the MT-Bench score, which is multiplied by 10 to make it comparable with the other benchmarks.
  • Multiple-E-avg6 refers to the 6 languages used in the CodeLLaMA paper. The GPT-4 and GPT-3.5-turbo results were obtained by us; more details will be released later.
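
The axis treatment described above (translation only, plus a 10x factor for MT-Bench) can be sketched as follows. This is an illustration under assumptions, not the authors' plotting code: the function name `to_chart_value`, the per-axis offsets, and the order of operations (scale first, then translate) are hypothetical, and the numbers in the usage example are placeholders rather than reported results.

```python
def to_chart_value(benchmark: str, score: float, offsets: dict[str, float]) -> float:
    """Map a raw benchmark score to its position on a radar-chart axis.

    Assumption: MT-Bench (a 1-10 scale) is multiplied by 10 first, then
    every axis is shifted by a per-benchmark offset and never stretched.
    """
    if benchmark == "MT-Bench":
        score = score * 10  # bring the 1-10 scale in line with percentages
    return score - offsets.get(benchmark, 0.0)


# Placeholder values for illustration only (not measured results).
example_offsets = {"MT-Bench": 40.0, "HumanEval": 40.0}
example_position = to_chart_value("MT-Bench", 7.5, example_offsets)
```

Because each axis is only shifted, a gap of one point between two models looks the same on every axis, which is what keeps the chart visually honest.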

Demo

We provide a chat demo in our GitHub repository. Here are some examples:

[Image: Chat demo]

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric | Value |
|---|---|
| Avg. | 54.35 |
| AI2 Reasoning Challenge (25-Shot) | 51.02 |
| HellaSwag (10-Shot) | 74.02 |
| MMLU (5-Shot) | 49.53 |
| TruthfulQA (0-shot) | 43.82 |
| Winogrande (5-shot) | 68.35 |
| GSM8k (5-shot) | 39.35 |
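
As a quick check, the "Avg." row is consistent with an unweighted mean of the six benchmark scores listed above. The sketch below assumes that is how the leaderboard derives it; the variable names are illustrative.

```python
# Scores copied from the results table above.
scores = {
    "AI2 Reasoning Challenge (25-Shot)": 51.02,
    "HellaSwag (10-Shot)": 74.02,
    "MMLU (5-Shot)": 49.53,
    "TruthfulQA (0-shot)": 43.82,
    "Winogrande (5-shot)": 68.35,
    "GSM8k (5-shot)": 39.35,
}

# Unweighted mean over the six benchmarks.
average = sum(scores.values()) / len(scores)
print(f"{average:.2f}")  # prints 54.35
```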