	---
	license: llama2
	model-index:
	- name: XwinCoder-34B
	  results:
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: AI2 Reasoning Challenge (25-Shot)
	      type: ai2_arc
	      config: ARC-Challenge
	      split: test
	      args:
	        num_few_shot: 25
	    metrics:
	    - type: acc_norm
	      value: 51.02
	      name: normalized accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Xwin-LM/XwinCoder-34B
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: HellaSwag (10-Shot)
	      type: hellaswag
	      split: validation
	      args:
	        num_few_shot: 10
	    metrics:
	    - type: acc_norm
	      value: 74.02
	      name: normalized accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Xwin-LM/XwinCoder-34B
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: MMLU (5-Shot)
	      type: cais/mmlu
	      config: all
	      split: test
	      args:
	        num_few_shot: 5
	    metrics:
	    - type: acc
	      value: 49.53
	      name: accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Xwin-LM/XwinCoder-34B
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: TruthfulQA (0-shot)
	      type: truthful_qa
	      config: multiple_choice
	      split: validation
	      args:
	        num_few_shot: 0
	    metrics:
	    - type: mc2
	      value: 43.82
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Xwin-LM/XwinCoder-34B
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: Winogrande (5-shot)
	      type: winogrande
	      config: winogrande_xl
	      split: validation
	      args:
	        num_few_shot: 5
	    metrics:
	    - type: acc
	      value: 68.35
	      name: accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Xwin-LM/XwinCoder-34B
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: GSM8k (5-shot)
	      type: gsm8k
	      config: main
	      split: test
	      args:
	        num_few_shot: 5
	    metrics:
	    - type: acc
	      value: 39.35
	      name: accuracy
	    source:
	      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Xwin-LM/XwinCoder-34B
	      name: Open LLM Leaderboard
	---
	# XwinCoder

	We are glad to introduce XwinCoder, our instruction-finetuned code generation models based on CodeLLaMA. We release the model weights and evaluation code.

	Repository: [https://github.com/Xwin-LM/Xwin-LM/tree/main/Xwin-Coder](https://github.com/Xwin-LM/Xwin-LM/tree/main/Xwin-Coder)

	Models:
	| Model | 🤗hf link | HumanEval pass@1 | MBPP pass@1 | APPS-intro pass@5 |
	|-------|------------|----------|------|-------------|
	| XwinCoder-7B | [link](https://huggingface.co/Xwin-LM/XwinCoder-7B) | 63.8 | 57.4 | 31.5 |
	| XwinCoder-13B | [link](https://huggingface.co/Xwin-LM/XwinCoder-13B) | 68.8 | 60.1 | 35.4 |
	| XwinCoder-34B | [link](https://huggingface.co/Xwin-LM/XwinCoder-34B) | 74.2 | 64.8 | 43.0 |
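
The table reports pass@1 and pass@5 scores. This README does not say how they were computed, so the following is only a minimal sketch of the unbiased pass@k estimator commonly used for HumanEval-style benchmarks, assuming `n` samples are generated per problem and `c` of them pass the unit tests. The function name and arguments are illustrative and not taken from the XwinCoder evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct.

    Uses 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one correct sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical numbers for illustration only (not results from this model).
per_problem = [pass_at_k(10, 3, 1), pass_at_k(10, 0, 1), pass_at_k(10, 10, 1)]
benchmark_score = 100.0 * sum(per_problem) / len(per_problem)
```

The benchmark-level score is the mean of the per-problem estimates, expressed here as a percentage to match the table.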

	## Updates
	- 💥 We released [XwinCoder-7B](https://huggingface.co/Xwin-LM/XwinCoder-7B), [XwinCoder-13B](https://huggingface.co/Xwin-LM/XwinCoder-13B), and [XwinCoder-34B](https://huggingface.co/Xwin-LM/XwinCoder-34B). XwinCoder-34B reaches 74.2 pass@1 on HumanEval and achieves performance comparable to GPT-3.5-turbo on 6 benchmarks.

	- We support evaluating instruction-finetuned models on HumanEval, MBPP, APPS, DS1000 and MT-Bench. See our [GitHub repository](https://github.com/Xwin-LM/Xwin-LM/tree/main/Xwin-Coder).

	## Overview

	![Radar chart of benchmark results](rader.png)

	* To demonstrate our model's coding capabilities in real-world usage scenarios, we evaluated it on several mainstream coding benchmarks rather than only on HumanEval, currently the most popular one.
	* As the radar chart shows, our 34B model achieves coding performance comparable to GPT-3.5-turbo.
	* To keep the visualization accurate, the axes of the radar chart are only translated, not rescaled. The one exception is the MT-Bench score, which is multiplied by 10 to make it comparable with the other benchmarks.
	* Multiple-E-avg6 refers to the 6 languages used in the CodeLLaMA paper. We ran the GPT-4 and GPT-3.5-turbo evaluations ourselves; more details will be released later.
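
The axis treatment described above (translation only, with MT-Bench multiplied by 10) can be sketched as follows. This is an illustrative sketch, not the authors' plotting code: the function name, the `axis_origin` parameter, and all numbers below are hypothetical.

```python
def radar_value(score: float, axis_origin: float, is_mt_bench: bool = False) -> float:
    """Map a raw benchmark score to a radial distance on the chart.

    The score is only shifted by `axis_origin` (a hypothetical per-axis
    offset), never stretched, so gaps between models keep their true size.
    MT-Bench is first multiplied by 10, as the README describes.
    """
    if is_mt_bench:
        score = score * 10.0
    return score - axis_origin


# Hypothetical scores for two models on one axis with a hypothetical origin.
origin = 30.0
model_a = radar_value(70.0, origin)
model_b = radar_value(65.0, origin)
gap_on_chart = model_a - model_b  # equals the raw gap, 70.0 - 65.0

# A hypothetical MT-Bench score of 7.5 is plotted as if it were 75.
mt_bench_radius = radar_value(7.5, origin, is_mt_bench=True)
```

Because every axis is shifted but not stretched, a 5-point gap between two models is drawn at the same length on every axis.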

	## Demo
	We provide a chat demo in our GitHub repository. Here are some examples:
	![Chat demo](exm.gif)




	# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
	Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_Xwin-LM__XwinCoder-34B).

	| Metric                          |Value|
	|---------------------------------|----:|
	|Avg.                             |54.35|
	|AI2 Reasoning Challenge (25-Shot)|51.02|
	|HellaSwag (10-Shot)              |74.02|
	|MMLU (5-Shot)                    |49.53|
	|TruthfulQA (0-shot)              |43.82|
	|Winogrande (5-shot)              |68.35|
	|GSM8k (5-shot)                   |39.35|
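
As a quick consistency check, the `Avg.` row matches the unweighted mean of the six benchmark scores in the table. The snippet below is a minimal sketch that recomputes it; the dictionary is simply the table values copied by hand.

```python
# Scores copied from the table above.
scores = {
    "AI2 Reasoning Challenge (25-Shot)": 51.02,
    "HellaSwag (10-Shot)": 74.02,
    "MMLU (5-Shot)": 49.53,
    "TruthfulQA (0-shot)": 43.82,
    "Winogrande (5-shot)": 68.35,
    "GSM8k (5-shot)": 39.35,
}

# Unweighted mean over the six benchmarks, rounded to two decimals.
average = round(sum(scores.values()) / len(scores), 2)
print(average)  # 54.35
```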