|
--- |
|
license: apache-2.0 |
|
tags: |
|
- llm |
|
- fine-tune |
|
- yi |
|
datasets: |
|
- adamo1139/AEZAKMI_v2 |
|
license_name: yi-license |
|
license_link: LICENSE |
|
model-index: |
|
- name: Yi-34B-200K-AEZAKMI-v2 |
|
results: |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: AI2 Reasoning Challenge (25-Shot) |
|
type: ai2_arc |
|
config: ARC-Challenge |
|
split: test |
|
args: |
|
num_few_shot: 25 |
|
metrics: |
|
- type: acc_norm |
|
value: 67.92 |
|
name: normalized accuracy |
|
source: |
|
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=adamo1139/Yi-34B-200K-AEZAKMI-v2 |
|
name: Open LLM Leaderboard |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: HellaSwag (10-Shot) |
|
type: hellaswag |
|
split: validation |
|
args: |
|
num_few_shot: 10 |
|
metrics: |
|
- type: acc_norm |
|
value: 85.61 |
|
name: normalized accuracy |
|
source: |
|
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=adamo1139/Yi-34B-200K-AEZAKMI-v2 |
|
name: Open LLM Leaderboard |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: MMLU (5-Shot) |
|
type: cais/mmlu |
|
config: all |
|
split: test |
|
args: |
|
num_few_shot: 5 |
|
metrics: |
|
- type: acc |
|
value: 75.22 |
|
name: accuracy |
|
source: |
|
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=adamo1139/Yi-34B-200K-AEZAKMI-v2 |
|
name: Open LLM Leaderboard |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: TruthfulQA (0-shot) |
|
type: truthful_qa |
|
config: multiple_choice |
|
split: validation |
|
args: |
|
num_few_shot: 0 |
|
metrics: |
|
- type: mc2 |
|
value: 56.74 |
|
source: |
|
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=adamo1139/Yi-34B-200K-AEZAKMI-v2 |
|
name: Open LLM Leaderboard |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: Winogrande (5-shot) |
|
type: winogrande |
|
config: winogrande_xl |
|
split: validation |
|
args: |
|
num_few_shot: 5 |
|
metrics: |
|
- type: acc |
|
value: 81.61 |
|
name: accuracy |
|
source: |
|
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=adamo1139/Yi-34B-200K-AEZAKMI-v2 |
|
name: Open LLM Leaderboard |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: GSM8k (5-shot) |
|
type: gsm8k |
|
config: main |
|
split: test |
|
args: |
|
num_few_shot: 5 |
|
metrics: |
|
- type: acc |
|
value: 58.91 |
|
name: accuracy |
|
source: |
|
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=adamo1139/Yi-34B-200K-AEZAKMI-v2 |
|
name: Open LLM Leaderboard |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: IFEval (0-Shot) |
|
type: HuggingFaceH4/ifeval |
|
args: |
|
num_few_shot: 0 |
|
metrics: |
|
- type: inst_level_strict_acc and prompt_level_strict_acc |
|
value: 45.55 |
|
name: strict accuracy |
|
source: |
|
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=adamo1139/Yi-34B-200K-AEZAKMI-v2 |
|
name: Open LLM Leaderboard |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: BBH (3-Shot) |
|
type: BBH |
|
args: |
|
num_few_shot: 3 |
|
metrics: |
|
- type: acc_norm |
|
value: 35.28 |
|
name: normalized accuracy |
|
source: |
|
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=adamo1139/Yi-34B-200K-AEZAKMI-v2 |
|
name: Open LLM Leaderboard |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: MATH Lvl 5 (4-Shot) |
|
type: hendrycks/competition_math |
|
args: |
|
num_few_shot: 4 |
|
metrics: |
|
- type: exact_match |
|
value: 4.83 |
|
name: exact match |
|
source: |
|
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=adamo1139/Yi-34B-200K-AEZAKMI-v2 |
|
name: Open LLM Leaderboard |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: GPQA (0-shot) |
|
type: Idavidrein/gpqa |
|
args: |
|
num_few_shot: 0 |
|
metrics: |
|
- type: acc_norm |
|
value: 10.96 |
|
name: acc_norm |
|
source: |
|
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=adamo1139/Yi-34B-200K-AEZAKMI-v2 |
|
name: Open LLM Leaderboard |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: MuSR (0-shot) |
|
type: TAUR-Lab/MuSR |
|
args: |
|
num_few_shot: 0 |
|
metrics: |
|
- type: acc_norm |
|
value: 6.48 |
|
name: acc_norm |
|
source: |
|
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=adamo1139/Yi-34B-200K-AEZAKMI-v2 |
|
name: Open LLM Leaderboard |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: MMLU-PRO (5-shot) |
|
type: TIGER-Lab/MMLU-Pro |
|
config: main |
|
split: test |
|
args: |
|
num_few_shot: 5 |
|
metrics: |
|
- type: acc |
|
value: 39.03 |
|
name: accuracy |
|
source: |
|
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=adamo1139/Yi-34B-200K-AEZAKMI-v2 |
|
name: Open LLM Leaderboard |
|
--- |
|
|
|
## Model description |
|
|
|
Yi-34B 200K base model fine-tuned on the AEZAKMI v2 dataset. Training took around 25 hours on a single local RTX 3090 Ti.
|
It's like airoboros but with less gptslop, no refusals, and less of the typical language used by RLHF'd OpenAI models.
|
Say goodbye to "It's important to remember"! \ |
|
Prompt format is standard ChatML. Don't expect it to be good at math or riddles, or to be crazy smart. My end goal with AEZAKMI is to create a cozy free chatbot.
|
The cost of this fine-tune was about $10 in electricity. It took me 3 tries to get it right.
|
The base model used for fine-tuning was the 200K-context Yi-34B-Llama model shared by larryvrh.
|
|
|
I had to lower max_position_embeddings in config.json and model_max_length for training to start; otherwise I was OOMing straight away.
|
My first attempt had max_position_embeddings set to 16384 and model_max_length set to 200000. This allowed fine-tuning to finish, but that model was broken after applying the LoRA and merging it. \
|
This attempt had both max_position_embeddings and model_max_length set to 4096, which worked perfectly fine. |
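
For reference, a minimal sketch of that override (assuming model_max_length lives in tokenizer_config.json, as it usually does for HF tokenizers; the model directory name below is illustrative):

```python
import json
from pathlib import Path

# Illustrative path - point this at your local copy of the 200K base model.
model_dir = Path("Yi-34B-200K-Llamafied")

# Lower max_position_embeddings in config.json (200000 -> 4096) to avoid OOM during training.
config_path = model_dir / "config.json"
config = json.loads(config_path.read_text())
config["max_position_embeddings"] = 4096
config_path.write_text(json.dumps(config, indent=2))

# Lower model_max_length in tokenizer_config.json to match.
tok_path = model_dir / "tokenizer_config.json"
tok_config = json.loads(tok_path.read_text())
tok_config["model_max_length"] = 4096
tok_path.write_text(json.dumps(tok_config, indent=2))
```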
|
|
|
## Quants! |
|
|
|
Huge thank you to LoneStriker and TheBloke for providing quantized versions. |
|
|
|
EXL2 \ |
|
3bpw - https://huggingface.co/LoneStriker/Yi-34B-200K-AEZAKMI-v2-3.0bpw-h6-exl2 \ |
|
4bpw - https://huggingface.co/LoneStriker/Yi-34B-200K-AEZAKMI-v2-4.0bpw-h6-exl2 \ |
|
4.65bpw - https://huggingface.co/adamo1139/Yi-34B-200K-AEZAKMI-v2-exl2-4.65bpw \ |
|
5bpw - https://huggingface.co/LoneStriker/Yi-34B-200K-AEZAKMI-v2-5.0bpw-h6-exl2 \ |
|
6bpw - https://huggingface.co/LoneStriker/Yi-34B-200K-AEZAKMI-v2-6.0bpw-h6-exl2 \ |
|
8bpw - https://huggingface.co/LoneStriker/Yi-34B-200K-AEZAKMI-v2-8.0bpw-h8-exl2 |
|
|
|
GGUF - https://huggingface.co/TheBloke/Yi-34B-200K-AEZAKMI-v2-GGUF |
|
|
|
GPTQ - https://huggingface.co/TheBloke/Yi-34B-200K-AEZAKMI-v2-GPTQ |
|
|
|
AWQ - https://huggingface.co/TheBloke/Yi-34B-200K-AEZAKMI-v2-AWQ |
|
|
|
|
|
## Prompt Format |
|
|
|
I recommend using the ChatML format, as this is what was used during the fine-tune. \
|
Here's the prompt format you should use. You can set a different system message; the model seems to respect that fine, so it wasn't overfitted to a single one.
|
|
|
``` |
|
<|im_start|>system |
|
A chat with uncensored assistant.<|im_end|> |
|
<|im_start|>user |
|
{prompt}<|im_end|> |
|
<|im_start|>assistant |
|
``` |
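
If you are building prompts by hand rather than through a chat template, a small helper like this (plain Python string formatting; the function name is just illustrative) reproduces the format above:

```python
def build_chatml_prompt(user_message: str,
                        system_message: str = "A chat with uncensored assistant.") -> str:
    """Assemble a single-turn ChatML prompt in the format used during fine-tuning."""
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(build_chatml_prompt("Write a short story about a seaside town."))
```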
|
|
|
## Intended uses & limitations |
|
|
|
Use is limited by the Apache 2.0 license.
|
|
|
## Known Issues |
|
|
|
I recommend setting the repetition penalty to around 1.05 to avoid repetition. So far I have had good results running this model with temperature 1.2.
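
As a concrete sketch of those settings (assuming the Hugging Face transformers generation API; max_new_tokens is an arbitrary placeholder):

```python
from transformers import GenerationConfig

# Sampling settings matching the recommendations above.
generation_config = GenerationConfig(
    do_sample=True,
    temperature=1.2,
    repetition_penalty=1.05,
    max_new_tokens=512,
)
# Pass to model.generate(..., generation_config=generation_config)
# together with a ChatML-formatted prompt.
```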
|
Stories have ChatGPT-like paragraph spacing; I may work on this in the future, but it's not a high priority.
|
|
|
I'm not quite happy yet with the unrestricted-ness of v2, especially when using the prompt "A chat.". \
|
With a slightly modified prompt it works much better; I recommend using "A chat with uncensored assistant." - it's stupid but it helps a lot. \
|
The base Yi-34B model is contaminated with refusals, and this carries over to all models trained on Yi-34B. \
|
My next project is an attempt to de-contaminate the base Yi-34B 4K and Yi-34B 200K models using DPO, with preferred data coming from uncontaminated raw models. I plan to release that dataset openly.
|
|
|
I was made aware that the phrase "sending shivers down a spine" occurred frequently in v1's RP generations, so I fixed those samples - it should be better now. \
|
I can hold up to 24000 ctx with the 4.65bpw EXL2 version and 8-bit cache - long context should work as well as in other models trained on the 200K version of Yi-34B. \
|
There is also some issue with handling long system messages for RP; I was planning to investigate it for v2 but didn't.
|
|
|
|
|
## Axolotl training parameters |
|
|
|
- bnb_4bit_use_double_quant: true |
|
- is_llama_derived_model: true |
|
- load_in_4bit: true |
|
- adapter: qlora |
|
- sequence_len: 1400 |
|
- sample_packing: true |
|
- lora_r: 16 |
|
- lora_alpha: 32 |
|
- lora_target_modules: |
|
- q_proj |
|
- v_proj |
|
- k_proj |
|
- o_proj |
|
- gate_proj |
|
- down_proj |
|
- up_proj |
|
- lora_target_linear: true |
|
- pad_to_sequence_len: false |
|
- micro_batch_size: 1 |
|
- gradient_accumulation_steps: 1 |
|
- num_epochs: 2.4 |
|
- optimizer: adamw_bnb_8bit |
|
- lr_scheduler: constant |
|
- learning_rate: 0.00005 |
|
- train_on_inputs: false |
|
- group_by_length: false |
|
- bf16: true |
|
- bfloat16: true |
|
- flash_optimum: false |
|
- gradient_checkpointing: true |
|
- flash_attention: true |
|
- seed: 42 |
|
|
|
|
|
## Upcoming |
|
|
|
I will probably be working on de-contaminating the base Yi-34B model now. \
|
My second run of the AEZAKMI v2 fine-tune was just 0.15 epochs, and I really like how natural that model is and how rich its vocabulary is. I will try to train for less time to hit the sweet spot. \
|
I will be uploading the LoRA adapter for that second 0.15-epoch run. \
|
I believe I might have gotten what I wanted if I had stopped training sooner. I don't have checkpoints from more than 1500 steps back, so I would need to re-run training to get it back.
|
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) |
|
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_adamo1139__Yi-34B-200K-AEZAKMI-v2) |
|
|
|
| Metric |Value| |
|
|---------------------------------|----:| |
|
|Avg. |71.00| |
|
|AI2 Reasoning Challenge (25-Shot)|67.92| |
|
|HellaSwag (10-Shot) |85.61| |
|
|MMLU (5-Shot) |75.22| |
|
|TruthfulQA (0-shot) |56.74| |
|
|Winogrande (5-shot) |81.61| |
|
|GSM8k (5-shot) |58.91| |
|
|
|
|
|
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) |
|
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_adamo1139__Yi-34B-200K-AEZAKMI-v2) |
|
|
|
| Metric |Value| |
|
|-------------------|----:| |
|
|Avg. |23.69| |
|
|IFEval (0-Shot) |45.55| |
|
|BBH (3-Shot) |35.28| |
|
|MATH Lvl 5 (4-Shot)| 4.83| |
|
|GPQA (0-shot) |10.96| |
|
|MuSR (0-shot) | 6.48| |
|
|MMLU-PRO (5-shot) |39.03| |
|
|
|
|