leaderboard-pr-bot's picture
Adding Evaluation Results
65eca52 verified
|
raw
history blame
7.93 kB
---
language:
- en
license: apache-2.0
tags:
- sft
pipeline_tag: text-generation
widget:
- text: <|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|>
- text: <|prompter|>What's the Earth total population<|endoftext|><|assistant|>
- text: <|prompter|>Write a story about future of AI development<|endoftext|><|assistant|>
model-index:
- name: oasst-sft-4-pythia-12b-epoch-3.5
results:
- task:
type: text-generation
name: Text Generation
dataset:
name: AI2 Reasoning Challenge (25-Shot)
type: ai2_arc
config: ARC-Challenge
split: test
args:
num_few_shot: 25
metrics:
- type: acc_norm
value: 45.73
name: normalized accuracy
source:
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: HellaSwag (10-Shot)
type: hellaswag
split: validation
args:
num_few_shot: 10
metrics:
- type: acc_norm
value: 68.59
name: normalized accuracy
source:
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MMLU (5-Shot)
type: cais/mmlu
config: all
split: test
args:
num_few_shot: 5
metrics:
- type: acc
value: 26.82
name: accuracy
source:
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: TruthfulQA (0-shot)
type: truthful_qa
config: multiple_choice
split: validation
args:
num_few_shot: 0
metrics:
- type: mc2
value: 37.81
source:
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: Winogrande (5-shot)
type: winogrande
config: winogrande_xl
split: validation
args:
num_few_shot: 5
metrics:
- type: acc
value: 65.9
name: accuracy
source:
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: GSM8k (5-shot)
type: gsm8k
config: main
split: test
args:
num_few_shot: 5
metrics:
- type: acc
value: 3.03
name: accuracy
source:
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5
name: Open LLM Leaderboard
---
# Open-Assistant SFT-4 12B Model
This is the 4th iteration English supervised-fine-tuning (SFT) model of
the [Open-Assistant](https://github.com/LAION-AI/Open-Assistant) project.
It is based on a Pythia 12B that was fine-tuned on human demonstrations
of assistant conversations collected through the
[https://open-assistant.io/](https://open-assistant.io/) human feedback web
app before March 25, 2023.
## Model Details
- **Developed by:** [Open-Assistant Contributors](https://open-assistant.io/)
- **Model type:** Transformer-based Language Model
- **Language:** English
- **Finetuned from:** [EleutherAI / pythia-12b-deduped](https://huggingface.co/EleutherAI/pythia-12b-deduped)
- **Code:** [Open-Assistant/model/model_training](https://github.com/LAION-AI/Open-Assistant/tree/main/model/model_training)
- **Demo:** [Continuations for 250 random prompts](https://open-assistant.github.io/oasst-model-eval/?f=https%3A%2F%2Fraw.githubusercontent.com%2FOpen-Assistant%2Foasst-model-eval%2Fmain%2Fsampling_reports%2Foasst-sft%2F2023-04-03_andreaskoepf_oasst-sft-4-pythia-12b-epoch-3_5_sampling_noprefix_lottery.json%0Ahttps%3A%2F%2Fraw.githubusercontent.com%2FOpen-Assistant%2Foasst-model-eval%2Fmain%2Fsampling_reports%2Fchat-gpt%2F2023-04-11_gpt-3.5-turbo_lottery.json)
- **License:** Apache 2.0
- **Contact:** [Open-Assistant Discord](https://ykilcher.com/open-assistant-discord)
## Prompting
Two special tokens are used to mark the beginning of user and assistant turns:
`<|prompter|>` and `<|assistant|>`. Each turn ends with a `<|endoftext|>` token.
Input prompt example:
```
<|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|>
```
The input ends with the `<|assistant|>` token to signal that the model should
start generating the assistant reply.
## Dev Details
- wandb: https://wandb.ai/open-assistant/supervised-finetuning/runs/770a0t41
- base model: [andreaskoepf/pythia-12b-pre-2000](https://huggingface.co/andreaskoepf/pythia-12b-pre-2000)
- checkpoint: 4000 steps
command: `deepspeed trainer_sft.py --configs defaults reference-data reference-pythia-12b --cache_dir /home/ubuntu/data_cache --output_dir .saved/oasst-sft-3-pythia-12b-reference_2kpre --num_train_epochs 8 --residual_dropout 0.2 --deepspeed --use_flash_attention true --model_name andreaskoepf/pythia-12b-pre-2000`
data:
```
reference-data:
datasets:
- oasst_export:
lang: "bg,ca,cs,da,de,en,es,fr,hr,hu,it,nl,pl,pt,ro,ru,sl,sr,sv,uk"
input_file_path: 2023-03-25_oasst_research_ready_synth_labels.jsonl.gz
val_split: 0.05
- alpaca
sort_by_length: false
use_custom_sampler: false
```
pythia:
```
reference-pythia-12b:
dtype: fp16
log_dir: "pythia_log_12b"
learning_rate: 6e-6
model_name: EleutherAI/pythia-12b-deduped
output_dir: pythia_model_12b
weight_decay: 0.0
max_length: 2048
warmup_steps: 100
gradient_checkpointing: true
gradient_accumulation_steps: 2
per_device_train_batch_size: 4
per_device_eval_batch_size: 4
eval_steps: 100
save_steps: 1000
num_train_epochs: 8
save_total_limit: 4
```
zero config:
```
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto",
"total_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 2,
"allgather_partitions": true,
"allgather_bucket_size": 1e9,
"overlap_comm": false,
"reduce_scatter": true,
"reduce_bucket_size": 1e9,
"contiguous_gradients": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
```
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_OpenAssistant__oasst-sft-4-pythia-12b-epoch-3.5)
| Metric |Value|
|---------------------------------|----:|
|Avg. |41.31|
|AI2 Reasoning Challenge (25-Shot)|45.73|
|HellaSwag (10-Shot) |68.59|
|MMLU (5-Shot) |26.82|
|TruthfulQA (0-shot) |37.81|
|Winogrande (5-shot) |65.90|
|GSM8k (5-shot) | 3.03|