matthieumeeus97
commited on
Commit
•
b39943f
1
Parent(s):
0ff0630
Update README.md
Browse files
README.md
CHANGED
@@ -1,38 +1,34 @@
|
|
1 |
---
|
2 |
-
|
3 |
-
|
4 |
-
|
5 |
-
-
|
6 |
-
- dpo
|
7 |
-
- generated_from_trainer
|
8 |
-
base_model: llama-2-nl/Llama-2-7b-hf-lora-original-sft
|
9 |
datasets:
|
|
|
|
|
|
|
|
|
|
|
10 |
- BramVanroy/ultra_feedback_dutch
|
11 |
-
|
12 |
-
- name: Llama-2-7b-hf-lora-original-it
|
13 |
-
results: []
|
14 |
---
|
15 |
|
16 |
-
|
17 |
-
|
|
|
|
|
|
|
|
|
|
|
18 |
|
19 |
-
|
20 |
|
21 |
-
|
22 |
-
|
23 |
-
- Loss: 0.3536
|
24 |
-
- Rewards/chosen: 0.1143
|
25 |
-
- Rewards/rejected: -0.9295
|
26 |
-
- Rewards/accuracies: 0.9396
|
27 |
-
- Rewards/margins: 1.0437
|
28 |
-
- Logps/rejected: -547.4578
|
29 |
-
- Logps/chosen: -600.8353
|
30 |
-
- Logits/rejected: -0.8732
|
31 |
-
- Logits/chosen: -0.9594
|
32 |
|
33 |
-
|
34 |
|
35 |
-
```
|
36 |
from transformers import AutoTokenizer, AutoModelForCausalLM
|
37 |
|
38 |
tokenizer = AutoTokenizer.from_pretrained('ChocoLlama/ChocoLlama-2-7B-instruct')
|
@@ -64,26 +60,68 @@ outputs = model.generate(
|
|
64 |
)
|
65 |
response = outputs[0][input_ids.shape[-1]:]
|
66 |
print(tokenizer.decode(response, skip_special_tokens=True))
|
67 |
-
|
68 |
```
|
69 |
|
70 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
71 |
|
72 |
-
|
|
|
73 |
|
74 |
-
##
|
75 |
|
76 |
-
|
77 |
|
78 |
-
|
|
|
79 |
|
80 |
-
|
81 |
|
82 |
-
|
83 |
|
84 |
-
|
85 |
|
86 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
87 |
- learning_rate: 5e-07
|
88 |
- train_batch_size: 4
|
89 |
- eval_batch_size: 4
|
@@ -98,22 +136,39 @@ The following hyperparameters were used during training:
|
|
98 |
- lr_scheduler_warmup_ratio: 0.1
|
99 |
- num_epochs: 1
|
100 |
|
101 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
102 |
|
103 |
-
|
104 |
-
|:-------------:|:------:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
|
105 |
-
| 0.5984 | 0.1327 | 100 | 0.5904 | 0.0549 | -0.1735 | 0.9030 | 0.2283 | -539.8975 | -601.4293 | -1.1606 | -1.1395 |
|
106 |
-
| 0.4622 | 0.2653 | 200 | 0.4581 | 0.1134 | -0.4980 | 0.9351 | 0.6113 | -543.1426 | -600.8441 | -1.2714 | -1.2180 |
|
107 |
-
| 0.3934 | 0.3980 | 300 | 0.3959 | 0.1263 | -0.7212 | 0.9366 | 0.8475 | -545.3747 | -600.7144 | -1.0528 | -1.0755 |
|
108 |
-
| 0.3629 | 0.5307 | 400 | 0.3674 | 0.1170 | -0.8608 | 0.9381 | 0.9777 | -546.7705 | -600.8080 | -1.1109 | -1.1154 |
|
109 |
-
| 0.3556 | 0.6633 | 500 | 0.3561 | 0.1136 | -0.9146 | 0.9388 | 1.0282 | -547.3090 | -600.8419 | -0.8266 | -0.9289 |
|
110 |
-
| 0.3488 | 0.7960 | 600 | 0.3540 | 0.1104 | -0.9310 | 0.9410 | 1.0415 | -547.4734 | -600.8737 | -1.0676 | -1.0877 |
|
111 |
-
| 0.3563 | 0.9287 | 700 | 0.3540 | 0.1166 | -0.9259 | 0.9396 | 1.0425 | -547.4224 | -600.8121 | -0.8736 | -0.9600 |
|
112 |
|
|
|
|
|
113 |
|
114 |
-
###
|
115 |
|
116 |
-
|
117 |
-
- Pytorch 2.1.2+cu121
|
118 |
-
- Datasets 2.19.0
|
119 |
-
- Tokenizers 0.19.1
|
|
|
1 |
---
|
2 |
+
language:
|
3 |
+
- nl
|
4 |
+
license: cc-by-nc-4.0
|
5 |
+
base_model: ChocoLlama/ChocoLlama-2-7B-base
|
|
|
|
|
|
|
6 |
datasets:
|
7 |
+
- BramVanroy/ultrachat_200k_dutch
|
8 |
+
- BramVanroy/stackoverflow-chat-dutch
|
9 |
+
- BramVanroy/alpaca-cleaned-dutch
|
10 |
+
- BramVanroy/dolly-15k-dutch
|
11 |
+
- BramVanroy/no_robots_dutch
|
12 |
- BramVanroy/ultra_feedback_dutch
|
13 |
+
|
|
|
|
|
14 |
---
|
15 |
|
16 |
+
<p align="center" style="margin:0;padding:0">
|
17 |
+
<img src="./chocollama_logo.png" alt="ChocoLlama logo" width="500" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
|
18 |
+
</p>
|
19 |
+
<div style="margin:auto; text-align:center">
|
20 |
+
<h1 style="margin-bottom: 0">ChocoLlama</h1>
|
21 |
+
<em>A Llama-2/3-based family of Dutch language models</em>
|
22 |
+
</div>
|
23 |
|
24 |
+
## ChocoLlama-2-7B-instruct: Getting Started
|
25 |
|
26 |
+
We here present **ChocoLlama-2-7B-instruct**, an instruction-tuned version of ChocoLlama-2-7B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets, using SFT followed by DPO.
|
27 |
+
Its base model, [ChocoLlama-2-7B-base](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-base), is a language-adapted version of Meta's Llama-2-7b, fine-tuned on a Dutch dataset of 104GB using LoRa.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
28 |
|
29 |
+
Use the code below to get started with the model.
|
30 |
|
31 |
+
```python
|
32 |
from transformers import AutoTokenizer, AutoModelForCausalLM
|
33 |
|
34 |
tokenizer = AutoTokenizer.from_pretrained('ChocoLlama/ChocoLlama-2-7B-instruct')
|
|
|
60 |
)
|
61 |
response = outputs[0][input_ids.shape[-1]:]
|
62 |
print(tokenizer.decode(response, skip_special_tokens=True))
|
|
|
63 |
```
|
64 |
|
65 |
+
Note that the datasets used for instruction-tuning were translated using GPT-3.5/4, which means that this instruction-tuned model can not be used for commercial purposes.
|
66 |
+
Hence, for any commercial applications, we recommend finetuning the base model on your own Dutch data.
|
67 |
+
|
68 |
+
## Model Details
|
69 |
+
|
70 |
+
ChocoLlama is a family of open LLM's specifically adapted to Dutch, contributing to the state-of-the-art of Dutch open LLM's in their weight class.
|
71 |
+
|
72 |
+
We provide 6 variants (of which 3 base and 3 instruction-tuned models):
|
73 |
+
- **ChocoLlama-2-7B-base** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-base)): A language-adapted version of Meta's Llama-2-7b, fine-tuned on a Dutch dataset of 104GB using LoRa.
|
74 |
+
- **ChocoLlama-2-7B-instruct** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-instruct)): An instruction-tuned version of ChocoLlama-2-7B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets, using SFT followed by DPO.
|
75 |
+
- **ChocoLlama-2-7B-tokentrans-base** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-tokentrans-base)): A language-adapted version of Meta's Llama-2-7b, using a Dutch RoBERTa-based tokenizer. The token embeddings of this model were reinitialized using the token translation algorithm proposed by [Remy et al.](https://arxiv.org/pdf/2310.03477). The model was subsequently fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRa.
|
76 |
+
- **ChocoLlama-2-7B-tokentrans-instruct** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-tokentrans-instruct)): An instruction-tuned version of ChocoLlama-2-7B-tokentrans-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.
|
77 |
+
- **Llama-3-ChocoLlama-8B-base** ([link](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-base)): A language-adapted version of Meta's Llama-8-8B, fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRa.
|
78 |
+
- **Llama-3-ChocoLlama-instruct** ([link](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-instruct)): An instruction-tuned version of Llama-3-ChocoLlama-8B-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.
|
79 |
+
|
80 |
+
For benchmark results for all models, including compared to their base models and other Dutch LLMs, we refer to our paper [here](some_url).
|
81 |
+
|
82 |
+
### Model Description
|
83 |
+
|
84 |
+
- **Developed by:** [Matthieu Meeus](https://huggingface.co/matthieumeeus97), [Anthony Rathé](https://huggingface.co/anthonyrathe)
|
85 |
+
- **Funded by:** [Vlaams Supercomputer Centrum](https://www.vscentrum.be/), through a grant of apx. 40K GPU hours (NVIDIA H100-80GB)
|
86 |
+
- **Language(s):** Dutch
|
87 |
+
- **License:** [Llama-2 Community License](https://ai.meta.com/llama/license/)
|
88 |
+
- **Finetuned from model:** [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)
|
89 |
+
|
90 |
+
### Model Sources
|
91 |
|
92 |
+
- **Repository:** Will be released soon.
|
93 |
+
- **Paper:** Will be released soon.
|
94 |
|
95 |
+
## Uses
|
96 |
|
97 |
+
### Direct Use
|
98 |
|
99 |
+
This is an instruction-tuned (SFT + DPO) Dutch model, optimized for Dutch language generation in conversational settings.
|
100 |
+
For optimal behavior, we advice to only use the model with the correct chat template (see Python code above), potentially supported by a system prompt.
|
101 |
|
102 |
+
### Out-of-Scope Use
|
103 |
|
104 |
+
Use-cases requiring understanding or generation of text in languages other than Dutch: the dataset on which this model was fine-tuned does not contain data in languages other than Dutch, hence we expect significant catastrophic forgetting to have occured for English, which is the language Llama-2 was originally trained for.
|
105 |
|
106 |
+
## Bias, Risks, and Limitations
|
107 |
|
108 |
+
We have taken care to include only widely used and high-quality data in our dataset. Some of this data has been filtered by the original creators.
|
109 |
+
However we did not explicitly conduct any additional filtering of this dataset with regards to biased or otherwise harmful content.
|
110 |
+
|
111 |
+
## Training Details
|
112 |
+
|
113 |
+
We adopt the same strategy as used to align GEITje-7B to [GEITje-7B-ultra](https://huggingface.co/BramVanroy/GEITje-7B-ultra).
|
114 |
+
First, we apply supervised finetuning (SFT), utilizing the data made available by [Vanroy](https://arxiv.org/pdf/2312.12852):
|
115 |
+
- [BramVanroy/ultrachat_200k_dutch](https://huggingface.co/datasets/BramVanroy/ultrachat_200k_dutch)
|
116 |
+
- [BramVanroy/no_robots_dutch](https://huggingface.co/datasets/BramVanroy/no_robots_dutch)
|
117 |
+
- [BramVanroy/stackoverflow-chat-dutch](https://huggingface.co/datasets/BramVanroy/stackoverflow-chat-dutch)
|
118 |
+
- [BramVanroy/alpaca-cleaned-dutch](https://huggingface.co/datasets/BramVanroy/alpaca-cleaned-dutch)
|
119 |
+
- [BramVanroy/dolly-15k-dutch](https://huggingface.co/datasets/BramVanroy/dolly-15k-dutch)
|
120 |
+
|
121 |
+
Next, we apply Direct Preference Optimization (DPO) to the SFT version of all the pretrained models we here develop,
|
122 |
+
now utilizing a Dutch version of the data used to train Zephyr-7B-$\beta$, [BramVanroy/ultra_feedback_dutch](https://huggingface.co/datasets/BramVanroy/ultra_feedback_dutch).
|
123 |
+
|
124 |
+
For both the SFT and DPO stage, we update all model weights and apply the same set of hyperparameters to all models as used in GEITje-7B-ultra:
|
125 |
- learning_rate: 5e-07
|
126 |
- train_batch_size: 4
|
127 |
- eval_batch_size: 4
|
|
|
136 |
- lr_scheduler_warmup_ratio: 0.1
|
137 |
- num_epochs: 1
|
138 |
|
139 |
+
Further, we leverage the publicly available [alignment handbook](https://github.com/huggingface/alignment-handbook) and use a set of 4 NVIDIA A100 (80 GB RAM) for both stages.
|
140 |
+
|
141 |
+
## Evaluation
|
142 |
+
|
143 |
+
### Quantitative evaluation
|
144 |
+
|
145 |
+
We have evaluated our models on several industry-standard Dutch benchmarks, translated from their original versions. The results can be found in the table below, together with results from several other prominent Dutch models.
|
146 |
+
|
147 |
+
| Model | ARC | HellaSwag | MMLU | TruthfulQA | Avg. |
|
148 |
+
|----------------------------------------------|----------------|----------------|----------------|----------------|----------------|
|
149 |
+
| **Llama-3-ChocoLlama-instruct** | **0.48** | **0.66** | **0.49** | **0.49** | **0.53** |
|
150 |
+
| llama-3-8B-rebatch | 0.44 | 0.64 | 0.46 | 0.48 | 0.51 |
|
151 |
+
| llama-3-8B-instruct | 0.47 | 0.59 | 0.47 | 0.52 | 0.51 |
|
152 |
+
| llama-3-8B | 0.44 | 0.64 | 0.47 | 0.45 | 0.5 |
|
153 |
+
| Reynaerde-7B-Chat | 0.44 | 0.62 | 0.39 | 0.52 | 0.49 |
|
154 |
+
| **Llama-3-ChocoLlama-base** | **0.45** | **0.64** | **0.44** | **0.44** | **0.49** |
|
155 |
+
| zephyr-7b-beta | 0.43 | 0.58 | 0.43 | 0.53 | 0.49 |
|
156 |
+
| geitje-7b-ultra | 0.40 | 0.66 | 0.36 | 0.49 | 0.48 |
|
157 |
+
| **ChocoLlama-2-7B-tokentrans-instruct** | **0.45** | **0.62** | **0.34** | **0.42** | **0.46** |
|
158 |
+
| mistral-7b-v0.1 | 0.43 | 0.58 | 0.37 | 0.45 | 0.46 |
|
159 |
+
| **ChocoLlama-2-7B-tokentrans-base** | **0.42** | **0.61** | **0.32** | **0.43** | **0.45** |
|
160 |
+
| **ChocoLlama-2-7B-instruct** | **0.36** | **0.57** | **0.33** | **0.45** | **0.43 |
|
161 |
+
| **ChocoLlama-2-7B-base** | **0.35** | **0.56** | **0.31** | **0.43** | **0.41** |
|
162 |
+
| llama-2-7b-chat-hf | 0.36 | 0.49 | 0.33 | 0.44 | 0.41 |
|
163 |
+
| llama-2-7b-hf | 0.36 | 0.51 | 0.32 | 0.41 | 0.40 |
|
164 |
+
|
165 |
+
On average, Llama-3-ChocoLlama-instruct surpasses the previous state-of-the-art on these benchmarks.
|
166 |
|
167 |
+
### Qualitative evaluation
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
168 |
|
169 |
+
In our paper, we also provide an additional qualitative evaluation of all models - which we empirically find more reliable.
|
170 |
+
For details, we refer to the paper and to our benchmark [ChocoLlama-Bench](https://huggingface.co/datasets/ChocoLlama/ChocoLlama-Bench).
|
171 |
|
172 |
+
### Compute Infrastructure
|
173 |
|
174 |
+
All ChocoLlama models have been trained on the compute cluster provided by the [Flemish Supercomputer Center (VSC)](https://www.vscentrum.be/). We used 8 to 16 NVIDIA H100 GPU's with 80 GB of VRAM.
|
|
|
|
|
|