matthieumeeus97 commited on
Commit
b39943f
1 Parent(s): 0ff0630

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +106 -51
README.md CHANGED
@@ -1,38 +1,34 @@
1
  ---
2
- license: llama2
3
- tags:
4
- - alignment-handbook
5
- - trl
6
- - dpo
7
- - generated_from_trainer
8
- base_model: llama-2-nl/Llama-2-7b-hf-lora-original-sft
9
  datasets:
 
 
 
 
 
10
  - BramVanroy/ultra_feedback_dutch
11
- model-index:
12
- - name: Llama-2-7b-hf-lora-original-it
13
- results: []
14
  ---
15
 
16
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
17
- should probably proofread and complete it, then remove this comment. -->
 
 
 
 
 
18
 
19
- # ChocoLlama-2-7B-instruct
20
 
21
- This model is a fine-tuned version of [ChocoLlama/ChocoLlama-2-7B-base](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-base) on the BramVanroy/ultra_feedback_dutch dataset.
22
- It achieves the following results on the evaluation set:
23
- - Loss: 0.3536
24
- - Rewards/chosen: 0.1143
25
- - Rewards/rejected: -0.9295
26
- - Rewards/accuracies: 0.9396
27
- - Rewards/margins: 1.0437
28
- - Logps/rejected: -547.4578
29
- - Logps/chosen: -600.8353
30
- - Logits/rejected: -0.8732
31
- - Logits/chosen: -0.9594
32
 
33
- # Use the model
34
 
35
- ```
36
  from transformers import AutoTokenizer, AutoModelForCausalLM
37
 
38
  tokenizer = AutoTokenizer.from_pretrained('ChocoLlama/ChocoLlama-2-7B-instruct')
@@ -64,26 +60,68 @@ outputs = model.generate(
64
  )
65
  response = outputs[0][input_ids.shape[-1]:]
66
  print(tokenizer.decode(response, skip_special_tokens=True))
67
-
68
  ```
69
 
70
- ## Model description
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
71
 
72
- More information needed
 
73
 
74
- ## Intended uses & limitations
75
 
76
- More information needed
77
 
78
- ## Training and evaluation data
 
79
 
80
- More information needed
81
 
82
- ## Training procedure
83
 
84
- ### Training hyperparameters
85
 
86
- The following hyperparameters were used during training:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
87
  - learning_rate: 5e-07
88
  - train_batch_size: 4
89
  - eval_batch_size: 4
@@ -98,22 +136,39 @@ The following hyperparameters were used during training:
98
  - lr_scheduler_warmup_ratio: 0.1
99
  - num_epochs: 1
100
 
101
- ### Training results
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
102
 
103
- | Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
104
- |:-------------:|:------:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
105
- | 0.5984 | 0.1327 | 100 | 0.5904 | 0.0549 | -0.1735 | 0.9030 | 0.2283 | -539.8975 | -601.4293 | -1.1606 | -1.1395 |
106
- | 0.4622 | 0.2653 | 200 | 0.4581 | 0.1134 | -0.4980 | 0.9351 | 0.6113 | -543.1426 | -600.8441 | -1.2714 | -1.2180 |
107
- | 0.3934 | 0.3980 | 300 | 0.3959 | 0.1263 | -0.7212 | 0.9366 | 0.8475 | -545.3747 | -600.7144 | -1.0528 | -1.0755 |
108
- | 0.3629 | 0.5307 | 400 | 0.3674 | 0.1170 | -0.8608 | 0.9381 | 0.9777 | -546.7705 | -600.8080 | -1.1109 | -1.1154 |
109
- | 0.3556 | 0.6633 | 500 | 0.3561 | 0.1136 | -0.9146 | 0.9388 | 1.0282 | -547.3090 | -600.8419 | -0.8266 | -0.9289 |
110
- | 0.3488 | 0.7960 | 600 | 0.3540 | 0.1104 | -0.9310 | 0.9410 | 1.0415 | -547.4734 | -600.8737 | -1.0676 | -1.0877 |
111
- | 0.3563 | 0.9287 | 700 | 0.3540 | 0.1166 | -0.9259 | 0.9396 | 1.0425 | -547.4224 | -600.8121 | -0.8736 | -0.9600 |
112
 
 
 
113
 
114
- ### Framework versions
115
 
116
- - Transformers 4.40.1
117
- - Pytorch 2.1.2+cu121
118
- - Datasets 2.19.0
119
- - Tokenizers 0.19.1
 
1
  ---
2
+ language:
3
+ - nl
4
+ license: cc-by-nc-4.0
5
+ base_model: ChocoLlama/ChocoLlama-2-7B-base
 
 
 
6
  datasets:
7
+ - BramVanroy/ultrachat_200k_dutch
8
+ - BramVanroy/stackoverflow-chat-dutch
9
+ - BramVanroy/alpaca-cleaned-dutch
10
+ - BramVanroy/dolly-15k-dutch
11
+ - BramVanroy/no_robots_dutch
12
  - BramVanroy/ultra_feedback_dutch
13
+
 
 
14
  ---
15
 
16
+ <p align="center" style="margin:0;padding:0">
17
+ <img src="./chocollama_logo.png" alt="ChocoLlama logo" width="500" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
18
+ </p>
19
+ <div style="margin:auto; text-align:center">
20
+ <h1 style="margin-bottom: 0">ChocoLlama</h1>
21
+ <em>A Llama-2/3-based family of Dutch language models</em>
22
+ </div>
23
 
24
+ ## ChocoLlama-2-7B-instruct: Getting Started
25
 
26
+ We here present **ChocoLlama-2-7B-instruct**, an instruction-tuned version of ChocoLlama-2-7B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets, using SFT followed by DPO.
27
+ Its base model, [ChocoLlama-2-7B-base](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-base), is a language-adapted version of Meta's Llama-2-7b, fine-tuned on a Dutch dataset of 104GB using LoRa.
 
 
 
 
 
 
 
 
 
28
 
29
+ Use the code below to get started with the model.
30
 
31
+ ```python
32
  from transformers import AutoTokenizer, AutoModelForCausalLM
33
 
34
  tokenizer = AutoTokenizer.from_pretrained('ChocoLlama/ChocoLlama-2-7B-instruct')
 
60
  )
61
  response = outputs[0][input_ids.shape[-1]:]
62
  print(tokenizer.decode(response, skip_special_tokens=True))
 
63
  ```
64
 
65
+ Note that the datasets used for instruction-tuning were translated using GPT-3.5/4, which means that this instruction-tuned model can not be used for commercial purposes.
66
+ Hence, for any commercial applications, we recommend finetuning the base model on your own Dutch data.
67
+
68
+ ## Model Details
69
+
70
+ ChocoLlama is a family of open LLM's specifically adapted to Dutch, contributing to the state-of-the-art of Dutch open LLM's in their weight class.
71
+
72
+ We provide 6 variants (of which 3 base and 3 instruction-tuned models):
73
+ - **ChocoLlama-2-7B-base** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-base)): A language-adapted version of Meta's Llama-2-7b, fine-tuned on a Dutch dataset of 104GB using LoRa.
74
+ - **ChocoLlama-2-7B-instruct** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-instruct)): An instruction-tuned version of ChocoLlama-2-7B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets, using SFT followed by DPO.
75
+ - **ChocoLlama-2-7B-tokentrans-base** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-tokentrans-base)): A language-adapted version of Meta's Llama-2-7b, using a Dutch RoBERTa-based tokenizer. The token embeddings of this model were reinitialized using the token translation algorithm proposed by [Remy et al.](https://arxiv.org/pdf/2310.03477). The model was subsequently fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRa.
76
+ - **ChocoLlama-2-7B-tokentrans-instruct** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-tokentrans-instruct)): An instruction-tuned version of ChocoLlama-2-7B-tokentrans-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.
77
+ - **Llama-3-ChocoLlama-8B-base** ([link](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-base)): A language-adapted version of Meta's Llama-8-8B, fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRa.
78
+ - **Llama-3-ChocoLlama-instruct** ([link](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-instruct)): An instruction-tuned version of Llama-3-ChocoLlama-8B-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.
79
+
80
+ For benchmark results for all models, including compared to their base models and other Dutch LLMs, we refer to our paper [here](some_url).
81
+
82
+ ### Model Description
83
+
84
+ - **Developed by:** [Matthieu Meeus](https://huggingface.co/matthieumeeus97), [Anthony Rathé](https://huggingface.co/anthonyrathe)
85
+ - **Funded by:** [Vlaams Supercomputer Centrum](https://www.vscentrum.be/), through a grant of apx. 40K GPU hours (NVIDIA H100-80GB)
86
+ - **Language(s):** Dutch
87
+ - **License:** [Llama-2 Community License](https://ai.meta.com/llama/license/)
88
+ - **Finetuned from model:** [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)
89
+
90
+ ### Model Sources
91
 
92
+ - **Repository:** Will be released soon.
93
+ - **Paper:** Will be released soon.
94
 
95
+ ## Uses
96
 
97
+ ### Direct Use
98
 
99
+ This is an instruction-tuned (SFT + DPO) Dutch model, optimized for Dutch language generation in conversational settings.
100
+ For optimal behavior, we advice to only use the model with the correct chat template (see Python code above), potentially supported by a system prompt.
101
 
102
+ ### Out-of-Scope Use
103
 
104
+ Use-cases requiring understanding or generation of text in languages other than Dutch: the dataset on which this model was fine-tuned does not contain data in languages other than Dutch, hence we expect significant catastrophic forgetting to have occured for English, which is the language Llama-2 was originally trained for.
105
 
106
+ ## Bias, Risks, and Limitations
107
 
108
+ We have taken care to include only widely used and high-quality data in our dataset. Some of this data has been filtered by the original creators.
109
+ However we did not explicitly conduct any additional filtering of this dataset with regards to biased or otherwise harmful content.
110
+
111
+ ## Training Details
112
+
113
+ We adopt the same strategy as used to align GEITje-7B to [GEITje-7B-ultra](https://huggingface.co/BramVanroy/GEITje-7B-ultra).
114
+ First, we apply supervised finetuning (SFT), utilizing the data made available by [Vanroy](https://arxiv.org/pdf/2312.12852):
115
+ - [BramVanroy/ultrachat_200k_dutch](https://huggingface.co/datasets/BramVanroy/ultrachat_200k_dutch)
116
+ - [BramVanroy/no_robots_dutch](https://huggingface.co/datasets/BramVanroy/no_robots_dutch)
117
+ - [BramVanroy/stackoverflow-chat-dutch](https://huggingface.co/datasets/BramVanroy/stackoverflow-chat-dutch)
118
+ - [BramVanroy/alpaca-cleaned-dutch](https://huggingface.co/datasets/BramVanroy/alpaca-cleaned-dutch)
119
+ - [BramVanroy/dolly-15k-dutch](https://huggingface.co/datasets/BramVanroy/dolly-15k-dutch)
120
+
121
+ Next, we apply Direct Preference Optimization (DPO) to the SFT version of all the pretrained models we here develop,
122
+ now utilizing a Dutch version of the data used to train Zephyr-7B-$\beta$, [BramVanroy/ultra_feedback_dutch](https://huggingface.co/datasets/BramVanroy/ultra_feedback_dutch).
123
+
124
+ For both the SFT and DPO stage, we update all model weights and apply the same set of hyperparameters to all models as used in GEITje-7B-ultra:
125
  - learning_rate: 5e-07
126
  - train_batch_size: 4
127
  - eval_batch_size: 4
 
136
  - lr_scheduler_warmup_ratio: 0.1
137
  - num_epochs: 1
138
 
139
+ Further, we leverage the publicly available [alignment handbook](https://github.com/huggingface/alignment-handbook) and use a set of 4 NVIDIA A100 (80 GB RAM) for both stages.
140
+
141
+ ## Evaluation
142
+
143
+ ### Quantitative evaluation
144
+
145
+ We have evaluated our models on several industry-standard Dutch benchmarks, translated from their original versions. The results can be found in the table below, together with results from several other prominent Dutch models.
146
+
147
+ | Model | ARC | HellaSwag | MMLU | TruthfulQA | Avg. |
148
+ |----------------------------------------------|----------------|----------------|----------------|----------------|----------------|
149
+ | **Llama-3-ChocoLlama-instruct** | **0.48** | **0.66** | **0.49** | **0.49** | **0.53** |
150
+ | llama-3-8B-rebatch | 0.44 | 0.64 | 0.46 | 0.48 | 0.51 |
151
+ | llama-3-8B-instruct | 0.47 | 0.59 | 0.47 | 0.52 | 0.51 |
152
+ | llama-3-8B | 0.44 | 0.64 | 0.47 | 0.45 | 0.5 |
153
+ | Reynaerde-7B-Chat | 0.44 | 0.62 | 0.39 | 0.52 | 0.49 |
154
+ | **Llama-3-ChocoLlama-base** | **0.45** | **0.64** | **0.44** | **0.44** | **0.49** |
155
+ | zephyr-7b-beta | 0.43 | 0.58 | 0.43 | 0.53 | 0.49 |
156
+ | geitje-7b-ultra | 0.40 | 0.66 | 0.36 | 0.49 | 0.48 |
157
+ | **ChocoLlama-2-7B-tokentrans-instruct** | **0.45** | **0.62** | **0.34** | **0.42** | **0.46** |
158
+ | mistral-7b-v0.1 | 0.43 | 0.58 | 0.37 | 0.45 | 0.46 |
159
+ | **ChocoLlama-2-7B-tokentrans-base** | **0.42** | **0.61** | **0.32** | **0.43** | **0.45** |
160
+ | **ChocoLlama-2-7B-instruct** | **0.36** | **0.57** | **0.33** | **0.45** | **0.43 |
161
+ | **ChocoLlama-2-7B-base** | **0.35** | **0.56** | **0.31** | **0.43** | **0.41** |
162
+ | llama-2-7b-chat-hf | 0.36 | 0.49 | 0.33 | 0.44 | 0.41 |
163
+ | llama-2-7b-hf | 0.36 | 0.51 | 0.32 | 0.41 | 0.40 |
164
+
165
+ On average, Llama-3-ChocoLlama-instruct surpasses the previous state-of-the-art on these benchmarks.
166
 
167
+ ### Qualitative evaluation
 
 
 
 
 
 
 
 
168
 
169
+ In our paper, we also provide an additional qualitative evaluation of all models - which we empirically find more reliable.
170
+ For details, we refer to the paper and to our benchmark [ChocoLlama-Bench](https://huggingface.co/datasets/ChocoLlama/ChocoLlama-Bench).
171
 
172
+ ### Compute Infrastructure
173
 
174
+ All ChocoLlama models have been trained on the compute cluster provided by the [Flemish Supercomputer Center (VSC)](https://www.vscentrum.be/). We used 8 to 16 NVIDIA H100 GPU's with 80 GB of VRAM.