ChocoLlama
/

ChocoLlama-2-7B-instruct

@@ -1,38 +1,34 @@
 ---
-license: llama2
-tags:
-- alignment-handbook
-- trl
-- dpo
-- generated_from_trainer
-base_model: llama-2-nl/Llama-2-7b-hf-lora-original-sft
 datasets:
 - BramVanroy/ultra_feedback_dutch
-model-index:
-- name: Llama-2-7b-hf-lora-original-it
-  results: []
 ---
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-# ChocoLlama-2-7B-instruct
-This model is a fine-tuned version of [ChocoLlama/ChocoLlama-2-7B-base](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-base) on the BramVanroy/ultra_feedback_dutch dataset.
-It achieves the following results on the evaluation set:
-- Loss: 0.3536
-- Rewards/chosen: 0.1143
-- Rewards/rejected: -0.9295
-- Rewards/accuracies: 0.9396
-- Rewards/margins: 1.0437
-- Logps/rejected: -547.4578
-- Logps/chosen: -600.8353
-- Logits/rejected: -0.8732
-- Logits/chosen: -0.9594
-# Use the model
-```
 from transformers import AutoTokenizer, AutoModelForCausalLM
 tokenizer = AutoTokenizer.from_pretrained('ChocoLlama/ChocoLlama-2-7B-instruct')
@@ -64,26 +60,68 @@ outputs = model.generate(
 )
 response = outputs[0][input_ids.shape[-1]:]
 print(tokenizer.decode(response, skip_special_tokens=True))
 ```
-## Model description
-More information needed
-## Intended uses & limitations
-More information needed
-## Training and evaluation data
-More information needed
-## Training procedure
-### Training hyperparameters
-The following hyperparameters were used during training:
 - learning_rate: 5e-07
 - train_batch_size: 4
 - eval_batch_size: 4
@@ -98,22 +136,39 @@ The following hyperparameters were used during training:
 - lr_scheduler_warmup_ratio: 0.1
 - num_epochs: 1
-### Training results
-| Training Loss | Epoch  | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
-|:-------------:|:------:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
-| 0.5984        | 0.1327 | 100  | 0.5904          | 0.0549         | -0.1735          | 0.9030             | 0.2283          | -539.8975      | -601.4293    | -1.1606         | -1.1395       |
-| 0.4622        | 0.2653 | 200  | 0.4581          | 0.1134         | -0.4980          | 0.9351             | 0.6113          | -543.1426      | -600.8441    | -1.2714         | -1.2180       |
-| 0.3934        | 0.3980 | 300  | 0.3959          | 0.1263         | -0.7212          | 0.9366             | 0.8475          | -545.3747      | -600.7144    | -1.0528         | -1.0755       |
-| 0.3629        | 0.5307 | 400  | 0.3674          | 0.1170         | -0.8608          | 0.9381             | 0.9777          | -546.7705      | -600.8080    | -1.1109         | -1.1154       |
-| 0.3556        | 0.6633 | 500  | 0.3561          | 0.1136         | -0.9146          | 0.9388             | 1.0282          | -547.3090      | -600.8419    | -0.8266         | -0.9289       |
-| 0.3488        | 0.7960 | 600  | 0.3540          | 0.1104         | -0.9310          | 0.9410             | 1.0415          | -547.4734      | -600.8737    | -1.0676         | -1.0877       |
-| 0.3563        | 0.9287 | 700  | 0.3540          | 0.1166         | -0.9259          | 0.9396             | 1.0425          | -547.4224      | -600.8121    | -0.8736         | -0.9600       |
-### Framework versions
-- Transformers 4.40.1
-- Pytorch 2.1.2+cu121
-- Datasets 2.19.0
-- Tokenizers 0.19.1

 ---
+language:
+- nl
+license: cc-by-nc-4.0
+base_model: ChocoLlama/ChocoLlama-2-7B-base
 datasets:
+- BramVanroy/ultrachat_200k_dutch
+- BramVanroy/stackoverflow-chat-dutch
+- BramVanroy/alpaca-cleaned-dutch
+- BramVanroy/dolly-15k-dutch
+- BramVanroy/no_robots_dutch
 - BramVanroy/ultra_feedback_dutch
 ---
+<p align="center" style="margin:0;padding:0">
+<img src="./chocollama_logo.png" alt="ChocoLlama logo" width="500" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
+</p>
+<div style="margin:auto; text-align:center">
+<h1 style="margin-bottom: 0">ChocoLlama</h1>
+<em>A Llama-2/3-based family of Dutch language models</em>
+</div>
+## ChocoLlama-2-7B-instruct: Getting Started
+We here present **ChocoLlama-2-7B-instruct**, an instruction-tuned version of ChocoLlama-2-7B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets, using SFT followed by DPO.
+Its base model, [ChocoLlama-2-7B-base](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-base), is a language-adapted version of Meta's Llama-2-7b, fine-tuned on a Dutch dataset of 104GB using LoRa.
+Use the code below to get started with the model.
+```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
 tokenizer = AutoTokenizer.from_pretrained('ChocoLlama/ChocoLlama-2-7B-instruct')
 )
 response = outputs[0][input_ids.shape[-1]:]
 print(tokenizer.decode(response, skip_special_tokens=True))
 ```
+Note that the datasets used for instruction-tuning were translated using GPT-3.5/4, which means that this instruction-tuned model can not be used for commercial purposes.
+Hence, for any commercial applications, we recommend finetuning the base model on your own Dutch data.
+## Model Details
+ChocoLlama is a family of open LLM's specifically adapted to Dutch, contributing to the state-of-the-art of Dutch open LLM's in their weight class.
+We provide 6 variants (of which 3 base and 3 instruction-tuned models):
+- **ChocoLlama-2-7B-base** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-base)): A language-adapted version of Meta's Llama-2-7b, fine-tuned on a Dutch dataset of 104GB using LoRa.
+- **ChocoLlama-2-7B-instruct** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-instruct)): An instruction-tuned version of ChocoLlama-2-7B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets, using SFT followed by DPO.
+- **ChocoLlama-2-7B-tokentrans-base** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-tokentrans-base)): A language-adapted version of Meta's Llama-2-7b, using a Dutch RoBERTa-based tokenizer. The token embeddings of this model were reinitialized using the token translation algorithm proposed by [Remy et al.](https://arxiv.org/pdf/2310.03477). The model was subsequently fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRa.
+- **ChocoLlama-2-7B-tokentrans-instruct** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-tokentrans-instruct)): An instruction-tuned version of ChocoLlama-2-7B-tokentrans-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.
+- **Llama-3-ChocoLlama-8B-base** ([link](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-base)): A language-adapted version of Meta's Llama-8-8B, fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRa.
+- **Llama-3-ChocoLlama-instruct** ([link](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-instruct)): An instruction-tuned version of Llama-3-ChocoLlama-8B-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.
+For benchmark results for all models, including compared to their base models and other Dutch LLMs, we refer to our paper [here](some_url).
+### Model Description
+- **Developed by:** [Matthieu Meeus](https://huggingface.co/matthieumeeus97), [Anthony Rathé](https://huggingface.co/anthonyrathe)
+- **Funded by:** [Vlaams Supercomputer Centrum](https://www.vscentrum.be/), through a grant of apx. 40K GPU hours (NVIDIA H100-80GB)
+- **Language(s):** Dutch
+- **License:** [Llama-2 Community License](https://ai.meta.com/llama/license/)
+- **Finetuned from model:** [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)
+### Model Sources
+- **Repository:** Will be released soon.
+- **Paper:** Will be released soon.
+## Uses
+### Direct Use
+This is an instruction-tuned (SFT + DPO) Dutch model, optimized for Dutch language generation in conversational settings.
+For optimal behavior, we advice to only use the model with the correct chat template (see Python code above), potentially supported by a system prompt.
+### Out-of-Scope Use
+Use-cases requiring understanding or generation of text in languages other than Dutch: the dataset on which this model was fine-tuned does not contain data in languages other than Dutch, hence we expect significant catastrophic forgetting to have occured for English, which is the language Llama-2 was originally trained for.
+## Bias, Risks, and Limitations
+We have taken care to include only widely used and high-quality data in our dataset. Some of this data has been filtered by the original creators.
+However we did not explicitly conduct any additional filtering of this dataset with regards to biased or otherwise harmful content.
+## Training Details
+We adopt the same strategy as used to align GEITje-7B to [GEITje-7B-ultra](https://huggingface.co/BramVanroy/GEITje-7B-ultra).
+First, we apply supervised finetuning (SFT), utilizing the data made available by [Vanroy](https://arxiv.org/pdf/2312.12852):
+- [BramVanroy/ultrachat_200k_dutch](https://huggingface.co/datasets/BramVanroy/ultrachat_200k_dutch)
+- [BramVanroy/no_robots_dutch](https://huggingface.co/datasets/BramVanroy/no_robots_dutch)
+- [BramVanroy/stackoverflow-chat-dutch](https://huggingface.co/datasets/BramVanroy/stackoverflow-chat-dutch)
+- [BramVanroy/alpaca-cleaned-dutch](https://huggingface.co/datasets/BramVanroy/alpaca-cleaned-dutch)
+- [BramVanroy/dolly-15k-dutch](https://huggingface.co/datasets/BramVanroy/dolly-15k-dutch)
+Next, we apply Direct Preference Optimization (DPO) to the SFT version of all the pretrained models we here develop,
+now utilizing a Dutch version of the data used to train Zephyr-7B-$\beta$, [BramVanroy/ultra_feedback_dutch](https://huggingface.co/datasets/BramVanroy/ultra_feedback_dutch).
+For both the SFT and DPO stage, we update all model weights and apply the same set of hyperparameters to all models as used in GEITje-7B-ultra:
 - learning_rate: 5e-07
 - train_batch_size: 4
 - eval_batch_size: 4
 - lr_scheduler_warmup_ratio: 0.1
 - num_epochs: 1
+Further, we leverage the publicly available [alignment handbook](https://github.com/huggingface/alignment-handbook) and use a set of 4 NVIDIA A100 (80 GB RAM) for both stages.
+## Evaluation
+### Quantitative evaluation
+We have evaluated our models on several industry-standard Dutch benchmarks, translated from their original versions. The results can be found in the table below, together with results from several other prominent Dutch models.
+| Model                                        | ARC            | HellaSwag      | MMLU           | TruthfulQA     | Avg.           |
+|----------------------------------------------|----------------|----------------|----------------|----------------|----------------|
+| **Llama-3-ChocoLlama-instruct**        | **0.48** | **0.66** | **0.49** | **0.49** | **0.53** |
+| llama-3-8B-rebatch                           | 0.44           | 0.64           | 0.46           | 0.48           | 0.51           |
+| llama-3-8B-instruct                          | 0.47           | 0.59           | 0.47           | 0.52           | 0.51           |
+| llama-3-8B                                   | 0.44           | 0.64           | 0.47           | 0.45           | 0.5            |
+| Reynaerde-7B-Chat                            | 0.44           | 0.62           | 0.39           | 0.52           | 0.49           |
+| **Llama-3-ChocoLlama-base** | **0.45** | **0.64** | **0.44** | **0.44** | **0.49** |
+| zephyr-7b-beta                               | 0.43           | 0.58           | 0.43           | 0.53           | 0.49           |
+| geitje-7b-ultra                              | 0.40           | 0.66           | 0.36           | 0.49           | 0.48           |
+| **ChocoLlama-2-7B-tokentrans-instruct** | **0.45** | **0.62** | **0.34** | **0.42** | **0.46** |
+| mistral-7b-v0.1                              | 0.43           | 0.58           | 0.37           | 0.45           | 0.46           |
+| **ChocoLlama-2-7B-tokentrans-base** | **0.42** | **0.61** | **0.32** | **0.43** | **0.45** |
+| **ChocoLlama-2-7B-instruct** | **0.36** | **0.57** | **0.33** | **0.45** | **0.43 |
+| **ChocoLlama-2-7B-base** | **0.35** | **0.56** | **0.31** | **0.43** | **0.41** |
+| llama-2-7b-chat-hf                           | 0.36           | 0.49           | 0.33           | 0.44           | 0.41           |
+| llama-2-7b-hf                                | 0.36           | 0.51           | 0.32           | 0.41           | 0.40           |
+On average, Llama-3-ChocoLlama-instruct surpasses the previous state-of-the-art on these benchmarks.
+### Qualitative evaluation
+In our paper, we also provide an additional qualitative evaluation of all models - which we empirically find more reliable.
+For details, we refer to the paper and to our benchmark [ChocoLlama-Bench](https://huggingface.co/datasets/ChocoLlama/ChocoLlama-Bench).
+### Compute Infrastructure
+All ChocoLlama models have been trained on the compute cluster provided by the [Flemish Supercomputer Center (VSC)](https://www.vscentrum.be/). We used 8 to 16 NVIDIA H100 GPU's with 80 GB of VRAM.