---
language:
- nl
license: cc-by-nc-4.0
library_name: peft
tags:
- generated_from_trainer
- alpaca
- Transformers
- PolyLM
- text-generation-inference
datasets:
- BramVanroy/alpaca-cleaned-dutch
inference: false
base_model: DAMO-NLP-MT/polylm-13b-fine-grained-shards
pipeline_tag: text-generation
model-index:
- name: polylm_13b_ft_alpaca_clean_dutch
  results: []
---

# polylm_13b_ft_alpaca_clean_dutch

## Model description

This adapter model is a fine-tuned version of [DAMO-NLP-MT/polylm-13b](https://huggingface.co/DAMO-NLP-MT/polylm-13b).
It achieves the following results on the evaluation set:
- Loss: 1.3839

Finetuning was performed on the Dutch [BramVanroy/alpaca-cleaned-dutch](https://www.huggingface.co/datasets/BramVanroy/alpaca-cleaned-dutch) dataset, which contains 52K records of instruction-following data translated from English to Dutch.

See [DAMO-NLP-MT/polylm-13b-fine-grained-shards](https://huggingface.co/DAMO-NLP-MT/polylm-13b-fine-grained-shards) for all information about the base model.

## Model usage

Below is a basic example of how to use the finetuned model.

```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

model_name = "robinsmits/polylm_13b_ft_alpaca_clean_dutch"

# The slow (non-fast) tokenizer is used for PolyLM.
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast = False, legacy = False)

# Load the base model together with the adapter weights in 4-bit quantization.
model = AutoPeftModelForCausalLM.from_pretrained(model_name, device_map = "auto", load_in_4bit = True, torch_dtype = torch.bfloat16)

prompt = "### Instructie:\nWat zijn de drie belangrijkste softwareonderdelen die worden gebruikt bij webontwikkeling?\n\n### Antwoord:\n"

inputs = tokenizer(prompt, return_tensors = "pt")
sample = model.generate(input_ids = inputs.input_ids.cuda(),
                        attention_mask = inputs.attention_mask.cuda(),
                        max_new_tokens = 128,
                        do_sample = True,
                        top_p = 0.85,
                        top_k = 50,
                        temperature = 0.5,
                        repetition_penalty = 1.2,
                        num_return_sequences = 1,
                        pad_token_id = tokenizer.eos_token_id,
                        forced_eos_token_id = tokenizer.eos_token_id)
output = tokenizer.decode(sample[0], skip_special_tokens = True)

# Print only the generated answer, without the prompt itself.
print(output.split(prompt)[1])
```

The prompt and the generated output for the above example will look similar to the output shown below.

```
### Instructie:
Wat zijn de drie belangrijkste softwareonderdelen die worden gebruikt bij webontwikkeling?

### Antwoord:

De drie belangrijkste onderdelen van webontwikkeling waaraan wordt gedacht tijdens het ontwerpen en bouwen van websites, zijn HTML (Hypertext Markup Language), CSS (Cascading Style Sheets) en JavaScript. Het is belangrijk om te weten hoe deze drie elementen werken samen met elkaar voordat je een website ontwikkelt of verbeterd kunt maken. Bovendien moet je begrijpen wat elk onderdeel doet wanneer het op dezelfde pagina staat ingebed in verschillende contexten. Dit zal helpen bij het creëren van consistente inhoud zonder fouten zoals verwarring tussen browsers of compatibiliteitsproblemen. Ten slotte kan kennis over de functies
```
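
The model expects the Dutch Alpaca prompt template shown above. A minimal helper for formatting new instructions could look like the sketch below; the function name and the example instruction are illustrative only, and it covers just the no-input variant of the template used in the usage example.

```python
def build_prompt(instructie: str) -> str:
    # Dutch Alpaca-style template matching the usage example above.
    return f"### Instructie:\n{instructie}\n\n### Antwoord:\n"

# Hypothetical example instruction (Dutch): "Give three tips to improve the load time of a website."
prompt = build_prompt("Geef drie tips om de laadtijd van een website te verbeteren.")
```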

For more extensive usage examples and many generated samples (both good and bad), see the following [Inference Notebook](https://github.com/RobinSmits/Dutch-LLMs/blob/main/PolyLM_13B_Alpaca_Clean_Dutch_Inference.ipynb).

## Intended uses & limitations

The PolyLM-13B base model was trained on 18 languages, Dutch among them, with the primary goal of creating a multilingual open LLM.
A diverse combination of multilingual datasets was used for its training.

The generated output and the performance of this model for Dutch are very likely not always comparable to those of the various Open-Llama models that have been finetuned on English Alpaca datasets.

The primary intention of this finetuned model is to explore and research the use of the Dutch language in combination with an open LLM.

## Bias, Risks, and Limitations

The information below is copied from the base model's [official model card](https://arxiv.org/pdf/2307.06018.pdf) and also applies to this finetuned model.

> Our contributions are fully methodological: adding the support of multilingualism to LLM during training and SFT phases. It is unavoidable that PolyLM might exhibit several common deficiencies of language models, e.g. hallucination and toxicity. PolyLM should not be used directly in any application, without a prior assessment of safety and fairness concerns specific to the application.

## Training and evaluation data

This model was trained on the [BramVanroy/alpaca-cleaned-dutch](https://www.huggingface.co/datasets/BramVanroy/alpaca-cleaned-dutch) dataset.

The dataset is the Dutch translation of the English Alpaca Cleaned instruction dataset.
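
To inspect the training data, the dataset can be loaded directly from the Hugging Face Hub with the `datasets` library, as in this minimal sketch (the `train` split name is an assumption).

```python
from datasets import load_dataset

# Load the Dutch Alpaca instruction dataset from the Hugging Face Hub.
dataset = load_dataset("BramVanroy/alpaca-cleaned-dutch", split = "train")
print(dataset)
print(dataset[0])
```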

Based on the dataset license, only non-commercial use is allowed; commercial use is strictly forbidden.

## Training procedure

This model was finetuned with a QLoRA setup on a Google Colab A100 GPU in about 3.5 hours.

The notebook used for training can be found here: [Training Notebook](https://github.com/RobinSmits/Dutch-LLMs/blob/main/PolyLM_13B_Alpaca_Clean_Dutch_Qlora.ipynb)
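
A QLoRA setup combines 4-bit quantization of the base model with a LoRA adapter trained via PEFT. The sketch below shows what such a `LoraConfig` can look like; all values here are assumptions for illustration only, and the actual adapter settings are defined in the training notebook and the adapter's `adapter_config.json`.

```python
from peft import LoraConfig

# Assumption: illustrative LoRA settings only; the real rank, alpha, dropout and
# target modules used for this adapter are in the training notebook / adapter_config.json.
lora_config = LoraConfig(
    r = 16,
    lora_alpha = 32,
    lora_dropout = 0.05,
    bias = "none",
    task_type = "CAUSAL_LM",
)
```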

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 4
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 16
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 64
- num_epochs: 1
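
These settings roughly correspond to a `transformers.TrainingArguments` configuration like the sketch below. The output directory, the `bf16` flag and the logging interval are assumptions (the evaluation interval of 128 steps matches the results table below); the exact configuration is in the training notebook.

```python
from transformers import TrainingArguments

# Sketch of the reported hyperparameters; see the training notebook for the exact setup.
training_args = TrainingArguments(
    output_dir = "polylm_13b_ft_alpaca_clean_dutch",  # assumption: illustrative output path
    learning_rate = 5e-5,
    per_device_train_batch_size = 4,
    per_device_eval_batch_size = 8,
    gradient_accumulation_steps = 16,                 # 4 * 16 = total train batch size of 64
    num_train_epochs = 1,
    lr_scheduler_type = "linear",
    warmup_steps = 64,
    seed = 42,
    bf16 = True,                                      # assumption: matches the bfloat16 compute dtype below
    evaluation_strategy = "steps",
    eval_steps = 128,                                 # matches the evaluation steps in the results table
    logging_steps = 128,                              # assumption
)
```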

The following bitsandbytes quantization config was used during training:
- load_in_8bit: False
- load_in_4bit: True
- llm_int8_threshold: 6.0
- llm_int8_skip_modules: None
- llm_int8_enable_fp32_cpu_offload: False
- llm_int8_has_fp16_weight: False
- bnb_4bit_quant_type: nf4
- bnb_4bit_use_double_quant: True
- bnb_4bit_compute_dtype: bfloat16

### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 1.4626        | 0.16  | 128  | 1.4613          |
| 1.4027        | 0.33  | 256  | 1.4235          |
| 1.4002        | 0.49  | 384  | 1.4054          |
| 1.3857        | 0.66  | 512  | 1.3951          |
| 1.3798        | 0.82  | 640  | 1.3870          |
| 1.3629        | 0.99  | 768  | 1.3839          |

### Framework versions

- Transformers 4.34.0
- Pytorch 2.0.1+cu118
- Datasets 2.14.5
- Tokenizers 0.14.1
- PEFT 0.5.0