---
language:
- nl
license: cc-by-nc-4.0
library_name: peft
tags:
- generated_from_trainer
- alpaca
- Transformers
- PolyLM
- text-generation-inference
datasets:
- BramVanroy/alpaca-cleaned-dutch
inference: false
base_model: DAMO-NLP-MT/polylm-1.7b
pipeline_tag: text-generation
model-index:
- name: polylm_1.7b_ft_alpaca_clean_dutch
results: []
---
# polylm_1.7b_ft_alpaca_clean_dutch
## Model description
This adapter model is a fine-tuned version of [DAMO-NLP-MT/polylm-1.7b](https://huggingface.co/DAMO-NLP-MT/polylm-1.7b).
It achieves the following results on the evaluation set:
- Loss: 1.8483
Finetuning was performed on the Dutch [BramVanroy/alpaca-cleaned-dutch](https://www.huggingface.co/datasets/BramVanroy/alpaca-cleaned-dutch) dataset, which contains about 52K records of instruction-following data translated from English to Dutch.
See [DAMO-NLP-MT/polylm-1.7b](https://huggingface.co/DAMO-NLP-MT/polylm-1.7b) for all information about the base model.
## Model usage
The following is a basic example of how to use the finetuned model.
```
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

model_name = "robinsmits/polylm_1.7b_ft_alpaca_clean_dutch"

# Load the tokenizer and the adapter model in 4-bit precision (requires a CUDA GPU).
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast = False, legacy = False)
model = AutoPeftModelForCausalLM.from_pretrained(model_name, device_map = "auto", load_in_4bit = True, torch_dtype = torch.bfloat16)

# The prompt consists of an instruction header, the instruction, and an answer header.
prompt = "### Instructie:\nWat zijn de drie belangrijkste softwareonderdelen die worden gebruikt bij webontwikkeling?\n\n### Antwoord:\n"

inputs = tokenizer(prompt, return_tensors = "pt")
sample = model.generate(input_ids = inputs.input_ids.cuda(),
                        attention_mask = inputs.attention_mask.cuda(),
                        max_new_tokens = 128,
                        do_sample = True,
                        top_p = 0.85,
                        top_k = 50,
                        temperature = 0.5,
                        repetition_penalty = 1.2,
                        length_penalty = -1.0,
                        num_return_sequences = 1,
                        pad_token_id = tokenizer.eos_token_id,
                        forced_eos_token_id = tokenizer.eos_token_id)
output = tokenizer.decode(sample[0], skip_special_tokens = True)

# Print only the generated text that follows the prompt.
print(output.split(prompt)[1])
```
The prompt and generated output for the example above are similar to the output shown below.
```
### Instructie:
Wat zijn de drie belangrijkste softwareonderdelen die worden gebruikt bij webontwikkeling?
### Antwoord:
De drie belangrijkste softwareonderdelen die worden gebruikt in webontwikkeling zijn HTML, CSS en Javascript.HTML is het hoofdbestand voor alle inhoud op een website.CSS is het hoofdbestand voor decoraties en scripts om te gebruiken zoals JavaScript en PHP.Javascript wordt meestal gebruikt om verschillende functies uit te voeren of het script te manipuleren.Het laatste bestand maakt het mogelijk om code te schrijven dat aan uw website gekoppeld kan worden door middel van enkele woorden. Daarnaast kunnen er ook andere bestanden nodig zijn als gevolg van gebruik van meerdere servers.Een voorbeeld hiervan zou zijn wanneer u bijvoorbeeld een blog-website
```
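
The prompt in the example follows a fixed template: an `### Instructie:` header, the instruction, a blank line, and an `### Antwoord:` header. Below is a minimal sketch of helpers for building such prompts and for extracting the answer from the decoded output. The names `PROMPT_TEMPLATE`, `build_prompt` and `extract_answer` are hypothetical and not part of the model repository. The template is inferred from the single example above.

```python
# Hypothetical helpers; the template is copied from the usage example above.
PROMPT_TEMPLATE = "### Instructie:\n{instruction}\n\n### Antwoord:\n"

def build_prompt(instruction: str) -> str:
    """Wrap an instruction in the prompt template used in the example."""
    return PROMPT_TEMPLATE.format(instruction=instruction.strip())

def extract_answer(decoded_output: str, prompt: str) -> str:
    """Return the text after the prompt, or the full output if the prompt is absent."""
    _, found, answer = decoded_output.partition(prompt)
    return answer.strip() if found else decoded_output.strip()

prompt = build_prompt("Wat zijn de drie belangrijkste softwareonderdelen die worden gebruikt bij webontwikkeling?")
```

Unlike `output.split(prompt)[1]`, which raises an `IndexError` when the decoded text does not contain the prompt verbatim, `extract_answer` falls back to returning the full decoded text.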
For more extensive usage and many generated samples (both good and bad), see the [Inference Notebook](https://github.com/RobinSmits/Dutch-LLMs/blob/main/PolyLM_1_7B_Alpaca_Clean_Dutch_Inference.ipynb).
## Intended uses & limitations
The PolyLM-1.7B model was trained on 18 languages, Dutch among them, with the primary aim of creating a multilingual open LLM.
A diverse combination of multilingual datasets was used for training.
The generated output and performance of this model for Dutch are likely not always comparable to those of the various Open-Llama models that have been finetuned on English Alpaca datasets.
The primary intention of this finetuned model is to explore and research the use of the Dutch language in combination with an open LLM.
## Bias, Risks, and Limitations
The information below is quoted from the base model's [official model card](https://arxiv.org/pdf/2307.06018.pdf) and also applies to the finetuned model:
> Our contributions are fully methodological: adding the support of multilingualism to LLM during training and SFT phases. It is unavoidable that PolyLM might exhibit several common deficiencies of language models, e.g. hallucination and toxicity. PolyLM should not be used directly in any application, without a prior assessment of safety and fairness concerns specific to the application.
## Training and evaluation data
This model was trained on the [BramVanroy/alpaca-cleaned-dutch](https://www.huggingface.co/datasets/BramVanroy/alpaca-cleaned-dutch) dataset.
The dataset is the Dutch translation of the English Alpaca Cleaned instruction dataset.
Under the dataset license, only non-commercial use is allowed; commercial use is strictly forbidden.
## Training procedure
This model was finetuned with a QLoRA setup on a Google Colab A100 GPU in about 1.5 hours.
The notebook used for training can be found here: [Training Notebook](https://github.com/RobinSmits/Dutch-LLMs/blob/main/PolyLM_1_7B_Alpaca_Clean_Dutch_Qlora.ipynb)
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 64
- num_epochs: 2
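
The `total_train_batch_size` listed above is the product of the per-device batch size and the number of gradient accumulation steps. The sketch below reproduces that arithmetic. The variable `num_devices` is an assumption (a single GPU, in line with the single Colab A100 used for training).

```python
# Values copied from the hyperparameter list above.
train_batch_size = 8
gradient_accumulation_steps = 8
num_devices = 1  # assumption: single-GPU training

# Number of examples that contribute to each optimizer step.
total_train_batch_size = train_batch_size * gradient_accumulation_steps * num_devices
print(total_train_batch_size)  # 64
```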
The following bitsandbytes quantization config was used during training:
- load_in_8bit: False
- load_in_4bit: True
- llm_int8_threshold: 6.0
- llm_int8_skip_modules: None
- llm_int8_enable_fp32_cpu_offload: False
- llm_int8_has_fp16_weight: False
- bnb_4bit_quant_type: nf4
- bnb_4bit_use_double_quant: True
- bnb_4bit_compute_dtype: bfloat16
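
The 4-bit settings listed above correspond to keyword arguments of `BitsAndBytesConfig` in Transformers. The sketch below only illustrates that mapping and is not the author's training code, which is in the training notebook. The names `QUANT_SETTINGS` and `build_quant_config` are hypothetical.

```python
# 4-bit settings copied from the list above. The dtype is kept as a string
# so this dictionary can be inspected without importing torch.
QUANT_SETTINGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_compute_dtype": "bfloat16",
}

def build_quant_config():
    """Build a BitsAndBytesConfig from QUANT_SETTINGS (requires torch and transformers)."""
    import torch
    from transformers import BitsAndBytesConfig

    kwargs = dict(QUANT_SETTINGS)
    # Translate the dtype name into the corresponding torch dtype object.
    kwargs["bnb_4bit_compute_dtype"] = getattr(torch, kwargs["bnb_4bit_compute_dtype"])
    return BitsAndBytesConfig(**kwargs)
```

The returned object can be passed as `quantization_config` to `from_pretrained` when loading the base model for QLoRA-style finetuning.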
### Training results
| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 2.1248 | 0.16 | 128 | 2.1129 |
| 2.0512 | 0.33 | 256 | 2.0347 |
| 1.9983 | 0.49 | 384 | 1.9948 |
| 1.9557 | 0.66 | 512 | 1.9655 |
| 1.9583 | 0.82 | 640 | 1.9386 |
| 1.916 | 0.99 | 768 | 1.9177 |
| 1.8671 | 1.15 | 896 | 1.9019 |
| 1.8626 | 1.32 | 1024 | 1.8885 |
| 1.8321 | 1.48 | 1152 | 1.8762 |
| 1.8596 | 1.65 | 1280 | 1.8631 |
| 1.843 | 1.81 | 1408 | 1.8539 |
| 1.8333 | 1.98 | 1536 | 1.8483 |
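
If the reported losses are mean token-level cross-entropy values in nats (the usual convention for causal language modeling with the Transformers `Trainer`, but an assumption here), they can be converted to perplexity with `exp(loss)`. A small sketch using the first and last validation losses from the table:

```python
import math

# First and last validation losses from the table above.
first_val_loss = 2.1129
final_val_loss = 1.8483

def perplexity(loss: float) -> float:
    """Perplexity for a mean cross-entropy loss in nats (assumed unit)."""
    return math.exp(loss)

first_ppl = perplexity(first_val_loss)
final_ppl = perplexity(final_val_loss)
print(f"validation perplexity: {first_ppl:.2f} -> {final_ppl:.2f}")
```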
### Framework versions
- Transformers 4.31.0
- Pytorch 2.0.1+cu118
- Datasets 2.13.1
- Tokenizers 0.13.3
- PEFT 0.4.0