davda54 committed
Commit
68ce748
1 Parent(s): d5c0069

Update README.md

Files changed (1)
  1. README.md +32 -344
README.md CHANGED
@@ -39,360 +39,48 @@ The corpus was compiled by this process:
39
  4. Since most of that dataset contains only a single dialogue turn, we generated more turns using [Mixtral-8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1).
40
  5. Finally, we translated the resulting dataset into Bokmål and Nynorsk using [NorMistral-7b-warm](https://huggingface.co/norallm/normistral-7b-warm).
41
 
42
- ## About the base model
43
-
44
- NorMistral-7b-warm is a large Norwegian language model initialized from [Mistral-7b-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) and
45
- continuously pretrained on a total of 260 billion subword tokens (using six repetitions of open Norwegian texts).
46
-
47
- This model is a part of the NORA.LLM family developed in collaboration between [the Language Technology Group at the University of Oslo](https://huggingface.co/ltg), [the High Performance Language Technologies (HPLT) project](https://hplt-project.org/), [the National Library of Norway](https://huggingface.co/NbAiLab), and [the University of Turku](https://huggingface.co/TurkuNLP).
48
- All the models are pre-trained on the same dataset and with the same tokenizer.
49
- NorMistral-7b-warm has over 7 billion parameters and is based on [the Mistral architecture](https://huggingface.co/mistralai/Mistral-7B-v0.1).
50
-
51
- The NORA.LLM language model family includes (as of now):
52
- - [**NorMistral-7b-warm**](https://huggingface.co/norallm/normistral-7b-warm) -- an LLM initialized from [Mistral-7b-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) and continuously pretrained on Norwegian data;
53
- - [**NorMistral-7b-scratch**](https://huggingface.co/norallm/normistral-7b-scratch) -- a Mistral-based LLM pretrained from scratch on Norwegian data;
54
- - [**NorBLOOM-7b-scratch**](https://huggingface.co/norallm/NorBLOOM-7b-scratch) -- a BLOOM-based LLM pretrained from scratch on Norwegian data.
55
-
56
-
57
- *Disclaimer: This model is pretrained on raw (mostly web-based) textual data.
58
- It is not finetuned to follow instructions, and it can generate harmful completions after inappropriate user prompts.
59
- It is primarily intended for research purposes.*
60
-
61
- _____
62
- ## Pretraining corpus
63
-
64
- The model is pretrained exclusively on publicly available data. We combine the resources from [the public part of the NCC corpus](https://huggingface.co/datasets/NbAiLab/NCC), from [the cleaned HPLT corpus](https://hplt-project.org/datasets/v1.2), and from [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX).
65
- This resulted in over 34B subword tokens of Norwegian (Bokmål or Nynorsk) in total, which amounts to about 26.7B whitespace-separated tokens.
66
- We also augment the corpus with [Starcoder](https://huggingface.co/datasets/vikp/starcoder_filtered); 20% of the 260B tokens are sampled from this code corpus.
67
- The natural language data is repeated six times to get the pretraining budget of 260B tokens, in accordance with findings from [Muennighoff et al. (2023)](https://neurips.cc/virtual/2023/poster/70706).
68
-
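As a quick sanity check of the stated token budget, here is a minimal back-of-the-envelope sketch; it assumes the 20% code share is measured against the full 260B-token budget.

```python
# Back-of-the-envelope check of the token budget described above.
# Assumption: the 20% Starcoder share is measured against the full 260B-token budget.
total_budget = 260e9                         # total pretraining tokens
code_tokens = 0.20 * total_budget            # ≈ 52B tokens sampled from Starcoder
natural_tokens = total_budget - code_tokens  # ≈ 208B Norwegian tokens
per_repetition = natural_tokens / 6          # six repetitions of the Norwegian data
print(f"{per_repetition / 1e9:.1f}B")        # ≈ 34.7B, consistent with "over 34B subword tokens"
```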
69
- _____
70
- ## Model details
71
-
72
- **Model Developers:** Language Technology Group at the University of Oslo.
73
-
74
- **Variations:** NorMistral is currently published as two 7B variants: one trained entirely from *scratch* and one *warm*-started from the Mistral model.
75
-
76
- **Input:** Textual input.
77
-
78
- **Output:** Generated text.
79
-
80
- **Model Architecture:** NorMistral is an auto-regressive language model that uses an optimized transformer architecture based on the Mistral/Llama language models.
81
-
82
- ||Training Data|Params|Context Length|Tokens|LR|
83
- |---|---|---|---|---|---|
84
- |NorMistral-7b-warm|NCC+HPLT+CulturaX+Starcoder|7B|2k|260B|1.0 x 10<sup>-4</sup>|
85
- |NorMistral-7b-scratch|NCC+HPLT+CulturaX+Starcoder|7B|2k|260B|3.0 x 10<sup>-4</sup>|
86
- |NorBLOOM-7b-scratch|NCC+HPLT+CulturaX+Starcoder|7B|2k|260B|1.2 x 10<sup>-4</sup>|
87
-
88
- **Tokenizer:** Byte-based BPE tokenizer trained on the same Norwegian corpus as this model. The vocabulary size is 32,768 tokens.
89
-
90
- **Training FLOPs:** approximately 1.22e+22 FLOPs, calculated as in [Chowdhery et al. (2022)](https://arxiv.org/abs/2204.02311).
91
-
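For reference, this figure can be roughly reproduced with the per-token approximation from Chowdhery et al. (2022). The parameter count and model shape below are the standard Mistral-7B configuration and are assumptions, not values taken from this card.

```python
# Rough reconstruction of the training-FLOPs estimate using the per-token approximation
# from Chowdhery et al. (2022): FLOPs/token ≈ 6·N + 12·L·d_model·seq_len.
# The parameter count and shape below are the standard Mistral-7B configuration (assumptions).
N = 7.24e9        # parameters
L = 32            # transformer layers
d_model = 4096    # hidden size
seq_len = 2048    # context length
tokens = 260e9    # pretraining tokens

total_flops = (6 * N + 12 * L * d_model * seq_len) * tokens
print(f"{total_flops:.2e}")   # ≈ 1.21e+22, close to the reported 1.22e+22
```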
92
- **Model Dates:** The models were pretrained between December 2023 and January 2024.
93
-
94
- **Status:** These are only pretrained language models; instruction-finetuned models will follow soon.
95
-
96
- **License:** [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)
97
-
98
- **Research Paper:** Forthcoming
99
-
100
- _____
101
- ## Initial evaluation
102
-
103
- *Disclaimer: our model evaluation is ongoing and is not claimed to be exhaustive. We provide our initial evaluation results on standard natural language understanding and generation tasks, and our evaluation design will be extended.
104
- The user should perform evaluation for their particular model application scenario, including safety and bias evaluations.*
105
-
106
- The perplexity on the heldout [validation set from the Norwegian Colossal Corpus (NCC)](https://huggingface.co/datasets/NbAiLab/NCC) is 7.43 and the final training perplexity is 4.76.
107
-
108
- Our initial downstream evaluation is conducted on reading comprehension, sentiment analysis, grammatical error correction, and machine translation tasks using open-source peer-reviewed datasets and benchmarks in native Norwegian.
109
- We release [our codebase here](https://github.com/ltgoslo/norallm). We compare against other pretrained generative language models that officially support Norwegian: [NB-GPT-J](https://huggingface.co/NbAiLab/nb-gpt-j-6B), [GPT-Sw3 6.7B](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b), [GPT-Sw3 6.7B v2](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b-v2), and [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b); we also include evaluation of [Mistral-7b-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1).
110
-
111
-
112
- ### Sentiment analysis
113
-
114
- [NoReC](https://huggingface.co/datasets/ltg/norec_sentence) ([Øvrelid et al., 2020](https://aclanthology.org/2020.lrec-1.618/)) is a dataset for sentence-level sentiment analysis derived from the Norwegian Review Corpus [(Velldal et al., 2018)](https://aclanthology.org/L18-1661/).
115
- We use the binary formulation of this task (positive vs. negative).
116
-
117
- <details>
118
- <summary>Method (click to expand)</summary>
119
-
120
- * Evaluation setting: zero-shot and few-shot perplexity-based evaluation (a minimal scoring sketch is given below).
121
- * Prompt: ```"Tekst: {text}\nSentiment:{label}"```, where the ```label``` is either "positiv" or "negativ".
122
- * Few-shot results show the average scores across 5 repetitions
123
- * Evaluation script: https://github.com/ltgoslo/norallm/blob/main/initial_evaluation/sentiment_analysis.py
124
- * Performance metric: macro-averaged F1-score.
125
-
126
- </details>
127
-
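For illustration, here is a minimal sketch of perplexity-based label scoring with this prompt template. It assumes a causal LM `model` and `tokenizer` loaded as in the usage examples in this card; the linked evaluation script is the authoritative implementation, and its scoring details may differ.

```python
import torch

# A minimal sketch of perplexity-based (label-likelihood) scoring with the prompt above.
# `tokenizer` and `model` are assumed to be a causal LM loaded as in the usage examples
# in this card; the linked evaluation script is the authoritative implementation.
@torch.no_grad()
def classify_sentiment(text, labels=("positiv", "negativ")):
    losses = []
    for label in labels:
        prompt = f"Tekst: {text}\nSentiment:{label}"
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
        # Mean cross-entropy of the sequence; a lower loss means a more likely label
        losses.append(model(input_ids, labels=input_ids).loss.item())
    return labels[losses.index(min(losses))]
```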
128
- <details open>
129
- <summary>Macro-averaged F1-scores on the sentence-level sentiment analysis task (NoReC)</summary>
130
-
131
- |Model|0-shot (macro F1)|1-shot (macro F1)|16-shot (macro F1)|
132
- |---|---|---|---|
133
- |NorMistral-7b-warm|60.6|**77.8**|**87.3**|
134
- |NorMistral-7b-scratch|47.3|62.2|80.1|
135
- |NorBLOOM-7b|**75.7**|73.8|65.5|
136
- |NB-GPT-J|48.4|56.5|65.2|
137
- |GPT-Sw3-6.7B|61.5|72.2|76.5|
138
- |GPT-Sw3-6.7B-v2|42.4|69.1|83.4|
139
- |Falcon-7B|53.3|61.6|74.9|
140
- |Mistral-7B-v0.1|70.2|72.9|84.8|
141
-
142
- </details>
143
-
144
-
145
-
146
- ### Reading comprehension
147
-
148
- [NorQuAD](https://huggingface.co/datasets/ltg/norquad) ([Ivanova et al., 2023](https://aclanthology.org/2023.nodalida-1.17/)) is a dataset for extractive question answering in Norwegian designed similarly to [SQuAD (Rajpurkar et al., 2016)](https://aclanthology.org/D16-1264/).
149
-
150
- <details>
151
- <summary>Method (click to expand)</summary>
152
-
153
- * Evaluation setting: zero-shot and few-shot settings via natural language generation using the greedy decoding strategy.
154
- * Prompt: ```"Tittel: {title}\n\nTekst: {text}\n\nSpørsmål: {question}\n\nSvar:{answer}"``` Based on [Brown et al. (2020)](https://arxiv.org/abs/2005.14165).
155
- * Few-shot results show the average scores across 5 repetitions
156
- * Evaluation script: https://github.com/ltgoslo/norallm/blob/main/initial_evaluation/norquad.py
157
- * Performance metrics: macro-averaged F1-score and exact match (EM); a minimal sketch of these metrics is given below.
158
-
159
- </details>
160
-
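For reference, a minimal sketch of SQuAD-style exact match and token-level F1, the metrics reported below; the official evaluation script may apply additional answer normalization.

```python
# A minimal sketch of SQuAD-style exact match and token-level F1; the official
# evaluation script may apply additional answer normalization (punctuation, articles).
def exact_match(prediction: str, gold: str) -> int:
    return int(prediction.strip().lower() == gold.strip().lower())

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Size of the multiset intersection between prediction and gold tokens
    common = sum(min(pred_tokens.count(t), gold_tokens.count(t)) for t in set(pred_tokens))
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```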
161
- <details open>
162
- <summary>Performance results on the extractive question answering task (NorQuAD)</summary>
163
-
164
- |Model|0-shot (F1/EM)|1-shot (F1/EM)|2-shot (F1/EM)|
165
- |---|---|---|---|
166
- |NorMistral-7b-warm|**48.6**/**24.8**|63.6/40.0|66.5/43.8|
167
- |NorMistral-7b-scratch|34.0/15.7|46.5/25.8|48.5/27.8|
168
- |NorBLOOM-7b|35.0/13.3|47.7/28.0|49.3/30.1|
169
- |NB-GPT-J|24.4/6.8|32.8/11.6|35.0/12.3|
170
- |GPT-Sw3-6.7B|46.5/22.0|55.9/32.0|58.1/34.3|
171
- |GPT-Sw3-6.7B-v2|46.9/22.5|61.1/38.9|66.0/44.5|
172
- |Falcon-7B|15.8/7.0|27.3/13.9|27.4/13.1|
173
- |Mistral-7B-v0.1|46.4/22.4|**64.9**/**41.1**|**71.7**/**49.4**|
174
-
175
- </details>
176
-
177
-
178
- ### Grammatical error correction
179
-
180
- [ASK-RAW](https://huggingface.co/datasets/ltg/ask-gec) is a dataset for Norwegian grammatical error correction (GEC) created by [Matias Jentoft (2023)](https://www.duo.uio.no/handle/10852/103885).
181
-
182
- <details>
183
- <summary>Method (click to expand)</summary>
184
-
185
- * Evaluation setting: zero-shot and few-shot settings via natural language generation using the greedy decoding strategy.
186
- * Prompt: ```"Her er eksempler på perfekt korrigering av grammatiske feil:\n\nTekst: {source_text}\nKorreksjon:{target_text}"```
187
- * Few-shot results show the average scores across 5 repetitions
188
- * Evaluation script: https://github.com/ltgoslo/norallm/blob/main/initial_evaluation/gec.py
189
- * Performance metric: ERRANT F_{0.5}; [ERRANT](https://github.com/chrisjbryant/errant/tree/main) identifies edit spans and then calculates the F_{0.5} score between the gold and predicted edits.
190
-
191
- </details>
192
-
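A minimal sketch of how such a few-shot prompt can be assembled; the exact concatenation of the in-context examples is an assumption here, and the linked evaluation script is authoritative.

```python
# Hypothetical helper: assemble the few-shot GEC prompt from (source, corrected) pairs.
# The exact concatenation of in-context examples is an assumption; see the linked script.
def build_gec_prompt(examples, source_text):
    prompt = "Her er eksempler på perfekt korrigering av grammatiske feil:\n\n"
    for src, tgt in examples:
        prompt += f"Tekst: {src}\nKorreksjon:{tgt}\n\n"
    prompt += f"Tekst: {source_text}\nKorreksjon:"
    return prompt
```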
193
- <details open>
194
- <summary>Results on [the ASK corpus](https://huggingface.co/datasets/ltg/ask-gec) (ERRANT F_{0.5})</summary>
195
-
196
- |Model|0-shot (F0.5)|1-shot (F0.5)|32-shot (F0.5)|
197
- |---|---|---|---|
198
- |NorMistral-7b-warm|**40.8**|41.8|48.5|
199
- |NorMistral-7b-scratch|22.1|28.8|42.1|
200
- |NorBLOOM-7b|8.7|24.5|32.0|
201
- |NB-GPT-J|9.1|28.2|30.6|
202
- |GPT-Sw3-6.7B|30.5|42.9|**50.6**|
203
- |GPT-Sw3-6.7B-v2|40.6|**43.4**|49.8|
204
- |Falcon-7B|10.8|12.4|15.5|
205
- |Mistral-7B-v0.1|26.0|27.4|30.6|
206
-
207
- </details>
208
-
209
-
210
- ### Machine translation
211
-
212
- [Tatoeba](https://huggingface.co/datasets/Helsinki-NLP/tatoeba_mt) [(Tiedemann, 2020)](https://aclanthology.org/2020.wmt-1.139/) is a benchmark for machine translation, which includes hundreds of language pairs. We consider six language pairs (English <-> Bokmål, English <-> Nynorsk, and Bokmål <-> Nynorsk).
213
-
214
- <details>
215
- <summary>Method (click to expand)</summary>
216
-
217
- * Evaluation setting: zero-shot and few-shot settings via natural language generation using the greedy decoding strategy.
218
- * Prompt: ```"{source_language}: {source_text}\n{target_language}:{target_text}"```, where the ```source_language``` and ```target_language``` are ```Engelsk```, ```Bokmål```, or ```Nynorsk```. Based on [Garcia et al. (2023)](https://arxiv.org/abs/2302.01398).
219
- * Few-shot results show the average scores across 5 repetitions
220
- * Evaluation script: https://github.com/ltgoslo/norallm/blob/main/initial_evaluation/machine_translation.py
221
- * Performance metrics: BLEU ([Papineni et al., 2002](https://aclanthology.org/P02-1040/)) and chrF++ ([Popović, 2015](https://aclanthology.org/W15-3049/)).
222
-
223
- </details>
224
-
225
- <details open>
226
- <summary>English → Norwegian Bokmål</summary>
227
-
228
- |Model|0-shot (BLEU/chrF++)|1-shot (BLEU/chrF++)|5-shot (BLEU/chrF++)|
229
- |---|---|---|---|
230
- |NorMistral-7b-warm|**55.8**/**70.7**|**56.7**/**71.5**|57.7/72.4|
231
- |NorMistral-7b-scratch|46.4/62.9|50.4/66.3|52.1/67.6|
232
- |NorBLOOM-7b|37.1/53.6|50.1/65.8|52.0/67.6|
233
- |NB-GPT-J|8.6/39.1|35.9/64.5|47.2/68.7|
234
- |GPT-Sw3-6.7B|21.8/55.2|54.5/69.6|**58.6**/**73.2**|
235
- |GPT-Sw3-6.7B-v2|20.6/53.2|51.2/66.6|58.4/73.0|
236
- |Falcon-7B|19.1/40.1|20.6/41.8|22.1/43.6|
237
- |Mistral-7B-v0.1|32.5/51.9|35.4/55.1|36.3/56.0|
238
-
239
-
240
- </details>
241
-
242
- <details open>
243
- <summary>English → Norwegian Nynorsk</summary>
244
-
245
- |Model|0-shot (BLEU/chrF++)|1-shot (BLEU/chrF++)|5-shot (BLEU/chrF++)|
246
- |---|---|---|---|
247
- |NorMistral-7b-warm|**43.6**/**62.0**|**44.2**/**63.2**|44.3/**63.7**|
248
- |NorMistral-7b-scratch|38.0/56.9|39.2/57.9|40.7/59.3|
249
- |NorBLOOM-7b|35.6/54.7|36.6/56.3|38.1/57.4|
250
- |NB-GPT-J|1.7/14.7|6.3/34.1|35.2/60.4|
251
- |GPT-Sw3-6.7B|13.4/44.3|43.6/62.5|**44.5**/63.5|
252
- |GPT-Sw3-6.7B-v2|14.8/45.5|43.7/62.3|44.0/63.6|
253
- |Falcon-7B|6.4/28.6|8.3/30.5|9.3/32.1|
254
- |Mistral-7B-v0.1|11.6/35.7|13.5/38.7|15.0/40.0|
255
-
256
-
257
- </details>
258
-
259
-
260
- <details open>
261
- <summary>Norwegian Bokmål → English</summary>
262
-
263
- |Model|0-shot (BLEU/chrF++)|1-shot (BLEU/chrF++)|5-shot (BLEU/chrF++)|
264
- |---|---|---|---|
265
- |NorMistral-7b-warm|**56.7**/**70.6**|**57.7**/**71.7**|**58.5**/**72.2**|
266
- |NorMistral-7b-scratch|48.1/62.9|51.5/66.6|52.6/67.6|
267
- |NorBLOOM-7b|46.0/61.5|51.3/66.7|51.7/66.9|
268
- |NB-GPT-J|23.9/55.3|32.3/63.1|48.5/68.7|
269
- |GPT-Sw3-6.7B|47.9/67.8|52.4/70.6|50.0/70.7|
270
- |GPT-Sw3-6.7B-v2|38.8/59.6|49.0/68.6|50.7/70.6|
271
- |Falcon-7B|42.4/58.5|47.3/62.3|48.6/63.3|
272
- |Mistral-7B-v0.1|53.8/68.2|54.6/69.0|56.9/70.7|
273
-
274
- </details>
275
-
276
- <details open>
277
- <summary>Norwegian Nynorsk → English</summary>
278
-
279
- |Model|0-shot (BLEU/chrF++)|1-shot (BLEU/chrF++)|5-shot (BLEU/chrF++)|
280
- |---|---|---|---|
281
- |NorMistral-7b-warm|**55.1**/**68.4**|**55.5**/**69.5**|56.0/69.8|
282
- |NorMistral-7b-scratch|47.1/61.9|49.4/64.2|52.3/66.2|
283
- |NorBLOOM-7b|45.0/59.3|48.3/64.0|49.0/64.7|
284
- |NB-GPT-J|2.9/19.5|10.1/41.0|44.4/66.9|
285
- |GPT-Sw3-6.7B|47.8/66.2|49.1/68.1|49.6/69.4|
286
- |GPT-Sw3-6.7B-v2|46.3/67.5|48.9/69.3|**58.2**/**72.8**|
287
- |Falcon-7B|21.6/40.6|31.7/47.4|36.6/57.1|
288
- |Mistral-7B-v0.1|40.7/57.1|46.2/60.7|49.9/63.8|
289
-
290
- </details>
291
-
292
-
293
- <details open>
294
- <summary>Norwegian Bokmål → Norwegian Nynorsk</summary>
295
-
296
- |Model|0-shot (BLEU/chrF++)|1-shot (BLEU/chrF++)|5-shot (BLEU/chrF++)|
297
- |---|---|---|---|
298
- |NorMistral-7b-warm|**75.8**/**87.5**|74.0/**86.9**|75.3/87.5|
299
- |NorMistral-7b-scratch|38.0/56.9|39.2/57.9|40.7/59.3|
300
- |NorBLOOM-7b|71.5/84.4|70.1/84.1|71.9/85.1|
301
- |NB-GPT-J|6.6/35.5|9.6/41.0|26.0/64.7|
302
- |GPT-Sw3-6.7B|63.6/82.8|74.7/86.0|75.8/86.9|
303
- |GPT-Sw3-6.7B-v2|57.5/81.1|**75.3**/86.7|**76.7**/**87.6**|
304
- |Falcon-7B|28.7/59.2|29.8/60.8|32.1/62.3|
305
- |Mistral-7B-v0.1|32.0/62.2|32.9/62.6|35.2/63.9|
306
-
307
-
308
- </details>
309
-
310
- <details open>
311
- <summary>Norwegian Nynorsk → Norwegian Bokmål</summary>
312
-
313
- |Model|0-shot (BLEU/chrF++)|1-shot (BLEU/chrF++)|5-shot (BLEU/chrF++)|
314
- |---|---|---|---|
315
- |NorMistral-7b-warm|**88.1**/**93.6**|**89.2**/**94.3**|**89.3**/**94.6**|
316
- |NorMistral-7b-scratch|85.1/91.4|86.6/92.4|87.4/93.0|
317
- |NorBLOOM-7b|78.7/88.5|84.2/90.7|87.4/93.0|
318
- |NB-GPT-J|2.7/18.5|6.9/35.6|52.9/84.3|
319
- |GPT-Sw3-6.7B|652.3/82.4|86.1/92.5|87.8/93.6|
320
- |GPT-Sw3-6.7B-v2|72.0/88.6|86.1/92.5|88.2/93.9|
321
- |Falcon-7B|36.7/61.6|38.3/63.5|45.8/68.1|
322
- |Mistral-7B-v0.1|57.0/74.8|59.9/77.5|62.6/79.1|
323
-
324
- </details>
325
-
326
-
327
-
328
- _____
329
- ## Hardware and Software
330
-
331
- **Training Factors:** The models were pretrained using the Megatron-DeepSpeed library on [the LUMI cluster in Finland](https://lumi-supercomputer.eu/).
332
-
333
- **Carbon Footprint:** Pretraining one model took approximately 70k GPU-hours of computation on AMD MI250X GPUs (counting 2 GPUs per AMD MI250X device); each MI250X device draws 500 W.
334
- LUMI is [one of the most eco-efficient data centers in the world](https://www.lumi-supercomputer.eu/sustainable-future/), and its energy consumption is covered 100% with renewable electricity.
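As a rough back-of-the-envelope estimate of the corresponding energy use (assuming the 500 W figure is per MI250X device, i.e. roughly 250 W per GPU as counted above):

```python
# Back-of-the-envelope energy estimate from the figures above.
# Assumption: 500 W refers to one MI250X device, i.e. roughly 250 W per GPU as counted here.
gpu_hours = 70_000
watts_per_gpu = 500 / 2
energy_kwh = gpu_hours * watts_per_gpu / 1000
print(f"{energy_kwh:,.0f} kWh")   # ≈ 17,500 kWh (~17.5 MWh) per model, excluding cooling overhead
```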
335
 
 
336
 
337
 
338
- _____
339
- ## Example usage
340
 
341
- Let's try to use this model for English-to-Norwegian machine translation using simple zero-shot prompting:
 
342
 
343
  ```python
344
- import torch
- from transformers import AutoTokenizer, AutoModelForCausalLM
345
-
346
- # First, we will have to import the tokenizer and the language model
347
- tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-7b-warm")
348
- model = AutoModelForCausalLM.from_pretrained("norallm/normistral-7b-warm").cuda().eval()
349
-
350
- # Now we will define the zero-shot prompt template
351
- prompt = """Engelsk: {0}
352
- Bokmål:"""
353
-
354
- # A function that will take care of generating the output
355
- @torch.no_grad()
356
- def generate(text):
357
- text = prompt.format(text)
358
- input_ids = tokenizer(text, return_tensors='pt').input_ids.cuda()
359
- prediction = model.generate(
360
- input_ids,
361
- max_new_tokens=64,
362
- do_sample=False,
363
- eos_token_id=tokenizer('\n').input_ids
364
- )
365
- return tokenizer.decode(prediction[0, input_ids.size(1):]).strip()
366
-
367
- # Now you can simply call the generate function with an English text you want to translate:
368
- generate("I'm super excited about this Norwegian NORA model! Can it translate these sentences?")
369
- # > this should output: 'Jeg er super spent på denne norske NORA modellen! Kan den oversette disse setningene?'
370
  ```
371
 
372
- ## Example usage on a GPU with ~16GB VRAM (try for yourself [in Google Colab](https://colab.research.google.com/drive/1AQgJ8lN-SNOqkUKj4xpQI5rr0R7V2Xzy?usp=sharing))
373
- Install `bitsandbytes` and `accelerate` if you want to load the model in 8-bit:
374
 
375
- ```bash
376
- pip install bitsandbytes
377
- pip install accelerate
378
- ```
379
 
 
380
 
381
- ```python
382
- import torch
383
- from transformers import AutoTokenizer, AutoModelForCausalLM
384
 
385
- tokenizer = AutoTokenizer.from_pretrained(
386
- "norallm/normistral-7b-warm"
387
- )
388
 
389
- # This setup needs about 8 GB of VRAM
390
- # Setting `load_in_8bit=False` -> 15 GB of VRAM
391
- # Using `torch.float32` and `load_in_8bit=False` -> 21 GB of VRAM
392
- model = AutoModelForCausalLM.from_pretrained(
393
- "norallm/normistral-7b-warm",
394
- device_map='auto',
395
- load_in_8bit=True,
396
- torch_dtype=torch.bfloat16
397
- )
398
- ```
 
42
+ ## Prompt format
 
43
 
44
+ NorMistral uses a ChatML-like format for structuring (multi-turn) conversations. A prompt in this format looks like the following (note the special `<|im_start|>` and `<|im_end|>` tokens):
45
 
46
+ ```
47
+ <|im_start|> user
48
+ Hva er hovedstaden i Norge?<|im_end|>
49
+ <|im_start|> assistant
50
+ Hovedstaden i Norge er Oslo. Denne byen ligger i den sørøstlige delen av landet, ved Oslofjorden. Oslo er en av de raskest voksende byene i Europa, og den er kjent for sin rike historie, kultur og moderne arkitektur. Noen populære turistattraksjoner i Oslo inkluderer Vigelandsparken, som viser mer enn 200 skulpturer laget av den berømte norske skulptøren Gustav Vigeland, og det kongelige slott, som er den offisielle residensen til Norges kongefamilie. Oslo er også hjemsted for mange museer, gallerier og teatre, samt mange restauranter og barer som tilbyr et bredt utvalg av kulinariske og kulturelle opplevelser.<|im_end|>
51
+ <|im_start|> user
52
+ Gi meg en liste over de beste stedene å besøke i hovedstaden<|im_end|>
53
+ <|im_start|> assistant
54
 
55
+ ```
 
56
 
57
+ This prompt format is available as a [chat template](https://huggingface.co/docs/transformers/main/chat_templating) in NorMistral's tokenizer, so it can be easily applied by calling the
58
+ `tokenizer.apply_chat_template()` method:
59
 
60
  ```python
61
+ messages = [
+     {"role": "user", "content": "Hva er hovedstaden i Norge?"},
+     {"role": "assistant", "content": "Hovedstaden i Norge er Oslo. Denne byen ligger i den sørøstlige delen av landet, ved Oslofjorden. Oslo er en av de raskest voksende byene i Europa, og den er kjent for sin rike historie, kultur og moderne arkitektur. Noen populære turistattraksjoner i Oslo inkluderer Vigelandsparken, som viser mer enn 200 skulpturer laget av den berømte norske skulptøren Gustav Vigeland, og det kongelige slott, som er den offisielle residensen til Norges kongefamilie. Oslo er også hjemsted for mange museer, gallerier og teatre, samt mange restauranter og barer som tilbyr et bredt utvalg av kulinariske og kulturelle opplevelser."},
+     {"role": "user", "content": "Gi meg en liste over de beste stedene å besøke i hovedstaden"}
+ ]
+ # Without `return_dict`, apply_chat_template with return_tensors="pt" returns the tokenized prompt as a tensor of input ids
+ gen_input = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
+ model.generate(gen_input.to(model.device), max_new_tokens=256)
68
  ```
69
 
70
+ When tokenizing messages for generation, set `add_generation_prompt=True` when calling `apply_chat_template()`. This will append `<|im_start|>assistant\n` to your prompt, to ensure
71
+ that the model continues with an assistant response.
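Putting it together, here is a minimal end-to-end sketch. The repository id below is a placeholder for this model's id, and the loading and generation settings (dtype, `max_new_tokens`, greedy decoding) are illustrative choices, not values specified in this card.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# "<this-model-repo-id>" is a placeholder; replace it with the id of this model repository.
model_id = "<this-model-repo-id>"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).cuda().eval()

messages = [{"role": "user", "content": "Hva er hovedstaden i Norge?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
# Decode only the newly generated assistant tokens
print(tokenizer.decode(output[0, input_ids.size(1):], skip_special_tokens=True))
```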
72
 
73
 
74
+ ## About the base model
75
 
76
+ NorMistral-7b-warm is a large Norwegian language model initialized from [Mistral-7b-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) and
77
+ continuously pretrained on a total of 260 billion subword tokens (using six repetitions of open Norwegian texts).
 
78
 
79
+ This model is a part of the NORA.LLM family developed in collaboration between [the Language Technology Group at the University of Oslo](https://huggingface.co/ltg), [the High Performance Language Technologies (HPLT) project](https://hplt-project.org/), [the National Library of Norway](https://huggingface.co/NbAiLab), and [the University of Turku](https://huggingface.co/TurkuNLP).
80
+ All the models are pre-trained on the same dataset and with the same tokenizer.
81
+ NorMistral-7b-warm has over 7 billion parameters and is based on [the Mistral architecture](https://huggingface.co/mistralai/Mistral-7B-v0.1).
82
 
83
+ The NORA.LLM language model family includes (as of now):
84
+ - [**NorMistral-7b-warm**](https://huggingface.co/norallm/normistral-7b-warm) -- an LLM initialized from [Mistral-7b-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) and continuously pretrained on Norwegian data;
85
+ - [**NorMistral-7b-scratch**](https://huggingface.co/norallm/normistral-7b-scratch) -- a Mistral-based LLM pretrained from scratch on Norwegian data;
86
+ - [**NorBLOOM-7b-scratch**](https://huggingface.co/norallm/NorBLOOM-7b-scratch) -- a BLOOM-based LLM pretrained from scratch on Norwegian data.