---
language:
- 'no'
- nb
- nn
inference: true
tags:
- mistral
- norwegian
- instruction
- chat
license: apache-2.0
pipeline_tag: text-generation
datasets:
- CohereForAI/aya_dataset
- OpenAssistant/oasst1
- OpenAssistant/oasst2
- laion/OIG
- HuggingFaceH4/no_robots
- databricks/databricks-dolly-15k
- glaiveai/glaive-code-assistant-v2
---
# Instruction-tuned NorMistral-7b-warm
This model is instruction-tuned on open datasets released under the permissive Apache 2.0 license. We can therefore release the model under the same license, making it openly available for commercial applications.
The released weights are still a work in progress and may change in the future. This is the first iteration of instruction-tuning our NorMistral models, and it currently uses only the SFT phase without any preference optimization. Please share your feedback so we can improve the model in future releases.
## Finetuning corpus
The corpus was compiled as follows:
- We gathered all openly available datasets: Aya, OASST 1, OASST 2, OIG-small-chip2, No Robots, Dolly and Glaive code assistant.
- These were first manually inspected and filtered, and then automatically filtered with Mixtral-8x7B to remove incorrect, offensive, non-English and American-centric responses.
- The responses were augmented to be more descriptive by Mixtral-8x7B.
- Since most of that dataset contains only a single dialogue turn, we generated more turns using Mixtral-8x7B.
- Finally, we translated the resulting dataset into Bokmål and Nynorsk using NorMistral-7b-warm.
## Prompt format
NorMistral uses a ChatML-like format for structuring (multi-turn) conversations. An example prompt in this format looks like the following (note the special `<|im_start|>` and `<|im_end|>` tokens):
```
<|im_start|> user
Hva er hovedstaden i Norge?<|im_end|>
<|im_start|> assistant
Hovedstaden i Norge er Oslo. Denne byen ligger i den sørøstlige delen av landet, ved Oslofjorden. Oslo er en av de raskest voksende byene i Europa, og den er kjent for sin rike historie, kultur og moderne arkitektur. Noen populære turistattraksjoner i Oslo inkluderer Vigelandsparken, som viser mer enn 200 skulpturer laget av den berømte norske skulptøren Gustav Vigeland, og det kongelige slott, som er den offisielle residensen til Norges kongefamilie. Oslo er også hjemsted for mange museer, gallerier og teatre, samt mange restauranter og barer som tilbyr et bredt utvalg av kulinariske og kulturelle opplevelser.<|im_end|>
<|im_start|> user
Gi meg en liste over de beste stedene å besøke i hovedstaden<|im_end|>
<|im_start|> assistant
```
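For illustration, the format above can be reproduced with a small helper that joins the turns with the special tokens. This is only a sketch for understanding the template, not a replacement for the tokenizer's built-in chat template; the spacing after `<|im_start|>` follows the example above, and the function name is our own.

```python
def build_prompt(messages, add_generation_prompt=True):
    """Format a list of {"role", "content"} dicts in the ChatML-like style shown above."""
    parts = []
    for m in messages:
        # Each turn is wrapped in <|im_start|> ... <|im_end|> with its role name.
        parts.append(f"<|im_start|> {m['role']}\n{m['content']}<|im_end|>")
    prompt = "\n".join(parts)
    if add_generation_prompt:
        # Cue the model to continue with an assistant response.
        prompt += "\n<|im_start|> assistant\n"
    return prompt

print(build_prompt([{"role": "user", "content": "Hva er hovedstaden i Norge?"}]))
```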
This prompt format is available as a chat template in NorMistral's tokenizer, so it can easily be applied by calling the `tokenizer.apply_chat_template()` method:
```python
messages = [
    {"role": "user", "content": "Hva er hovedstaden i Norge?"},
    {"role": "assistant", "content": "Hovedstaden i Norge er Oslo. Denne byen ligger i den sørøstlige delen av landet, ved Oslofjorden. Oslo er en av de raskest voksende byene i Europa, og den er kjent for sin rike historie, kultur og moderne arkitektur. Noen populære turistattraksjoner i Oslo inkluderer Vigelandsparken, som viser mer enn 200 skulpturer laget av den berømte norske skulptøren Gustav Vigeland, og det kongelige slott, som er den offisielle residensen til Norges kongefamilie. Oslo er også hjemsted for mange museer, gallerier og teatre, samt mange restauranter og barer som tilbyr et bredt utvalg av kulinariske og kulturelle opplevelser."},
    {"role": "user", "content": "Gi meg en liste over de beste stedene å besøke i hovedstaden"}
]
gen_input = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
model.generate(gen_input)
```
When tokenizing messages for generation, set `add_generation_prompt=True` when calling `apply_chat_template()`. This appends `<|im_start|> assistant\n` to your prompt, ensuring that the model continues with an assistant response.
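Because the model generates free-running text, the decoded output may run past the end of the assistant's turn. A minimal post-processing sketch that trims the reply at the first end-of-turn token (the helper name is our own; the token names follow the format above):

```python
def extract_assistant_reply(decoded: str) -> str:
    """Keep only the assistant's reply from decoded model output."""
    # Cut at the first end-of-turn marker, if present, and strip whitespace.
    reply = decoded.split("<|im_end|>", 1)[0]
    return reply.strip()

raw = "Hovedstaden i Norge er Oslo.<|im_end|>\n<|im_start|> user"
print(extract_assistant_reply(raw))
```

Alternatively, passing `skip_special_tokens=True` to `tokenizer.decode()` removes the special tokens, though it does not stop the text at the turn boundary.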
## About the base model
NorMistral-7b-warm is a large Norwegian language model initialized from Mistral-7b-v0.1 and continuously pretrained on a total of 260 billion subword tokens (using six repetitions of open Norwegian texts).
This model is a part of the NORA.LLM family developed in collaboration between the Language Technology Group at the University of Oslo, the High Performance Language Technologies (HPLT) project, the National Library of Norway, and the University of Turku. All the models are pre-trained on the same dataset and with the same tokenizer. NorMistral-7b-warm has over 7 billion parameters and is based on the Mistral architecture.
The NORA.LLM language model family currently includes:
- NorMistral-7b-warm -- an LLM initialized from Mistral-7b-v0.1 and continuously pretrained on Norwegian data;
- NorMistral-7b-scratch -- a Mistral-based LLM pretrained from scratch on Norwegian data;
- NorBLOOM-7b-scratch -- a BLOOM-based LLM pretrained from scratch on Norwegian data.