Model Card for the Danoliterate Mistral 7B Model
A base model fine-tuned from Mistral 7B on a combination of Danish datasets for 20K updates (655M tokens).
Model Details
Model Description
This model is a test artifact from the thesis Are GLLMs Danoliterate? Benchmarking Generative NLP in Danish, with relevant details in Sections 4.1, 5.1, and 6.1.
- Developed by: Søren Vejlgaard Holm under the supervision of Lars Kai Hansen and Martin Carsten Nielsen.
- Model type: Base, autoregressive LLM with Mistral 7B architecture.
- Language(s) (NLP): Danish
- License: MIT
Uses
This model is strictly a research artifact for investigating the effect of continuing the pretraining of an existing model on Danish data and is not intended to be applied directly.
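For research use, the model can be loaded like any other causal LM on the Hugging Face Hub. A minimal sketch with the transformers library, assuming it is installed; the repository id is taken from this page, and the prompt and generation settings are illustrative assumptions, not recommendations from the thesis:

```python
# Hedged sketch: loading the Danoliterate Mistral model with transformers.
# Downloading the weights (~14 GB) happens inside the function, so nothing
# heavy runs at import time.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "sorenmulli/dano-mistral-7b-0.1"

def generate_danish(prompt: str, max_new_tokens: int = 50) -> str:
    """Download the model and continue the given Danish prompt."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example call (requires the weights and a suitable GPU/CPU budget):
# generate_danish("Danmark er et land, der")
```

As a base model with no instruction tuning, it only continues text, so prompts should be phrased as passages to complete rather than questions or instructions.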
Bias, Risks, and Limitations
The model has been trained on a large corpus of uncurated internet content and can thus possibly generate problematic content.
Training Details
Training Data
The pretraining mix contained the Danish Gigaword and Danish Reddit corpora as compiled by the Danish Data Science Community, as well as the Danish subset of CulturaX. For more details, see Section 4.1 in the thesis.
Training Procedure
See Sections 5.1 and 6.1 in the thesis.
Evaluation
On the Danoliterate LLM Benchmark, this model achieves an index score of 24 as of June 2024.
Model Card Contact
Contact Søren Vejlgaard Holm at swiho@dtu.dk or swh@alvenir.ai.
Model: sorenmulli/dano-mistral-7b-0.1
Base model: mistralai/Mistral-7B-v0.1