---
language: da
widget:
- text: En trend, der kan blive ligeså hot som<mask>.
tags:
- roberta
- danish
- masked-lm
- pytorch
license: agpl-3.0
---

# DanskBERT

This is DanskBERT, a Danish language model. Note that you should not put a space before the `<mask>` token when querying the model directly!
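For example, here is a minimal fill-mask sketch with the Hugging Face `transformers` pipeline, reusing the widget sentence from the metadata above. The model ID `vesteinn/DanskBERT` is an assumption, since this card does not state the repository ID; substitute the correct one if it differs.

```python
# Minimal fill-mask sketch; the model ID "vesteinn/DanskBERT" is an assumption.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="vesteinn/DanskBERT")

# Note: no space before <mask>, per the usage note above.
for pred in fill_mask("En trend, der kan blive ligeså hot som<mask>."):
    print(f"{pred['score']:.3f}  {pred['sequence']}")
```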

The model is the best-performing base-size model on the [ScandEval benchmark for Danish](https://scandeval.github.io/nlu-benchmark/).
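To reproduce the evaluation, ScandEval ships a Python package with a `Benchmarker` entry point. The sketch below assumes the `scandeval` package's documented interface and the same hypothetical model ID as above; check the ScandEval documentation for the current API.

```python
# Sketch: running the Danish ScandEval suite against the model.
# Both the Benchmarker arguments and the model ID are assumptions.
from scandeval import Benchmarker

benchmark = Benchmarker(language="da")  # restrict to Danish tasks
benchmark("vesteinn/DanskBERT")
```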

DanskBERT was trained on the Danish Gigaword Corpus (Strømberg-Derczynski et al., 2021).

DanskBERT was trained with fairseq using the RoBERTa-base configuration. It was trained to convergence over 500k steps with a batch size of 2k, on 16 V100 cards for approximately two weeks.

If you find this model useful, please cite:

```
@inproceedings{snaebjarnarson-etal-2023-transfer,
    title = "{T}ransfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese",
    author = "Snæbjarnarson, Vésteinn  and
      Simonsen, Annika  and
      Glavaš, Goran  and
      Vulić, Ivan",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = "may 22--24",
    year = "2023",
    address = "Tórshavn, Faroe Islands",
    publisher = {Link{\"o}ping University Electronic Press, Sweden},
}
```