initial commit
- README.md +97 -0
- config.json +25 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +1 -0
- tokenizer_config.json +1 -0
- vocab.txt +0 -0
README.md
ADDED
@@ -0,0 +1,97 @@
---
language: "da"
tags:
- ælæctra
- pytorch
- danish
- ELECTRA-Small
- replaced token detection
license: "mit"
datasets:
- DAGW
metrics:
- f1
---

# Ælæctra - A Step Towards More Efficient Danish Natural Language Processing
**Ælæctra** is a Danish Transformer-based language model created to broaden the range of Danish NLP resources with a model that is more efficient than previous state-of-the-art (SOTA) models. Initially, a cased and an uncased model are released. It was created as part of a Cognitive Science bachelor's thesis.

Ælæctra was pretrained with the ELECTRA-Small (Clark et al., 2020) pretraining approach on the Danish Gigaword Corpus (Strømberg-Derczynski et al., 2020) and evaluated on Named Entity Recognition (NER) tasks. Since NER only presents a limited picture of Ælæctra's capabilities, I am very interested in further evaluations. Therefore, if you employ it for any task, feel free to hit me up with your findings!

Ælæctra was, as mentioned, created to enhance Danish NLP capabilities. Please note that GitHub still does not support the Danish characters "*Æ, Ø and Å*", so the title of this repository becomes "*-l-ctra*". How ironic.🙂

Here is an example of how to load both the cased and the uncased Ælæctra model in [PyTorch](https://pytorch.org/) using the [🤗Transformers](https://github.com/huggingface/transformers) library:

```python
from transformers import AutoTokenizer, AutoModelForPreTraining

tokenizer = AutoTokenizer.from_pretrained("Maltehb/-l-ctra-cased")
model = AutoModelForPreTraining.from_pretrained("Maltehb/-l-ctra-cased")
```

```python
from transformers import AutoTokenizer, AutoModelForPreTraining

tokenizer = AutoTokenizer.from_pretrained("Maltehb/-l-ctra-uncased")
model = AutoModelForPreTraining.from_pretrained("Maltehb/-l-ctra-uncased")
```

### Evaluation of current Danish Language Models

Ælæctra, Danish BERT (DaBERT) and multilingual BERT (mBERT) were evaluated:

| Model | Layers | Hidden Size | Params | AVG NER micro-F1 (DaNE test set) | Average Inference Time (Sec/Epoch) | Download |
| --- | --- | --- | --- | --- | --- | --- |
| Ælæctra Uncased | 12 | 256 | 13.7M | 78.03 (SD = 1.28) | 10.91 | [Link for model](https://www.dropbox.com/s/cag7prs1nvdchqs/%C3%86l%C3%A6ctra.zip?dl=0) |
| Ælæctra Cased | 12 | 256 | 14.7M | 80.08 (SD = 0.26) | 10.92 | [Link for model](https://www.dropbox.com/s/cag7prs1nvdchqs/%C3%86l%C3%A6ctra.zip?dl=0) |
| DaBERT | 12 | 768 | 110M | 84.89 (SD = 0.64) | 43.03 | [Link for model](https://www.dropbox.com/s/19cjaoqvv2jicq9/danish_bert_uncased_v2.zip?dl=1) |
| mBERT Uncased | 12 | 768 | 167M | 80.44 (SD = 0.82) | 72.10 | [Link for model](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip) |
| mBERT Cased | 12 | 768 | 177M | 83.79 (SD = 0.91) | 70.56 | [Link for model](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip) |

On [DaNE](https://danlp.alexandra.dk/304bd159d5de/datasets/ddt.zip) (Hvingelby et al., 2020), Ælæctra scores slightly worse than both cased and uncased Multilingual BERT (Devlin et al., 2019) and Danish BERT (Danish BERT, 2019/2020); however, it is more than 3 times faster per batch at inference time. For a full description of the evaluation and specification of the model, read the thesis: 'Ælæctra - A Step Towards More Efficient Danish Natural Language Processing'.
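
The speed comparison can be re-derived from the inference times in the table above. The following is a minimal sketch that only divides the reported numbers; the dictionary and the `speedup` helper are illustrative names and are not part of the thesis code.

```python
# Average inference times (sec/epoch), copied from the evaluation table above.
INFERENCE_TIME = {
    "Ælæctra Uncased": 10.91,
    "Ælæctra Cased": 10.92,
    "DaBERT": 43.03,
    "mBERT Uncased": 72.10,
    "mBERT Cased": 70.56,
}


def speedup(baseline: str, model: str) -> float:
    """Return how many times faster `model` is than `baseline`."""
    return INFERENCE_TIME[baseline] / INFERENCE_TIME[model]


if __name__ == "__main__":
    for baseline in ("DaBERT", "mBERT Uncased", "mBERT Cased"):
        ratio = speedup(baseline, "Ælæctra Cased")
        print(f"Ælæctra Cased vs {baseline}: {ratio:.2f}x faster")
```

With the reported numbers, the ratio against DaBERT comes out just under 4 and the ratios against both mBERT variants are above 6, which is consistent with the "more than 3 times faster" statement.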

### Pretraining
To pretrain Ælæctra, it is recommended to build a Docker container from the [Dockerfile](https://github.com/MalteHB/Ælæctra/tree/master/notebooks/fine-tuning/). Next, simply follow the [pretraining notebooks](https://github.com/MalteHB/Ælæctra/tree/master/infrastructure/Dockerfile/).

The pretraining was done on a single NVIDIA Tesla V100 GPU with 16 GiB of memory, provided by the Danish data company [KMD](https://www.kmd.dk/). The pretraining took approximately 4 days and 9.5 hours for both the cased and the uncased model.

### Fine-tuning
To fine-tune any Ælæctra model, follow the [fine-tuning notebooks](https://github.com/MalteHB/Ælæctra/tree/master/notebooks/fine-tuning/).

### References
Clark, K., Luong, M.-T., Le, Q. V., & Manning, C. D. (2020). ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. arXiv:2003.10555 [cs]. http://arxiv.org/abs/2003.10555

Danish BERT. (2020). BotXO. https://github.com/botxo/nordic_bert (Original work published 2019)

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs]. http://arxiv.org/abs/1810.04805

Hvingelby, R., Pauli, A. B., Barrett, M., Rosted, C., Lidegaard, L. M., & Søgaard, A. (2020). DaNE: A Named Entity Resource for Danish. Proceedings of the 12th Language Resources and Evaluation Conference, 4597–4604. https://www.aclweb.org/anthology/2020.lrec-1.565

Strømberg-Derczynski, L., Baglini, R., Christiansen, M. H., Ciosici, M. R., Dalsgaard, J. A., Fusaroli, R., Henrichsen, P. J., Hvingelby, R., Kirkedal, A., Kjeldsen, A. S., Ladefoged, C., Nielsen, F. Å., Petersen, M. L., Rystrøm, J. H., & Varab, D. (2020). The Danish Gigaword Project. arXiv:2005.03521 [cs]. http://arxiv.org/abs/2005.03521

#### Acknowledgements
As the majority of this repository is built upon [the work](https://github.com/google-research/electra) of the team at Google who created ELECTRA, a HUGE thanks to them is in order.

A Giga thanks also goes out to the incredible people who collected The Danish Gigaword Corpus (Strømberg-Derczynski et al., 2020).

Furthermore, I would like to thank my supervisor [Riccardo Fusaroli](https://github.com/fusaroli) for the support with the thesis, and a special thanks goes out to [Kenneth Enevoldsen](https://github.com/KennethEnevoldsen) for his continuous feedback.

Lastly, I would like to thank KMD, my colleagues from KMD, and my peers and co-students from Cognitive Science for encouraging me to keep on working hard and holding my head up high!

#### Contact

For help or further information feel free to connect with the author Malte Højmark-Bertelsen on [hjb@kmd.dk](mailto:hjb@kmd.dk?subject=[GitHub]%20Ælæctra) or any of the following platforms:

[<img align="left" alt="MalteHB | Twitter" width="22px" src="https://cdn.jsdelivr.net/npm/simple-icons@v3/icons/twitter.svg" />][twitter]
[<img align="left" alt="MalteHB | LinkedIn" width="22px" src="https://cdn.jsdelivr.net/npm/simple-icons@v3/icons/linkedin.svg" />][linkedin]
[<img align="left" alt="MalteHB | Instagram" width="22px" src="https://cdn.jsdelivr.net/npm/simple-icons@v3/icons/instagram.svg" />][instagram]

<br />

</details>

[twitter]: https://twitter.com/malteH_B
[instagram]: https://www.instagram.com/maltemusen/
[linkedin]: https://www.linkedin.com/in/malte-h%C3%B8jmark-bertelsen-9a618017b/
config.json
ADDED
@@ -0,0 +1,25 @@
{
  "architectures": [
    "ElectraForPreTraining"
  ],
  "attention_probs_dropout_prob": 0.1,
  "embedding_size": 128,
  "generator_size": "0.25",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 256,
  "initializer_range": 0.02,
  "intermediate_size": 1024,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "electra",
  "num_attention_heads": 4,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "summary_activation": "gelu",
  "summary_last_dropout": 0.1,
  "summary_type": "first",
  "summary_use_proj": true,
  "type_vocab_size": 2,
  "vocab_size": 32000
}
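
A few quantities follow directly from the values in this config. The sketch below is a hedged illustration that parses a subset of the file with the standard library and derives them; `derived_shapes` and its output keys are hypothetical names, and converting the string-valued `generator_size` to a float is an assumption about how that field is meant to be read.

```python
import json

# Subset of the config.json shown above, copied verbatim.
CONFIG_JSON = """
{
  "embedding_size": 128,
  "generator_size": "0.25",
  "hidden_size": 256,
  "intermediate_size": 1024,
  "num_attention_heads": 4,
  "num_hidden_layers": 12,
  "vocab_size": 32000
}
"""


def derived_shapes(config: dict) -> dict:
    """Derive per-head and embedding sizes from the raw config values."""
    hidden = config["hidden_size"]
    heads = config["num_attention_heads"]
    if hidden % heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    return {
        # Width of each attention head.
        "head_dim": hidden // heads,
        # Ratio of the feed-forward width to the hidden width.
        "ffn_expansion": config["intermediate_size"] / hidden,
        # Number of weights in the token embedding matrix alone.
        "token_embedding_params": config["vocab_size"] * config["embedding_size"],
        # Stored as a string in the file; converted here (assumption).
        "generator_fraction": float(config["generator_size"]),
    }


if __name__ == "__main__":
    print(derived_shapes(json.loads(CONFIG_JSON)))
```

The token embedding matrix uses the smaller `embedding_size` (128) rather than `hidden_size` (256), which keeps the vocabulary-dependent part of the model comparatively small.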
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e0955ca630cff4aa08e2bb22a36f1fe0cf37a81922ae96f5d6594429bd502180
size 57979406
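
The three lines above are a Git LFS pointer, not the weights themselves: they record the hash algorithm, the digest, and the byte size of the real file. As a minimal sketch, assuming the weights have been downloaded separately, a pointer like this can be parsed and used to check a local file; `parse_lfs_pointer` and `matches_pointer` are illustrative helper names, not part of any Git LFS tooling.

```python
import hashlib
from pathlib import Path

# Pointer contents copied from the diff above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:e0955ca630cff4aa08e2bb22a36f1fe0cf37a81922ae96f5d6594429bd502180
size 57979406
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line and unpack the 'algo:digest' oid field."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: Path, pointer: dict) -> bool:
    """Check a local file against the size and digest recorded in the pointer."""
    if path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so large weight files are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


if __name__ == "__main__":
    print(parse_lfs_pointer(POINTER_TEXT))
```

Comparing the size first is a cheap way to reject a truncated download before hashing the whole file.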
special_tokens_map.json
ADDED
@@ -0,0 +1 @@
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer_config.json
ADDED
@@ -0,0 +1 @@
{"do_lower_case": false, "special_tokens_map_file": null, "full_tokenizer_file": null}
vocab.txt
ADDED
The diff for this file is too large to render.
See raw diff